• Open

Hyperrealistic Pepe the Frog, created by Midjourney's AI
    submitted by /u/ExtensionVirtual471 [link] [comments]  ( 86 min )
Terminator in photorealism, rendered by Midjourney's AI
    submitted by /u/ExtensionVirtual471 [link] [comments]  ( 86 min )

  • Open

    Awesome...
    submitted by /u/the_anonymizer [link] [comments]  ( 85 min )
MIT Researchers Develop EquiBind: A Geometric Deep Learning Model That Is One of the Fastest Computational Molecular Docking Models
There is no denying the importance of new treatments after experiencing one of the worst pandemics, Covid-19. Due to new diseases, medication resistance, and a growing understanding of medical issues, drug discovery now makes previously incurable disorders treatable. There are over 1,000,000 possible drug-like molecules, and with the existing system it is difficult to experiment on each of them. The approval procedure required before drugs can be utilised is one of the obstacles to developing new drugs. It typically involves a lengthy process lasting up to ten years and costs about 2.5 billion dollars. Additionally, this approach is subject to failure at any time due to unanticipated adverse effects or experimental findings that contradict the claimed therapeutic efficacy. ✅ EquiBind is 1,200 times faster than QuickVina2-W, one of the fastest existing computational molecular docking models, at successfully binding drug-like molecules to proteins ✅ EquiBind is based on its predecessor, EquiDock, which specializes in binding two proteins using a technique developed by the late Octavian-Eugen Ganea. ✅ Code on Github Continue reading | Check out the paper and github link submitted by /u/ai-lover [link] [comments]  ( 87 min )
    A project I saw this year that could label a website by URL
I thought it was really cool since it seemed to be crawling a site to identify it, e.g., LinkedIn would be labeled [Online Communities/Social Networks]. Does anyone remember seeing it on here? I've been looking for it for hours now. Thank you! submitted by /u/atieonfire [link] [comments]  ( 86 min )
What AI made this?
    submitted by /u/Noniax [link] [comments]  ( 86 min )
Hey guys, this is my take on a meme generator. I was tired of working on industry-level projects with use cases, so I tried something funny, and it turned out funny. Do check it out. Here is the [github link](https://github.com/Shreyz-max/Memes-Generator). Suggest some changes I can try.
    submitted by /u/Shreya001 [link] [comments]  ( 86 min )
    I Made a Robot that Slaps my Phone out of My Hand While Driving
    submitted by /u/_ayushp_ [link] [comments]  ( 86 min )
    Annotated Paper - Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models by FAIR
A detailed and insightful study by the MetaAI team on memorization, overfitting, and forgetting in LLMs. The paper discusses different definitions of "memorization" and how scaling affects the amount of training data that large language models can memorize during the training phase. Studies are also presented on what the forgetting curves look like and how overfitting relates to memorization for these large language models. The Appendix section is a gold mine as well. Annotated version of the paper - Github Link submitted by /u/shreyansh26 [link] [comments]  ( 86 min )
Steampunk city created purely by AI
    submitted by /u/ExtensionVirtual471 [link] [comments]  ( 86 min )
    A question I think we should ask AI.
How come I have not seen anyone ask any of these AI chatbots what things they know that we humans do not know yet? submitted by /u/TheRealDinkus [link] [comments]  ( 87 min )
    Researchers from China Propose DAT: a Deformable Vision Transformer to Compute Self-Attention in a Data-Aware Fashion
In recent years, the extension of transformers into the computer vision field has slowly made vision transformers (ViT) the state-of-the-art model for many tasks, such as object detection or image classification. The main reason is their larger receptive field and their ability to model long-term relationships compared to their historical counterparts, CNNs. Nevertheless, there are still some drawbacks. The mechanism of ViT relies on three main matrices, Query (Q), Key (K), and Value (V). These matrices are used to compute self-attention between the different tokens. In the original paper, the image is split into patches used as tokens. To compute self-attention for the first patch, its associated Q is compared with the K/V of all the other tokens. In addition, in multi-head attention, multiple sets of matrices build different representations. With this technique, each patch is associated with a huge number of matrices, bringing high computational costs and a risk of overfitting. Continue reading | Check out the paper and github link. submitted by /u/ai-lover [link] [comments]  ( 87 min )
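For readers who want the mechanics, here is a minimal sketch of the plain Q/K/V self-attention described above (vanilla scaled dot-product attention over patch tokens; DAT's deformable variant changes where K/V are sampled, which this sketch does not show):

```python
# Vanilla scaled dot-product self-attention over patch tokens (a sketch,
# not DAT itself): every query is compared against all keys, and the
# resulting weights mix the values.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n_tokens, d) patch embeddings; Wq/Wk/Wv: (d, d_head) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # compare each Q with all Ks
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V                               # attention-weighted values

# Multi-head attention repeats this with several independent W sets,
# which is where the matrix count (and cost) grows.
```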
    Industrial AI
Hello guys, I have a question. Is it possible to create an app for machine self-diagnosis? For example, if some input sensor fails in the program sequence (a press-up command is sent but the program is still missing the input from the sensor), it would throw an error saying the program is waiting for this input sensor before continuing to the next step in the sequence. submitted by /u/Weary_Expression_225 [link] [comments]  ( 86 min )
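Yes, this kind of supervision watchdog is routinely built. A minimal sketch of the idea (illustrative only; not tied to any PLC vendor or library, and the sensor/step names are made up):

```python
# Each sequence step declares which input it is waiting on; if that input
# doesn't arrive within a timeout, raise a diagnostic error naming it.
import time

class SensorTimeout(Exception):
    pass

def wait_for_input(read_sensor, sensor_name, timeout_s=5.0, poll_s=0.05):
    """read_sensor: callable polling a digital input, returns True when set."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_sensor():
            return
        time.sleep(poll_s)
    raise SensorTimeout(f"sequence stalled: still waiting on input '{sensor_name}'")

# e.g. after sending the press-up command (hypothetical plc object):
# wait_for_input(lambda: plc.read("press_top_limit"), "press_top_limit")
```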
    If AI is superior, why bother?
Now, I'm not gonna pretend that I know all there is to know about AI or what it'll become. That's why I'm here. And I can understand that this seems more like a philosophical question, but I'd like to hear from the people who side with AI. Why bother? AI is reaching a point where it can do things that humans do, but better and faster. "AI does chess better than the world's greatest chess player", "AI detects things in seconds where it takes humans months", "AI can have better sex with your wife than you can". That last one was just for comedic reasons, but you get my point. I get the benefits of AI: fast problem-solving, accurate calculations, and other aspects that can benefit humanity. However, I feel as though it also defeats any point for humanity to continue. So, I'm an artist. I take what's on my mind and make it visual for everyone else. Of course, it is a time-consuming process, but now AI programs like DALL-E 2 can do what I do, both better and faster, with several results. It'll only be a matter of time until it grabs ahold of the animation field, of which I am a part. If AI can do it without much effort or time, why should I even pick up the pen? It kind of makes me feel like developing AI gives us a reason to downplay humanity and consider it replicable in literally every field and category we work in. Another thought is life after an AI takeover. Say making money and working is no longer an obligation, and with AI you can do everything you've ever wanted in a very short amount of time. I'll admit, it does sound nice, but I also feel like it could lead to a very boring existence afterwards. I'm not saying "AI bad cause human now obsolete", despite this whole post sounding like the last three words of that quote. I just feel like AI destroys any reason to do anything. submitted by /u/ihavenogoodnameatm [link] [comments]  ( 91 min )
Skunkworks Speculation: Let's see them aliens
Kidding about aliens. But can anyone give a good-faith attempt at some not-so-conspiratorial theorising as to what is in the skunkworks of the AI leaders? What kind of wild scifi shit currently exists but is so epic that few know about it? Maybe not even one Project X, but several. I just know the tech we see is not the leading razor's edge, and I'm super curious what wonders exist beyond the veil. If anyone has any thoughts, I'd be thankful. submitted by /u/Overall-Importance54 [link] [comments]  ( 86 min )
    Have fun and learn AI at the Reinforcement Learning Hackathon on July 23rd!
One day to immerse yourself in technology that is a first for companies and engineers around the world! To help you begin your immersion in AI as effectively as possible, we've prepared experts to assist you all the way. There will also be a competitive component: the winners will receive worthy prizes that will help them apply advanced technologies to their own projects. So come join us and learn everything you need to know about RL! Register here: Reinforcement Learning OpenAI Gym Hackathon submitted by /u/zakrzzz [link] [comments]  ( 86 min )
    Artificial intelligence model finds potential drug molecules a thousand times faster | MIT News | Massachusetts Institute of Technology
    submitted by /u/greentea387 [link] [comments]  ( 86 min )
AI-generated images of Da Vinci blueprints
    submitted by /u/QuillTheBoreal [link] [comments]  ( 86 min )
How many years away are we from AI creating tailor-made video games?
Just curious. When the era of AI-made video games arrives, I keep thinking it will bring problems in society where people don't socialize as much anymore due to having these games made just for them. Ngl, I look forward to such games myself too lol. submitted by /u/Bitterowner [link] [comments]  ( 86 min )
    Emergent behavior in AI models that looks similar to natural neural systems?
    "ImageNet Classification with Deep Convolutional Neural Networks" by Krizhevsky & Sutskever & Hinton describes very interesting emergent behavior of the AlexNet. It was trained on 2 GPU's: specialization exhibited by the two GPUs ... The kernels on GPU 1 are largely color-agnostic, while the kernels on on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization Likewise our brain mostly processes color with left side of the brain. Are there other examples of emergent behavior in AI models that looks similar to natural neural systems? Any kind from coordination of several neurons to high-level function, useful or detrimental, like optical illusions? So far I found only some articles with optical illusion examples. submitted by /u/vashu11 [link] [comments]  ( 86 min )
    Providing embedded artificial intelligence with a capacity for palimpsest memory storage
    submitted by /u/jormungandrsjig [link] [comments]  ( 86 min )
  • Open

    Continuous and Discrete actions in the same environment
I'm creating a sort of "flight simulator" environment in OpenAI Gym and I want the planes to be able to choose the angle at which they turn. However, the actions right now are turn right, turn left, forward, and shoot. With a continuous action space, I assume I would just need the actions turn and shoot, where turn is a float representing the amount to turn. How can I define an action space like that using gym spaces? My only guess would be to use a Dict with a Box of shape (1,) of type float (for turning) and a Box of shape (1,) of type int (for shooting). Would that work? I honestly have no idea. submitted by /u/WilliamFlinchbaugh [link] [comments]  ( 87 min )
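A minimal sketch of one way such a space could be declared (assuming the standard gym.spaces API; the turn bounds are illustrative, and a Discrete(2) flag is more idiomatic for shooting than an integer Box):

```python
import numpy as np
from gym import spaces

action_space = spaces.Dict({
    # turn amount per step, e.g. clamped to [-pi/4, pi/4] radians
    "turn": spaces.Box(low=-np.pi / 4, high=np.pi / 4, shape=(1,), dtype=np.float32),
    # 0 = don't shoot, 1 = shoot
    "shoot": spaces.Discrete(2),
})

sample = action_space.sample()  # e.g. {"turn": array([0.12]), "shoot": 1}
```

One caveat: many RL algorithm implementations only support a single Box or Discrete space, so a common workaround is a Box of shape (2,) where the second component is thresholded (say, at 0.5) to decide shooting.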
    Is it possible to prove that an imitation learning agent cannot surpass an expert guide policy in expected reward?
If you have an expert guide policy in a particular environment and you want to train an agent using imitation learning (the particular method is not that important, but offline imitation learning is perhaps the most straightforward) in the same environment using the same reward function, you would expect the imitation learning agent to be (in expectation) less successful than the guide policy. I think this is the case because we can view the imitation learning agent as a sort of degraded version of the guide policy (if we assume that the guide policy is complex enough to not be perfectly mimicked in every state), so there is no reason to believe that it could attain a higher average reward, right? Is there any sort of proof for this? Or does anyone have any idea on how you could prove this sort of theorem? Thanks in advance :) submitted by /u/C_BearHill [link] [comments]  ( 90 min )
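Not a proof of the exact claim, but one closely related formal result is the behavioral cloning bound of Ross and Bagnell (2010). A sketch of its statement (assuming per-step costs in [0, 1], horizon T, and expert state distribution d_{π*}):

```latex
% If the imitator \hat{\pi} matches the expert \pi^* well on the expert's
% own state distribution,
\mathbb{E}_{s \sim d_{\pi^*}}\!\left[\ell(s, \hat{\pi})\right] \le \epsilon
% then its expected total cost degrades at most quadratically in the horizon:
\quad\Longrightarrow\quad J(\hat{\pi}) \;\le\; J(\pi^*) + T^2 \epsilon .
```

Note the direction: this bounds how much worse the imitator can be; it does not prove it can never do better. In general the theorem as posed is false without extra assumptions, since an imitator can average away the mistakes of a noisy or suboptimal expert; for a deterministic expert that is mimicked perfectly, the values simply coincide.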
    GuardAI trial access
Dear all, During the past few months we have been working on GuardAI, a platform that can assist with testing the security and robustness of AI models. GuardAI lets you simulate a wide range of adversarial ML attacks and natural noises, and test your own models and datasets. Trial access to the platform is available via the link below: https://www.navinfo.eu/services/cybersecurity/guardai/ We would appreciate your expert opinion on the implementation of the adversarial ML attacks in our platform, and your feedback. Should you have any questions or further requests about this platform, please feel free to contact us via: guardaisupport@navinfo.eu Best regards, GuardAI team submitted by /u/GuardAITeam [link] [comments]  ( 86 min )
  • Open

Hey guys, this is my take on a meme generator. I was tired of working on industry-level projects with use cases, so I tried something and it turned out funny. Do check it out; the Github link is in the comments. Suggest some changes as well.
    submitted by /u/Shreya001 [link] [comments]  ( 86 min )
  • Open

    [D] How do you deal with skewed continuous target variables?
I am trying to build a model that predicts an extremely skewed target variable. My independent variables have a low correlation with my dependent variable, which is highly skewed (2% of the data are extremely higher than the rest, which causes my model to make high predictions). submitted by /u/Ok_Challenge1987 [link] [comments]  ( 88 min )
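One common tactic (a sketch, not a universal fix; it assumes a non-negative target where log1p is sensible): train on a transformed target so the extreme tail is compressed, and invert the transform at prediction time.

```python
# Fit on log1p(y) and predict expm1(model output), using scikit-learn's
# TransformedTargetRegressor to handle the round trip automatically.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(),
    func=np.log1p,           # compress the heavy right tail
    inverse_func=np.expm1,   # map predictions back to the original scale
)
# model.fit(X_train, y_train); model.predict(X_test)
```

Quantile transforms of the target, or losses that predict quantiles directly (e.g. pinball loss), are alternatives when log scaling is not enough.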
    [R] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models by FAIR
A detailed and insightful study by the MetaAI team on memorization, overfitting, and forgetting in LLMs. The paper discusses different definitions of "memorization" and how scaling affects the amount of training data that large language models can memorize during the training phase. Studies are also presented on what the forgetting curves look like and how overfitting relates to memorization for these large language models. The Appendix section is a gold mine as well. Annotated version of the paper - Github Link submitted by /u/shreyansh26 [link] [comments]  ( 87 min )
    [P] nbsnapshot: Automated Jupyter notebook testing. 📙
I want to share a project I've been working on to facilitate Jupyter notebook testing! When analyzing data in a Jupyter notebook, I unconsciously memorize "rules of thumb" to determine if my results are correct. For example, I might print some summary statistics and become skeptical of some outputs if they deviate too much from what I've seen historically. For more complex analysis, I often create diagnostic plots (e.g., a histogram) and check them whenever new data arrives. Since I constantly repeat the same process, I figured I'd code a small library to streamline it. nbsnapshot benchmarks a cell's outputs against historical results and raises an error if the output deviates from an expected range (by default, 3 standard deviations from the mean). You can see an example in the image accompanying this post. To learn more, check out the blog post. submitted by /u/ploomber-io [link] [comments]  ( 88 min )
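The core idea is straightforward to sketch (this is the concept only, not nbsnapshot's actual API; function and file names here are made up for illustration):

```python
# Store a history of a metric and fail when a new value falls outside
# mean +/- 3 standard deviations of the recorded history.
import json, pathlib, statistics

def check_snapshot(name, value, history_dir=".snapshots", n_std=3.0):
    path = pathlib.Path(history_dir) / f"{name}.json"
    history = json.loads(path.read_text()) if path.exists() else []
    if len(history) >= 2:                      # need at least 2 points for stdev
        mean = statistics.mean(history)
        std = statistics.stdev(history)
        if abs(value - mean) > n_std * std:
            raise ValueError(f"{name}={value} outside {mean} +/- {n_std}*{std}")
    history.append(value)                      # record the new observation
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(history))
```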
    [D] At what point does data augmentation stop making a difference for language models?
Is there any work which shows at what point data augmentation stops making a difference? Say you have GPT-3-scale data; then you probably don't get gains from data augmentation, but in the low-data regime you definitely get gains. Is there a systematic study that gets to the bottom of this? submitted by /u/Economy-Pipe-6184 [link] [comments]  ( 87 min )
    [P] Feedback for our model evaluation and interpretability platform, $100 gift card for your time!
Hi everyone! I'm Gabriel Bayomi, one of the founders of Unbox (https://unbox.ai) and an ML engineer myself. Over the years, we learned that ML model evaluation is a huge challenge, so we started Unbox to make it easy for ML teams to find failures and biases in their models, figure out their root causes, and use better data to fix them. We're launching the alpha version of our new community edition (free!), and we'd love to get feedback on our product before doing our beta launch. We'll be giving a $100 gift card to folks who are willing to give some time to this and provide feedback on the usability of the product. It's super easy, you just need to sign up here: https://unbox.ai/alpha?ref=reddit. Looking forward to hearing from you! You can always email me with any questions ([gabriel@unbox.ai](mailto:gabriel@unbox.ai) - community Slack coming soon) or throw them out here in the thread. Gabriel submitted by /u/byebaybay [link] [comments]  ( 92 min )
    [R] RWKV-3: Scaling RNN to 1.5B and Reach Transformer LM Performance (without using attention)
    Hi everyone. I posted about my RWKV-2 here a few weeks ago (thanks for the upvote): https://www.reddit.com/r/MachineLearning/comments/veem7o/r_rwkv2_430m_release_a_parallelizable_rnn_with/ And RWKV-3 is better. You are welcome to join the project: https://github.com/BlinkDL/RWKV-LM (I am an independent researcher). The LM (language modeling) and zero-shot performances of RWKV-3 1.5B, after training for just 93B tokens (the full run of 330B tokens is expected to finish in 60 more days, on 8xA100 tf32): https://preview.redd.it/5pqa3iu6orb91.png?width=1068&format=png&auto=webp&s=89f40c6e9967d76d83050af0f5fb9f1b992f4323 RWKV-3 is a 100% pure RNN (the next hidden state depends only on the current hidden state). Hence, RNN might be all you need. Download the 68B-tokens checkpoint: https://huggingface.co/BlinkDL/rwkv-3-pile-1b5 Inference speed on single A40 (tf32): *) RWKV-3 1.5B = always 0.015 sec/token - tested using simple pytorch code (no CUDA), GPU utilization 45%, VRAM 7823M *) GPT2-XL 1.3B = 0.032 sec/token (for ctxlen 1000) - tested using HF, GPU utilization 45% too (interesting), VRAM 9655M How it works: RWKV gathers information to a number of channels, which are also decaying with different speeds as you move to the next token. It's simple once you understand it. Here are some of the TODOs. Let's work together :) https://github.com/BlinkDL/RWKV-LM *) FP16 inference & training, and scaling to 6B -> 20B -> 66B (there will be compute when we have the infrastructure). RWKV is very scalable if we look at the 169M-430M-1.5B results. *) HuggingFace integration, and optimized CPU & iOS & Android & WASM & WebGL inference. RWKV is friendly for edge devices. Let's make it possible to run a LLM on your phone. *) Test it on bidirectional & MLM tasks, and image & audio & video tokens. submitted by /u/bo_peng [link] [comments]  ( 90 min )
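A toy sketch of the decaying-channel token mixing described above (my reading of the prose, not BlinkDL's actual kernel; names and shapes are illustrative):

```python
# Each channel keeps a running weighted sum of past values that decays at
# its own per-channel rate w, so the next state depends only on the
# previous state and the current input -- a pure RNN recurrence.
import numpy as np

def decay_mix(values, w):
    """values: (T, C) per-token channel values; w: (C,) decay rates in (0, 1)."""
    T, C = values.shape
    state = np.zeros(C)
    out = np.empty_like(values)
    for t in range(T):
        state = w * state + values[t]   # old information fades per channel
        out[t] = state
    return out
```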
    [D] What’s the latest in ML music generation?
I have been peripherally interested in ML music generation models over the years and followed Google's Magenta project. Recently I've been trying to get up to speed with what's been happening in this space. Looking at Magenta's homepage, it seems like the last paper they published was Listen to Transformers, about 2 years ago. Does anyone know what's been going on recently? Has Magenta died off? Has any other company/lab been working on this and open-sourced their models? submitted by /u/iamjaiyam [link] [comments]  ( 89 min )
[D] [Discussion] Gaussian Distribution - PhD-level, very tricky equations
The following derivation comes from page 83 of Bishop's book, which is reachable through this link: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf I found equation (2.61) very tricky, and no one in my PhD lab was able to work out how it is derived. Is someone able to clarify this mathematically? Thank you in advance. submitted by /u/ProfitCute5415 [link] [comments]  ( 88 min )
    [P] A python module to fetch relevant papers based on keywords from different sources, including Arxiv, ACL, ACM, PMLR, CVF etc. and fetch all citations of a research paper from google scholar
Hi folks, I was working on a personal experimental project which I have now decided to open source. It saves a lot of time in literature research. If you are an industrial researcher or in academia, you probably spend much time reading research articles and news related to your topic. Finding relevant documents on the internet takes time, and you probably know the pain of extracting citations of articles from different websites. Previously I used to fetch papers from Google or Semantic Scholar, but Semantic Scholar does not show correct paper citations. I am excited to announce RESP: Research Papers Search. Features: (1) fetch all citations of a single paper from Google Scholar in CSV format; (2) fetch all related papers of a single paper from Google Scholar in CSV format; (3) fetch all connected papers from connectedpapers.com (it does not use a citation tree; it uses similarity to build graphs) in CSV format; (4) fetch relevant papers based on keywords from different sources, including Arxiv, ACL, ACM, PMLR, NeurIPS, CVF etc., in CSV format. GITHUB: https://github.com/monk1337/resp Examples: https://github.com/monk1337/resp/tree/main/examples I hope it will be helpful in your research. Thanks :) submitted by /u/aadityaura [link] [comments]  ( 89 min )
    [D] Does anyone else feel that machine learning papers are getting very "wordy"?
So I was looking at which papers cited "On the Opportunities and Risks of Foundation Models" (which is a very wordy paper), using Google Scholar, when I realized that most of the papers citing it are also very wordy. I don't know any of the authors here, and I'm just picking a few that I saw: https://arxiv.org/pdf/2202.07096.pdf https://arxiv.org/pdf/2109.07573.pdf https://arxiv.org/pdf/2203.07785.pdf https://arxiv.org/pdf/2111.07765.pdf https://arxiv.org/pdf/2111.15366.pdf https://arxiv.org/pdf/2109.08270.pdf https://arxiv.org/pdf/2110.15444.pdf https://arxiv.org/pdf/2205.00538.pdf Now a lot of these papers look interesting, but I am turned off by wall after wall of text (usually not self-contained, referencing a bunch of prior work). Does anyone else feel like this is becoming a trend in ML? Or is this kind of style the norm in this field? submitted by /u/fromnighttilldawn [link] [comments]  ( 91 min )
    [D] 2nd AutoML Fall School
Given last year's interest from this subreddit in our AutoML Fall School, we are happy to announce the second AutoML Fall School, which will be held in person in Freiburg (Germany) from October 10th to October 13th. AutoML can be a vital tool for many machine learning practitioners and researchers. While students and professionals are eager to learn more about AutoML, it is rarely taught and addressed in courses in today's academic landscape. With the AutoML Fall School we aim to close this glaring gap by providing a platform for graduate students and researchers to learn about core aspects of AutoML. The event will feature lectures and invited talks by renowned experts on topics from fundamental theory to advanced state-of-the-art methods and current challenges such as neural architecture search and automated reinforcement learning. Additionally, you will be able to try your hand at implementing leading AutoML solutions in our hands-on sessions while being mentored by AutoML experts, as well as network and exchange ideas at our social events, and much more. Registrations are now open! Find a preliminary schedule, additional information, and the registration details on our official website. In case you need even more motivation: the city of Freiburg im Breisgau, where we will host the venue, was ranked 3rd in the category Top 10 Cities in "Best in Travel 2022" by Lonely Planet. We are looking forward to seeing you in October in Germany's greenest and sunniest city. submitted by /u/Science_Squid [link] [comments]  ( 88 min )
    [D] Resources on writing reproducible code?
What are some resources on writing reproducible ML code (excluding, of course, floating-point addition randomness, etc.)? So far, I'm following pretty standard software engineering principles that I learned in class: documentation comments, modularization, function deduplication, READMEs. I'm planning to write unit tests for some of my preprocessing steps as well. But there is a whole class of other factors that affect the code: the GPU I'm using, system parameters, etc. Short of just listing the computer specs, is there any easy way to perhaps bundle things like drivers into a git repo? submitted by /u/ElectronicCress3132 [link] [comments]  ( 89 min )
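A minimal sketch of the two steps most guides agree on (assuming PyTorch; file names are illustrative): seed every RNG the training loop touches, and record the software environment next to the code.

```python
import random, subprocess
import numpy as np
import torch

def set_seed(seed: int = 0):
    """Seed every RNG the training loop touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False

def record_environment(path: str = "environment.txt"):
    """Dump package versions next to the code, since drivers can't live in git."""
    pkgs = subprocess.run(["pip", "freeze"], capture_output=True, text=True).stdout
    with open(path, "w") as f:
        f.write(pkgs)
        f.write(f"torch={torch.__version__}, cuda={torch.version.cuda}\n")
```

For the GPU/driver side, the usual answer is not to bundle drivers into git but to pin a CUDA base image in a Dockerfile committed alongside the repo.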
    [D] What are people using to organize large groups of people for data labelling?
    I'm thinking of hiring a bunch of people to label a ton of data. What is the best software to do this? I specifically want to use my own labelers. submitted by /u/vanilla-acc [link] [comments]  ( 94 min )
    [P] How to tackle Time-series Classification with a large number of categorical variables/attributes ( >100) with high cardinality? I'm open to discussing other ways as well.
I am predicting whether a particular event will occur in the next n timeframes, given categorical variables with high cardinality. Please let me know if there is anything we can do to tackle this problem. submitted by /u/madlad612 [link] [comments]  ( 89 min )
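One common tactic for high-cardinality categoricals (a sketch, not a full solution; dimensions are illustrative): replace each categorical with a small learned embedding instead of one-hot encoding.

```python
# One small embedding table per categorical attribute; the concatenated
# vectors can feed any downstream sequence classifier.
import torch
import torch.nn as nn

class CatEmbedder(nn.Module):
    def __init__(self, cardinalities, dim=8):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(card, dim) for card in cardinalities)

    def forward(self, x):                  # x: (batch, n_cats) integer indices
        return torch.cat([tab(x[:, i])     # -> (batch, n_cats * dim)
                          for i, tab in enumerate(self.tables)], dim=-1)
```

Hashing the categories down to a fixed vocabulary first, or target encoding, are alternatives when the cardinality is too large even for embedding tables.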
    [D] ML architecture for adaptive setting suggestions in a stage-dependent program
I've got a problem that I could use some insight on. Summary: I need to design a ML architecture that suggests parameters to a program depending on the observed performance of the user. The program has a set number of stages, and the objective of the program is to improve user performance as much as possible within the stage limit. Problem description: To provide an example, let's say we have 3 stages in the program and the user starts off at stage 1. The program takes two parameters at each stage that determine the difficulty of that stage: Alpha and Beta, which both range from [0, 10], inclusive. The user completes stage 1 and a summarization score on user performance is returned based on a radar chart produced by the program. For this example, let's say the score is a 3 out of…  ( 89 min )
  • Open

    Build a news-based real-time alert system with Twitter, Amazon SageMaker, and Hugging Face
Today, social media is a huge source of news. Users rely on platforms like Facebook and Twitter to consume news. For certain industries such as insurance companies, first responders, law enforcement, and government agencies, being able to quickly process news about relevant events occurring can help them take action while these events are still unfolding. […]  ( 9 min )
  • Open

    Gaussian elimination
    When you solve systems of linear equations, you probably use Gaussian elimination, even if you don’t call it that. You may learn Gaussian elimination before you see it formalized in terms of matrices. So if you’ve had a course in linear algebra, and you sign up for a course in numerical linear algebra, it’s natural […] Gaussian elimination first appeared on John D. Cook.  ( 5 min )
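As a concrete anchor (my example, not from the post): numerically, solving a small system delegates to LU factorization, which is Gaussian elimination with pivoting.

```python
# np.linalg.solve factors A and back-substitutes, i.e. it performs
# Gaussian elimination under the hood.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # 2x + y = 3
b = np.array([3.0, 5.0])     # x + 3y = 5
x = np.linalg.solve(A, b)
print(x)                     # [0.8 1.4]
```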
  • Open

    Meet the Omnivore: Animator Entertains and Explains With NVIDIA Omniverse
    Australian animator Marko Matosevic is taking jokes from a children’s school dads’ group and breathing them into animated life with NVIDIA Omniverse, a virtual world simulation and collaboration platform for 3D workflows. The post Meet the Omnivore: Animator Entertains and Explains With NVIDIA Omniverse appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    The Kitten Effect
One thing I've noticed with image-generating algorithms is that the more of something they have to put in an image, the worse it is. I first noticed this with the kitten-generating variant of StyleGAN, which often does okay on one cat: [image: shocked_pikachu.png] but is  ( 4 min )
    Bonus: How closely can I look at a giraffe?
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Top AI Resources You Must Follow If You Are Into AI
    How to keep up with the latest machine learning advancements  ( 14 min )
  • Open

    Several Approximation Algorithms for Sparse Best Rank-1 Approximation to Higher-Order Tensors. (arXiv:2012.03092v2 [math.NA] UPDATED)
    Sparse tensor best rank-1 approximation (BR1Approx), which is a sparsity generalization of the dense tensor BR1Approx, and is a higher-order extension of the sparse matrix BR1Approx, is one of the most important problems in sparse tensor decomposition and related problems arising from statistics and machine learning. By exploiting the multilinearity as well as the sparsity structure of the problem, four approximation algorithms are proposed, which are easily implemented, of low computational complexity, and can serve as initial procedures for iterative algorithms. In addition, theoretically guaranteed worst-case approximation lower bounds are proved for all the algorithms. We provide numerical experiments on synthetic and real data to illustrate the effectiveness of the proposed algorithms.  ( 2 min )
    Wide Neural Networks Forget Less Catastrophically. (arXiv:2110.11526v3 [cs.LG] UPDATED)
    A primary focus area in continual learning research is alleviating the "catastrophic forgetting" problem in neural networks by designing new algorithms that are more robust to the distribution shifts. While the recent progress in continual learning literature is encouraging, our understanding of what properties of neural networks contribute to catastrophic forgetting is still limited. To address this, instead of focusing on continual learning algorithms, in this work, we focus on the model itself and study the impact of "width" of the neural network architecture on catastrophic forgetting, and show that width has a surprisingly significant effect on forgetting. To explain this effect, we study the learning dynamics of the network from various perspectives such as gradient orthogonality, sparsity, and lazy training regime. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.  ( 2 min )
    Lipschitz Continuity Retained Binary Neural Network. (arXiv:2207.06540v1 [cs.LG])
Relying on the premise that the performance of a binary neural network can be largely restored by eliminating the quantization error between full-precision weight vectors and their corresponding binary vectors, existing works on network binarization frequently adopt the idea of model robustness to reach the aforementioned objective. However, robustness remains an ill-defined concept without solid theoretical support. In this work, we introduce Lipschitz continuity, a well-defined functional property, as the rigorous criterion to define model robustness for BNN. We then propose to retain the Lipschitz continuity as a regularization term to improve the model robustness. Particularly, while the popular Lipschitz-involved regularization methods often collapse in BNN due to its extreme sparsity, we design the Retention Matrices to approximate spectral norms of the targeted weight matrices, which can be deployed as the approximation for the Lipschitz constant of BNNs without the exact Lipschitz constant computation (NP-hard). Our experiments prove that our BNN-specific regularization method can effectively strengthen the robustness of BNN (testified on ImageNet-C), achieving state-of-the-art performance on CIFAR and ImageNet.  ( 2 min )
    Distance Learner: Incorporating Manifold Prior to Model Training. (arXiv:2207.06888v1 [cs.LG])
    The manifold hypothesis (real world data concentrates near low-dimensional manifolds) is suggested as the principle behind the effectiveness of machine learning algorithms in very high dimensional problems that are common in domains such as vision and speech. Multiple methods have been proposed to explicitly incorporate the manifold hypothesis as a prior in modern Deep Neural Networks (DNNs), with varying success. In this paper, we propose a new method, Distance Learner, to incorporate this prior for DNN-based classifiers. Distance Learner is trained to predict the distance of a point from the underlying manifold of each class, rather than the class label. For classification, Distance Learner then chooses the class corresponding to the closest predicted class manifold. Distance Learner can also identify points as being out of distribution (belonging to neither class), if the distance to the closest manifold is higher than a threshold. We evaluate our method on multiple synthetic datasets and show that Distance Learner learns much more meaningful classification boundaries compared to a standard classifier. We also evaluate our method on the task of adversarial robustness, and find that it not only outperforms standard classifier by a large margin, but also performs at par with classifiers trained via state-of-the-art adversarial training.  ( 2 min )
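The decision rule described in this abstract is easy to sketch (a toy rendering under assumed shapes, not the authors' code; the trained distance predictor is taken as given):

```python
# Pick the class whose predicted manifold distance is smallest, or flag
# the input as out-of-distribution past a threshold.
import numpy as np

def classify(distances, threshold):
    """distances: (n_classes,) predicted distance to each class manifold."""
    c = int(np.argmin(distances))
    if distances[c] > threshold:
        return None          # out of distribution: near no class manifold
    return c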
    Language models show human-like content effects on reasoning. (arXiv:2207.07051v1 [cs.CL])
    Abstract reasoning is a key ability for an intelligent system. Large language models achieve above-chance performance on abstract reasoning tasks, but exhibit many imperfections. However, human abstract reasoning is also imperfect, and depends on our knowledge and beliefs about the content of the reasoning problem. For example, humans reason much more reliably about logical rules that are grounded in everyday situations than arbitrary rules about abstract attributes. The training experiences of language models similarly endow them with prior expectations that reflect human knowledge and beliefs. We therefore hypothesized that language models would show human-like content effects on abstract reasoning problems. We explored this hypothesis across three logical reasoning tasks: natural language inference, judging the logical validity of syllogisms, and the Wason selection task (Wason, 1968). We find that state of the art large language models (with 7 or 70 billion parameters; Hoffman et al., 2022) reflect many of the same patterns observed in humans across these tasks -- like humans, models reason more effectively about believable situations than unrealistic or abstract ones. Our findings have implications for understanding both these cognitive effects, and the factors that contribute to language model performance.  ( 2 min )
    problexity -- an open-source Python library for binary classification problem complexity assessment. (arXiv:2207.06709v1 [cs.LG])
    The classification problem's complexity assessment is an essential element of many topics in the supervised learning domain. It plays a significant role in meta-learning -- becoming the basis for determining meta-attributes or multi-criteria optimization -- allowing the evaluation of the training set resampling without needing to rebuild the recognition model. The tools currently available for the academic community, which would enable the calculation of problem complexity measures, are available only as libraries of the C++ and R languages. This paper describes the software module that allows for the estimation of 22 complexity measures for the Python language -- compatible with the scikit-learn programming interface -- allowing for the implementation of research using them in the most popular programming environment of the machine learning community.  ( 2 min )
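A sketch of typical usage, with names as I recall them from the project's README; treat the exact entry points as assumptions rather than verified API:

```python
# scikit-learn style fit on a binary classification dataset, then read off
# the aggregated complexity score and the individual measures.
from sklearn.datasets import make_classification
import problexity as px  # assumed import name

X, y = make_classification(n_samples=500, n_classes=2, random_state=0)
cc = px.ComplexityCalculator()   # assumed entry point
cc.fit(X, y)                     # scikit-learn compatible interface
print(cc.score())                # aggregated complexity of the problem
print(cc.complexity)             # assumed: the 22 individual measures
```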
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v1 [stat.ML])
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.  ( 2 min )
    Multitrack Music Transformer: Learning Long-Term Dependencies in Music with Diverse Instruments. (arXiv:2207.06983v1 [cs.SD])
    Existing approaches for generating multitrack music with transformer models have been limited to either a small set of instruments or short music segments. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations for multitrack music. In this work, we propose a compact representation that allows a diverse set of instruments while keeping a short sequence length. Using our proposed representation, we present the Multitrack Music Transformer (MTMT) for learning long-term dependencies in multitrack music. In a subjective listening test, our proposed model achieves competitive quality on unconditioned generation against two baseline models. We also show that our proposed model can generate samples that are twice as long as those produced by the baseline models, and, further, can do so in half the inference time. Moreover, we propose a new measure for analyzing musical self-attentions and show that the trained model learns to pay less attention to notes that form a dissonant interval with the current note, yet attending more to notes that are 4N beats away from current. Finally, our findings provide a novel foundation for future work exploring longer-form multitrack music generation and improving self-attentions for music. All source code and audio samples can be found at https://salu133445.github.io/mtmt/ .  ( 3 min )
    Verification of Sigmoidal Artificial Neural Networks using iSAT. (arXiv:2207.06755v1 [cs.AI])
    This paper presents an approach for verifying the behaviour of nonlinear Artificial Neural Networks (ANNs) found in cyber-physical safety-critical systems. We implement a dedicated interval constraint propagator for the sigmoid function into the SMT solver iSAT and compare this approach with a compositional approach encoding the sigmoid function by basic arithmetic features available in iSAT and an approximating approach. Our experimental results show that the dedicated and the compositional approach clearly outperform the approximating approach. Throughout all our benchmarks, the dedicated approach showed an equal or better performance compared to the compositional approach.  ( 2 min )
    A Personalized Zero-Shot ECG Arrhythmia Monitoring System: From Sparse Representation Based Domain Adaption to Energy Efficient Abnormal Beat Detection for Practical ECG Surveillance. (arXiv:2207.07089v1 [cs.LG])
    This paper proposes a low-cost and highly accurate ECG-monitoring system intended for personalized early arrhythmia detection for wearable mobile sensors. Earlier supervised approaches for personalized ECG monitoring require both abnormal and normal heartbeats for the training of the dedicated classifier. However, in a real-world scenario where the personalized algorithm is embedded in a wearable device, such training data is not available for healthy people with no cardiac disorder history. In this study, (i) we propose a null space analysis on the healthy signal space obtained via sparse dictionary learning, and investigate how a simple null space projection or alternatively regularized least squares-based classification methods can reduce the computational complexity, without sacrificing the detection accuracy, when compared to sparse representation-based classification. (ii) Then we introduce a sparse representation-based domain adaptation technique in order to project other existing users' abnormal and normal signals onto the new user's signal space, enabling us to train the dedicated classifier without having any abnormal heartbeat of the new user. Therefore, zero-shot learning can be achieved without the need for synthetic abnormal heartbeat generation. An extensive set of experiments performed on the benchmark MIT-BIH ECG dataset shows that when this domain adaptation-based training data generator is used with a simple 1-D CNN classifier, the method outperforms the prior work by a significant margin. (iii) Then, by combining (i) and (ii), we propose an ensemble classifier that further improves the performance. This approach for zero-shot arrhythmia detection achieves an average accuracy level of 98.2% and an F1-Score of 92.8%. Finally, a personalized energy-efficient ECG monitoring scheme is proposed using the above-mentioned innovations.  ( 3 min )
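A toy rendering of the projection intuition in this abstract (my simplification, not the authors' method; the learned "healthy" dictionary is taken as given):

```python
# Project a beat onto the span of a healthy dictionary; a large residual
# means the beat carries energy outside the healthy signal space.
import numpy as np

def residual_score(x, D):
    """x: (d,) heartbeat segment; D: (d, k) healthy dictionary atoms."""
    coef, *_ = np.linalg.lstsq(D, x, rcond=None)   # least-squares fit to healthy space
    return np.linalg.norm(x - D @ coef)            # energy the fit cannot explain

def is_abnormal(x, D, threshold):
    return residual_score(x, D) > threshold
```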
    On the Strong Correlation Between Model Invariance and Generalization. (arXiv:2207.07065v1 [cs.LG])
    Generalization and invariance are two essential properties of any machine learning model. Generalization captures a model's ability to classify unseen data while invariance measures consistency of model predictions on transformations of the data. Existing research suggests a positive relationship: a model generalizing well should be invariant to certain visual factors. Building on this qualitative implication we make two contributions. First, we introduce effective invariance (EI), a simple and reasonable measure of model invariance which does not rely on image labels. Given predictions on a test image and its transformed version, EI measures how well the predictions agree and with what level of confidence. Second, using invariance scores computed by EI, we perform large-scale quantitative correlation studies between generalization and invariance, focusing on rotation and grayscale transformations. From a model-centric view, we observe generalization and invariance of different models exhibit a strong linear relationship, on both in-distribution and out-of-distribution datasets. From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets. Apart from these major findings, other minor but interesting insights are also discussed.  ( 2 min )
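One simple instantiation consistent with that description of EI (a sketch; the paper's exact definition may differ in detail):

```python
# Geometric mean of the two confidences when the predicted labels agree,
# zero otherwise -- no image labels needed.
import numpy as np

def effective_invariance(p_orig, p_trans):
    """p_orig, p_trans: softmax vectors for an image and its transformed version."""
    if np.argmax(p_orig) != np.argmax(p_trans):
        return 0.0                       # disagreement: no invariance credit
    return float(np.sqrt(p_orig.max() * p_trans.max()))
```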
    An Asymmetric Contrastive Loss for Handling Imbalanced Datasets. (arXiv:2207.07080v1 [cs.LG])
    Contrastive learning is a representation learning method performed by contrasting a sample to other similar samples so that they are brought closely together, forming clusters in the feature space. The learning process is typically conducted using a two-stage training architecture, and it utilizes the contrastive loss (CL) for its feature learning. Contrastive learning has been shown to be quite successful in handling imbalanced datasets, in which some classes are overrepresented while some others are underrepresented. However, previous studies have not specifically modified CL for imbalanced datasets. In this work, we introduce an asymmetric version of CL, referred to as ACL, in order to directly address the problem of class imbalance. In addition, we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL). Results on the FMNIST and ISIC 2018 imbalanced datasets show that AFCL is capable of outperforming CL and FCL in terms of both weighted and unweighted classification accuracies. In the appendix, we provide a full axiomatic treatment on entropy, along with complete proofs.  ( 2 min )
Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory. (arXiv:2110.11291v4 [stat.ML] UPDATED)
    Schr\"odinger Bridge (SB) is an entropy-regularized optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Scored-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing log-likelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory - a mathematical methodology appeared in stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalizes the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing applications of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10. Our code is available at https://github.com/ghliu/SB-FBSDE.  ( 3 min )
    Work In Progress: Safety and Robustness Verification of Autoencoder-Based Regression Models using the NNV Tool. (arXiv:2207.06759v1 [cs.LG])
    This work in progress paper introduces robustness verification for autoencoder-based regression neural network (NN) models, following state-of-the-art approaches for robustness verification of image classification NNs. Despite the ongoing progress in developing verification methods for safety and robustness in various deep neural networks (DNNs), robustness checking of autoencoder models has not yet been considered. We explore this open space of research and check ways to bridge the gap between existing DNN verification methods by extending existing robustness analysis methods for such autoencoder networks. While classification models using autoencoders work more or less similar to image classification NNs, the functionality of regression models is distinctly different. We introduce two definitions of robustness evaluation metrics for autoencoder-based regression models, specifically the percentage robustness and un-robustness grade. We also modified the existing Imagestar approach, adjusting the variables to take care of the specific input types for regression networks. The approach is implemented as an extension of NNV, then applied and evaluated on a dataset, with a case study experiment shown using the same dataset. As per the authors' understanding, this work in progress paper is the first to show possible reachability analysis of autoencoder-based NNs.  ( 3 min )
    Noise-Stable Rigid Graphs for Euclidean Embedding. (arXiv:1907.06441v5 [cs.CG] UPDATED)
We proposed a new criterion, "noise-stability", which revised the classical rigidity theory, for evaluation of MDS algorithms which can truthfully represent the fidelity of global structure reconstruction; then we proved the noise-stability of the cMDS algorithm in generic conditions, which provides a rigorous theoretical guarantee for the precision and theoretical bounds for Euclidean embedding and its application in fields including wireless sensor network localization and satellite positioning. Furthermore, we looked into previous work about minimum-cost globally rigid spanning subgraph, and proposed an algorithm to construct a minimum-cost noise-stable spanning graph in the Euclidean space, which enabled reliable localization on sparse graphs of noisy distance constraints with linear numbers of edges and sublinear costs in total edge lengths. Additionally, this algorithm also suggests a scheme to reconstruct point clouds from pairwise distances at a minimum of $O(n)$ time complexity, down from $O(n^3)$ for cMDS.  ( 2 min )
    Data-Free Neural Architecture Search via Recursive Label Calibration. (arXiv:2112.02086v2 [cs.LG] UPDATED)
    This paper aims to explore the feasibility of neural architecture search (NAS) given only a pre-trained model without using any original training data. This is an important circumstance for privacy protection, bias avoidance, etc., in real-world scenarios. To achieve this, we start by synthesizing usable data through recovering the knowledge from a pre-trained deep neural network. Then we use the synthesized data and their predicted soft-labels to guide neural architecture search. We identify that the NAS task requires the synthesized data (we target at image domain here) with enough semantics, diversity, and a minimal domain gap from the natural images. For semantics, we propose recursive label calibration to produce more informative outputs. For diversity, we propose a regional update strategy to generate more diverse and semantically-enriched synthetic data. For minimal domain gap, we use input and feature-level regularization to mimic the original data distribution in latent space. We instantiate our proposed framework with three popular NAS algorithms: DARTS, ProxylessNAS and SPOS. Surprisingly, our results demonstrate that the architectures discovered by searching with our synthetic data achieve accuracy that is comparable to, or even higher than, architectures discovered by searching from the original ones, for the first time, deriving the conclusion that NAS can be done effectively with no need of access to the original or called natural data if the synthesis method is well designed.  ( 3 min )
    Bootstrapped Masked Autoencoders for Vision BERT Pretraining. (arXiv:2207.07116v1 [cs.CV])
    We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) momentum encoder that provides online feature as extra BERT prediction targets; 2) target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information in BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features as the BERT prediction target for masked tokens can achieve better pretraining performance. Therefore, we add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we introduce target-specific information (e.g., pixel values of unmasked patches) from the encoder directly to the decoder to reduce the pressure on the encoder of memorizing the target-specific information. Thus, the encoder focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity in memorizing the information of unmasked tokens related to the prediction target. Through extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming MAE by $+0.8\%$ under the same pre-training epochs. BootMAE also gets $+1.0$ mIoU improvements on semantic segmentation on ADE20K and $+1.3$ box AP, $+1.4$ mask AP improvement on object detection and segmentation on COCO dataset. Code is released at https://github.com/LightDXY/BootMAE.  ( 3 min )
    Graph Modularity: Towards Understanding the Cross-Layer Transition of Feature Representations in Deep Neural Networks. (arXiv:2111.12485v2 [cs.CV] UPDATED)
    There are good arguments to support the claim that deep neural networks (DNNs) capture better feature representations than the previous hand-crafted feature engineering, which leads to a significant performance improvement. In this paper, we move a tiny step towards understanding the dynamics of feature representations over layers. Specifically, we model the process of class separation of intermediate representations in pre-trained DNNs as the evolution of communities in dynamic graphs. Then, we introduce modularity, a generic metric in graph theory, to quantify the evolution of communities. In the preliminary experiment, we find that modularity roughly tends to increase as the layer goes deeper and the degradation and plateau arise when the model complexity is great relative to the dataset. Through an asymptotic analysis, we prove that modularity can be broadly used for different applications. For example, modularity provides new insights to quantify the difference between feature representations. More crucially, we demonstrate that the degradation and plateau in modularity curves represent redundant layers in DNNs and can be pruned with minimal impact on performance, which provides theoretical guidance for layer pruning. Our code is available at https://github.com/yaolu-zjut/Dynamic-Graphs-Construction.  ( 3 min )
    Fully Decentralized Model-based Policy Optimization for Networked Systems. (arXiv:2207.06559v1 [cs.LG])
Reinforcement learning algorithms require a large amount of samples; this often limits their real-world applications on even simple tasks. Such a challenge is more outstanding in multi-agent tasks, as each step of operation is more costly, requiring communications or shifting of resources. This work aims to improve the data efficiency of multi-agent control by model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors, and propose the decentralized model-based policy optimization framework (DMPO). In our method, each agent learns a dynamic model to predict future states and broadcasts its predictions by communication, and then the policies are trained under the model rollouts. To alleviate the bias of model-generated data, we restrain the model usage to generating myopic rollouts, thus reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient is a close approximation to true policy gradients. We evaluate our algorithm on several benchmarks for intelligent transportation systems, which are connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods using true models.  ( 3 min )
    Discovery of New Multi-Level Features for Domain Generalization via Knowledge Corruption. (arXiv:2109.04320v2 [cs.LG] UPDATED)
    Machine learning models that can generalize to unseen domains are essential when applied in real-world scenarios involving strong domain shifts. We address the challenging domain generalization (DG) problem, where a model trained on a set of source domains is expected to generalize well in unseen domains without any exposure to their data. The main challenge of DG is that the features learned from the source domains are not necessarily present in the unseen target domains, leading to performance deterioration. We assume that learning a richer set of features is crucial to improve the transfer to a wider set of unknown domains. For this reason, we propose COLUMBUS, a method that enforces new feature discovery via a targeted corruption of the most relevant input and multi-level representations of the data. We conduct an extensive empirical evaluation to demonstrate the effectiveness of the proposed approach which achieves new state-of-the-art results by outperforming 18 DG algorithms on multiple DG benchmark datasets in the DomainBed framework.
    HyGNN: Drug-Drug Interaction Prediction via Hypergraph Neural Network. (arXiv:2206.12747v2 [q-bio.QM] UPDATED)
    Drug-Drug Interactions (DDIs) may hamper the functionalities of drugs, and in the worst scenario, they may lead to adverse drug reactions (ADRs). Predicting all DDIs is a challenging and critical problem. Most existing computational models integrate drug-centric information from different sources and leverage them as features in machine learning classifiers to predict DDIs. However, these models have a high chance of failure, especially for the new drugs when all the information is not available. This paper proposes a novel Hypergraph Neural Network (HyGNN) model based on only the SMILES string of drugs, available for any drug, for the DDI prediction problem. To capture the drug similarities, we create a hypergraph from drugs' chemical substructures extracted from the SMILES strings. Then, we develop HyGNN consisting of a novel attention-based hypergraph edge encoder to get the representation of drugs as hyperedges and a decoder to predict the interactions between drug pairs. Furthermore, we conduct extensive experiments to evaluate our model and compare it with several state-of-the-art methods. Experimental results demonstrate that our proposed HyGNN model effectively predicts DDIs and impressively outperforms the baselines with a maximum ROC-AUC and PR-AUC of 97.9% and 98.1%, respectively.
    Continuous-time Analysis for Variational Inequalities: An Overview and Desiderata. (arXiv:2207.07105v1 [stat.ML])
    Algorithms that solve zero-sum games, multi-objective agent objectives, or, more generally, variational inequality (VI) problems are notoriously unstable on general problems. Owing to the increasing need for solving such problems in machine learning, this instability has been highlighted in recent years as a significant research challenge. In this paper, we provide an overview of recent progress in the use of continuous-time perspectives in the analysis and design of methods targeting the broad VI problem class. Our presentation draws parallels between single-objective problems and multi-objective problems, highlighting the challenges of the latter. We also formulate various desiderata for algorithms that apply to general VIs and we argue that achieving these desiderata may profit from an understanding of the associated continuous-time dynamics.
    AGIC: Approximate Gradient Inversion Attack on Federated Learning. (arXiv:2204.13784v3 [cs.LG] UPDATED)
Federated learning is a private-by-design distributed learning paradigm where clients train local models on their own data before a central server aggregates their local updates to compute a global model. Depending on the aggregation method used, the local updates are either the gradients or the weights of local learning models. Recent reconstruction attacks apply a gradient inversion optimization on the gradient update of a single minibatch to reconstruct the private data used by clients during training. As the state-of-the-art reconstruction attacks solely focus on a single update, realistic adversarial scenarios are overlooked, such as observation across multiple updates and updates trained from multiple mini-batches. A few studies consider a more challenging adversarial scenario where only model updates based on multiple mini-batches are observable, and resort to computationally expensive simulation to untangle the underlying samples for each local step. In this paper, we propose AGIC, a novel Approximate Gradient Inversion Attack that efficiently and effectively reconstructs images from either model or gradient updates, and across multiple epochs. In a nutshell, AGIC (i) approximates gradient updates of used training samples from model updates to avoid costly simulation procedures, (ii) leverages gradient/model updates collected from multiple epochs, and (iii) assigns increasing weights to layers with respect to the neural network structure for reconstruction quality. We extensively evaluate AGIC on three datasets, CIFAR-10, CIFAR-100 and ImageNet. Our results show that AGIC increases the peak signal-to-noise ratio (PSNR) by up to 50% compared to two representative state-of-the-art gradient inversion attacks. Furthermore, AGIC is faster than the state-of-the-art simulation based attack, e.g., it is 5x faster when attacking FedAvg with 8 local steps in between model updates.
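The core optimization behind this family of attacks is compact enough to sketch. The following is a generic gradient-inversion loop, not AGIC itself: dummy inputs are optimized so that their gradients match the observed update, and the per-layer weights `w` stand in for AGIC's increasing layer weighting (all names are illustrative).

```python
import torch

def invert(model, loss_fn, observed_grads, x_shape, y, steps=200, lr=0.1):
    x = torch.randn(x_shape, requires_grad=True)       # dummy data to be recovered
    opt = torch.optim.Adam([x], lr=lr)
    n = len(observed_grads)
    w = [1.0 + i / n for i in range(n)]                # later layers weighted more
    for _ in range(steps):
        opt.zero_grad()
        grads = torch.autograd.grad(loss_fn(model(x), y),
                                    model.parameters(), create_graph=True)
        match = sum(wl * (g - og).pow(2).sum()
                    for wl, g, og in zip(w, grads, observed_grads))
        match.backward()                               # gradient w.r.t. the dummy inputs
        opt.step()
    return x.detach()
```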
    A comparison of latent semantic analysis and correspondence analysis of document-term matrices. (arXiv:2108.06197v3 [cs.IR] UPDATED)
    Latent semantic analysis (LSA) and correspondence analysis (CA) are two techniques that use a singular value decomposition (SVD) for dimensionality reduction. LSA has been extensively used to obtain low-dimensional representations that capture relationships among documents and terms. In this article, we present a theoretical analysis and comparison of the two techniques in the context of document-term matrices. We show that CA has some attractive properties as compared to LSA, for instance that effects of margins arising from differing document-lengths and term-frequencies are effectively eliminated, so that the CA solution is optimally suited to focus on relationships among documents and terms. A unifying framework is proposed that includes both CA and LSA as special cases. We empirically compare CA to various LSA based methods on text categorization in English and authorship attribution on historical Dutch texts, and find that CA performs significantly better. We also apply CA to a long-standing question regarding the authorship of the Dutch national anthem Wilhelmus and provide further support that it can be attributed to the author Datheen, amongst several contenders.
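Since both techniques are SVDs of (differently transformed) document-term matrices, the contrast is easy to state in code. A minimal numpy sketch using the standard definitions, where `X` is a raw count matrix and `k` the target dimension; the division by the margins in `ca` is precisely what removes the document-length and term-frequency effects discussed above.

```python
import numpy as np

def lsa(X, k):
    U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
    return U[:, :k] * s[:k]                            # document coordinates

def ca(X, k):
    P = X / X.sum()                                    # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)                # row and column margins
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c)) # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]    # row principal coordinates
```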
    A survey on domain adaptation theory: learning bounds and theoretical guarantees. (arXiv:2004.11829v6 [cs.LG] UPDATED)
Classical machine learning algorithms, spanning both supervised and semi-supervised learning, work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be reconstructed from newly collected data, which for some applications can be costly or impossible to obtain. It has therefore become necessary to develop approaches that reduce the need and the effort to obtain new labeled samples by exploiting data available in related areas and reusing them across similar fields. This has given rise to a new machine learning framework known as transfer learning: a learning setting inspired by the capability of a human being to extrapolate knowledge across tasks to learn more efficiently. Despite the large number of different transfer learning scenarios, the main objective of this survey is to provide an overview of the state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning called domain adaptation. In this sub-field, the data distribution is assumed to change between the training and the test data, while the learning task remains the same. We provide a first up-to-date description of existing results related to the domain adaptation problem, covering learning bounds based on different statistical learning frameworks.
    Adversarial Graph Contrastive Learning with Information Regularization. (arXiv:2202.06491v4 [cs.LG] UPDATED)
Contrastive learning is an effective unsupervised method for graph representation learning. Recently, data augmentation based contrastive learning methods have been extended from images to graphs. However, most prior works are directly adapted from models designed for images. Unlike data augmentation on images, data augmentation on graphs is far less intuitive, and it is much harder to provide high-quality contrastive samples, which are the key to the performance of contrastive learning models. This leaves much room for improvement over existing graph contrastive learning frameworks. In this work, by introducing an adversarial graph view and an information regularizer, we propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL), to extract informative contrastive samples within a reasonable constraint. It consistently outperforms current graph contrastive learning methods on the node classification task over various real-world datasets and further improves the robustness of graph contrastive learning.
    Interpretable Decision Trees Through MaxSAT. (arXiv:2110.13854v2 [cs.AI] UPDATED)
We present an approach to improve the accuracy-interpretability trade-off of Machine Learning (ML) Decision Trees (DTs). In particular, we apply Maximum Satisfiability technology to compute Minimum Pure DTs (MPDTs). We improve the runtime of previous approaches and show that these MPDTs can outperform the accuracy of DTs generated with the ML framework sklearn.
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v3 [cs.IT] UPDATED)
Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the statistical learning literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of the Bayesian estimator and specify the decoupled setting for a given nonlinear model. The replica solution shows that strictly nonlinear models establish an all-or-nothing phase transition: there exists a critical load at which optimal Bayesian inference changes from perfect learning to uncorrelated learning. Based on this finding, we design a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secure without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Multilinguals at SemEval-2022 Task 11: Complex NER in Semantically Ambiguous Settings for Low Resource Languages. (arXiv:2207.06882v1 [cs.CL])
We leverage pre-trained language models to solve the task of complex NER for two low-resource languages: Chinese and Spanish. We use the technique of Whole Word Masking (WWM) to boost the performance of the masked language modeling objective on large unsupervised corpora. We experiment with multiple neural network architectures, incorporating CRFs, BiLSTMs, and linear classifiers on top of a fine-tuned BERT layer. All our models outperform the baseline by a significant margin, and our best performing model obtains a competitive position on the evaluation leaderboard for the blind test set.
    Instance Selection Mechanisms for Human-in-the-Loop Systems in Few-Shot Learning. (arXiv:2207.06835v1 [cs.LG])
Business analytics and machine learning have become essential success factors for various industries - with the downside of cost-intensive gathering and labeling of data. Few-shot learning addresses this challenge and reduces data gathering and labeling costs by learning novel classes with very few labeled data. In this paper, we design a human-in-the-loop (HITL) system for few-shot learning and analyze an extensive range of mechanisms that can be used to acquire human expert knowledge for instances with an uncertain prediction outcome. We show that the acquisition of human expert knowledge significantly improves few-shot model performance at negligible labeling effort. We validate our findings in various experiments on a benchmark dataset in computer vision and on real-world datasets. We further demonstrate the cost-effectiveness of HITL systems for few-shot learning. Overall, our work aims at supporting researchers and practitioners in effectively adapting machine learning models to novel classes at reduced costs.
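One common acquisition mechanism in such a HITL loop is easy to sketch: route the least confident predictions to the human expert. Least-confidence sampling is shown below purely as an illustration; the paper compares a much wider range of selection mechanisms.

```python
import numpy as np

def select_for_expert(probs, budget):
    # probs: (n_samples, n_classes) predicted class probabilities.
    confidence = probs.max(axis=1)           # confidence of the top prediction
    return np.argsort(confidence)[:budget]   # least confident samples first

probs = np.array([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]])
print(select_for_expert(probs, budget=1))    # -> [1], the most uncertain sample
```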
    Confident Adaptive Language Modeling. (arXiv:2207.07061v1 [cs.CL])
    Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
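A schematic of per-token early-exit decoding in this spirit, assuming a single sequence and a softmax-confidence measure (both simplifications; CALM studies several confidence measures and calibrates the threshold against sequence-level constraints):

```python
import torch

def early_exit_step(layers, exit_head, h, threshold=0.9):
    # h: (1, seq_len, d_model) hidden states for the sequence so far.
    for i, layer in enumerate(layers):
        h = layer(h)
        logits = exit_head(h[:, -1])         # next-token prediction at this depth
        conf, token = torch.softmax(logits, dim=-1).max(dim=-1)
        if conf.item() >= threshold:
            return token, i + 1              # confident: exit after i+1 layers
    return token, len(layers)                # fell through: full depth used
```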
    Reachability Analysis of a General Class of Neural Ordinary Differential Equations. (arXiv:2207.06531v1 [cs.LG])
Continuous deep learning models, referred to as Neural Ordinary Differential Equations (Neural ODEs), have received considerable attention over the last several years. Despite their burgeoning impact, there is a lack of formal analysis techniques for these systems. In this paper, we consider a general class of neural ODEs with varying architectures and layers, and introduce a novel reachability framework that allows for the formal analysis of their behavior. The methods developed for the reachability analysis of neural ODEs are implemented in a new tool called NNVODE; specifically, our work extends an existing neural network verification tool to support neural ODEs. We demonstrate the capabilities and efficacy of our methods on a set of benchmarks that include neural ODEs used for classification and in control and dynamical systems, and, where possible, we compare our approach against existing software tools from the continuous-time systems reachability literature.
    PIAT: Physics Informed Adversarial Training for Solving Partial Differential Equations. (arXiv:2207.06647v1 [cs.LG])
In this paper, we propose physics informed adversarial training (PIAT) of neural networks for solving nonlinear differential equations (NDEs). It is well known that the standard training of neural networks results in non-smooth functions. Adversarial training (AT) is an established defense mechanism against adversarial attacks which can also help make the solution smooth. AT consists of augmenting the training mini-batch with a perturbation that makes the network output mismatch the desired output adversarially. Unlike formal AT, which relies only on the training data, here we encode the governing physical laws in the form of nonlinear differential equations using automatic differentiation in the adversarial network architecture. We compare PIAT with PINN to demonstrate the effectiveness of our method in solving NDEs in up to 10 dimensions. Moreover, we propose weight decay and Gaussian smoothing to demonstrate the advantages of PIAT. The code repository is available at https://github.com/rohban-lab/PIAT.
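A schematic of the idea, assuming a 1D Poisson equation u''(x) = f(x) as a stand-in NDE and a single FGSM-style ascent step on the collocation points (the paper's setup may differ in both respects):

```python
import torch

def pde_residual(net, x, f):
    # Residual of u''(x) = f(x), with derivatives taken by autograd.
    x = x.requires_grad_(True)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return d2u - f(x)

def piat_loss(net, x, f, eps=1e-2):
    x = x.clone().requires_grad_(True)
    r = pde_residual(net, x, f).pow(2).mean()
    # One ascent step on the collocation points: the adversarial perturbation.
    x_adv = (x + eps * torch.autograd.grad(r, x)[0].sign()).detach()
    return pde_residual(net, x_adv, f).pow(2).mean()
```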
    Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources. (arXiv:2207.06667v1 [cs.DC])
Although more layers and more parameters generally improve model accuracy, such big models have high computational complexity and large memory requirements, which exceed the capacity of small devices for inference and incur long training times. Moreover, the long training and inference times of big models are hard to afford even on high-performance servers. As an efficient approach to compressing a large deep model (a teacher model) into a compact model (a student model), knowledge distillation has emerged as a promising way to deal with big models. Existing knowledge distillation methods cannot exploit elastically available computing resources and thus suffer from low efficiency. In this paper, we propose an Elastic Deep Learning framework for knowledge Distillation, i.e., EDL-Dist. The advantages of EDL-Dist are three-fold. First, inference and training are separated. Second, elastically available computing resources can be utilized to improve efficiency. Third, fault-tolerance of the training and inference processes is supported. Extensive experiments show that the throughput of EDL-Dist is up to 3.125 times higher than that of the baseline method (online knowledge distillation) while the accuracy is similar or higher.
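The distillation objective at the core of such a system is the standard softened cross-entropy of Hinton et al.; EDL-Dist's contribution is the elastic infrastructure around it, so the following is just the standard loss, not the paper's system code.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft target term: KL between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 restores gradient scale
    hard = F.cross_entropy(student_logits, labels)     # ordinary supervised term
    return alpha * soft + (1 - alpha) * hard
```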
    A Unified Granular-ball Learning Model of Pawlak Rough Set and Neighborhood Rough Set. (arXiv:2201.03349v4 [cs.AI] UPDATED)
The Pawlak rough set and the neighborhood rough set are the two most common rough set theoretical models. The Pawlak rough set can use equivalence classes to represent knowledge but cannot process continuous data; the neighborhood rough set can process continuous data but loses the ability to use equivalence classes to represent knowledge. To this end, this paper presents a granular-ball rough set based on granular-ball computing. The granular-ball rough set can simultaneously represent both the Pawlak rough set and the neighborhood rough set, realizing a unified representation of the two. As a result, the granular-ball rough set can not only deal with continuous data but also use equivalence classes for knowledge representation. In addition, we propose an implementation algorithm for the granular-ball rough set. Experimental results on benchmark datasets demonstrate that, owing to the combination of the robustness and adaptability of granular-ball computing, the learning accuracy of the granular-ball rough set is greatly improved compared with the Pawlak rough set and the traditional neighborhood rough set. The granular-ball rough set also outperforms nine popular or state-of-the-art feature selection methods.
    Blurs Behave Like Ensembles: Spatial Smoothings to Improve Accuracy, Uncertainty, and Robustness. (arXiv:2105.12639v4 [cs.LG] UPDATED)
Neural network ensembles, such as Bayesian neural networks (BNNs), have shown success in the areas of uncertainty estimation and robustness. However, a crucial challenge prohibits their use in practice: BNNs require a large number of predictions to produce reliable results, leading to a significant increase in computational cost. To alleviate this issue, we propose spatial smoothing, a method that spatially ensembles neighboring feature map points of convolutional neural networks. By simply adding a few blur layers to the models, we empirically show that spatial smoothing improves the accuracy, uncertainty estimation, and robustness of BNNs across a whole range of ensemble sizes. In particular, BNNs incorporating spatial smoothing achieve high predictive performance with merely a handful of ensembles. Moreover, this method can also be applied to canonical deterministic neural networks to improve their performance. Several lines of evidence suggest that the improvements can be attributed to the stabilized feature maps and the smoothing of the loss landscape. In addition, we provide a fundamental explanation for prior works - namely, global average pooling, pre-activation, and ReLU6 - by treating them as special cases of spatial smoothing. These not only enhance accuracy but also improve uncertainty estimation and robustness by making the loss landscape smoother in the same manner as spatial smoothing. The code is available at https://github.com/xxxnell/spatial-smoothing.
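The blur layer itself is simple enough to state directly. A minimal sketch using a 3x3 box blur that preserves spatial size (the paper ablates several smoothing kernels; the placement between stages follows its description):

```python
import torch.nn as nn

class SpatialSmoothing(nn.Module):
    # Spatially ensembles neighboring feature-map points with a fixed box blur.
    def __init__(self):
        super().__init__()
        self.blur = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return self.blur(x)

# Usage: insert between stages of a CNN, e.g.
# nn.Sequential(stage1, SpatialSmoothing(), stage2, SpatialSmoothing(), stage3)
```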
    CoSCL: Cooperation of Small Continual Learners is Stronger than a Big One. (arXiv:2207.06543v1 [cs.LG])
Continual learning requires incremental compatibility with a sequence of tasks. However, the design of model architecture remains an open question: in general, learning all tasks with a shared set of parameters suffers from severe interference between tasks, while learning each task with a dedicated parameter subspace is limited by scalability. In this work, we theoretically analyze the generalization errors for learning plasticity and memory stability in continual learning, both of which can be uniformly upper-bounded by (1) the discrepancy between task distributions, (2) the flatness of the loss landscape and (3) the cover of the parameter space. Then, inspired by the robust biological learning system that processes sequential experiences with multiple parallel compartments, we propose Cooperation of Small Continual Learners (CoSCL) as a general strategy for continual learning. Specifically, we present an architecture with a fixed number of narrower sub-networks to learn all incremental tasks in parallel, which can naturally reduce the two errors by improving the three components of the upper bound. To strengthen this advantage, we encourage these sub-networks to cooperate by penalizing differences between the predictions made from their feature representations. With a fixed parameter budget, CoSCL can improve a variety of representative continual learning approaches by a large margin (e.g., up to 10.64% on CIFAR-100-SC, 9.33% on CIFAR-100-RS, 11.45% on CUB-200-2011 and 6.72% on Tiny-ImageNet) and achieve new state-of-the-art performance.
    Recurrent Memory Transformer. (arXiv:2207.06881v1 [cs.CL])
Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer). Memory allows the model to store and process local and global information, and to pass information between segments of a long sequence with the help of recurrence. We implement the memory mechanism with no changes to the Transformer model, by adding special memory tokens to the input or output sequence. The Transformer is then trained to control both memory operations and sequence representation processing. Experimental results show that our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.
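The mechanism is simple enough to sketch: memory tokens are concatenated to each segment's input, and the outputs at those positions are carried to the next segment. The sketch below simplifies the paper's read/write memory layout to a single block of tokens; all interfaces are assumed.

```python
import torch

def process_segments(transformer, segments, num_mem, d_model):
    memory = torch.zeros(1, num_mem, d_model)   # learned embeddings in practice
    outputs = []
    for seg in segments:                        # seg: (1, seg_len, d_model) embeddings
        h = transformer(torch.cat([memory, seg], dim=1))
        memory = h[:, :num_mem]                 # updated memory recurs to next segment
        outputs.append(h[:, num_mem:])
    return torch.cat(outputs, dim=1)
```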
    Near-Optimal Bounds for Testing Histogram Distributions. (arXiv:2207.06596v1 [cs.DS])
    We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, $k$-histograms over $[n]$, are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\varepsilon$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde \Theta (\sqrt{nk} / \varepsilon + k / \varepsilon^2 + \sqrt{n} / \varepsilon^2)$.
    Analysis of Catastrophic Forgetting for Random Orthogonal Transformation Tasks in the Overparameterized Regime. (arXiv:2207.06475v1 [cs.LG])
    Overparameterization is known to permit strong generalization performance in neural networks. In this work, we provide an initial theoretical analysis of its effect on catastrophic forgetting in a continual learning setup. We show experimentally that in permuted MNIST image classification tasks, the generalization performance of multilayer perceptrons trained by vanilla stochastic gradient descent can be improved by overparameterization, and the extent of the performance increase achieved by overparameterization is comparable to that of state-of-the-art continual learning algorithms. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem, where each task is related by a random orthogonal transformation. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small if the model is sufficiently overparameterized.
    Cross-Modal Transformer GAN: A Brain Structure-Function Deep Fusing Framework for Alzheimer's Disease. (arXiv:2206.13393v2 [eess.IV] UPDATED)
Cross-modal fusion of different types of neuroimaging data has shown great promise for predicting the progression of Alzheimer's Disease (AD). However, most existing methods applied in neuroimaging cannot efficiently fuse the functional and structural information from multi-modal neuroimages. In this work, a novel cross-modal transformer generative adversarial network (CT-GAN) is proposed to fuse functional information contained in resting-state functional magnetic resonance imaging (rs-fMRI) and structural information contained in Diffusion Tensor Imaging (DTI). The developed bi-attention mechanism can match functional information to structural information efficiently and maximize the capability of extracting complementary information from rs-fMRI and DTI. By capturing the deep complementary information between structural features and functional features, the proposed CT-GAN can detect AD-related brain connectivity, which could be used as a biomarker of AD. Experimental results show that the proposed model can not only improve classification performance but also effectively detect AD-related brain connectivity.
    Improved OOD Generalization via Conditional Invariant Regularizer. (arXiv:2207.06687v1 [cs.LG])
Recently, generalization on out-of-distribution (OOD) data with correlation shift has attracted great attention. The correlation shift is caused by spurious attributes that correlate with the class label, as the correlation between them may differ between training and test data. For such a problem, we show that, given the class label, models that are conditionally independent of the spurious attributes are OOD generalizable. Based on this, a metric Conditional Spurious Variation (CSV), which controls the OOD generalization error, is proposed to measure such conditional independence. To improve OOD generalization, we regularize the training process with the proposed CSV. Under mild assumptions, our training objective can be formulated as a nonconvex-concave mini-max problem. An algorithm with a provable convergence rate is proposed to solve the problem. Extensive empirical results verify our algorithm's efficacy in improving OOD generalization.
    Multi-Level Branched Regularization for Federated Learning. (arXiv:2207.06936v1 [cs.LG])
    A critical challenge of federated learning is data heterogeneity and imbalance across clients, which leads to inconsistency between local networks and unstable convergence of global models. To alleviate the limitations, we propose a novel architectural regularization technique that constructs multiple auxiliary branches in each local model by grafting local and global subnetworks at several different levels and that learns the representations of the main pathway in the local model congruent to the auxiliary hybrid pathways via online knowledge distillation. The proposed technique is effective to robustify the global model even in the non-iid setting and is applicable to various federated learning frameworks conveniently without incurring extra communication costs. We perform comprehensive empirical studies and demonstrate remarkable performance gains in terms of accuracy and efficiency compared to existing methods. The source code is available at our project page.
    MedFuse: Multi-modal fusion with clinical time-series data and chest X-ray images. (arXiv:2207.07027v1 [eess.IV])
    Multi-modal fusion approaches aim to integrate information from different data sources. Unlike natural datasets, such as in audio-visual applications, where samples consist of "paired" modalities, data in healthcare is often collected asynchronously. Hence, requiring the presence of all modalities for a given sample is not realistic for clinical tasks and significantly limits the size of the dataset during training. In this paper, we propose MedFuse, a conceptually simple yet promising LSTM-based fusion module that can accommodate uni-modal as well as multi-modal input. We evaluate the fusion method and introduce new benchmark results for in-hospital mortality prediction and phenotype classification, using clinical time-series data in the MIMIC-IV dataset and corresponding chest X-ray images in MIMIC-CXR. Compared to more complex multi-modal fusion strategies, MedFuse provides a performance improvement by a large margin on the fully paired test set. It also remains robust across the partially paired test set containing samples with missing chest X-ray images. We release our code for reproducibility and to enable the evaluation of competing models in the future.
    A Query-Optimal Algorithm for Finding Counterfactuals. (arXiv:2207.07072v1 [cs.DS])
    We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes \[ {S(f)^{O(\Delta_f(x^\star))}\cdot \log d}\] queries to $f$ and returns {an {\sl optimal}} counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x')\ne f(x^\star)$. Here $S(f)$ is the sensitivity of $f$, a discrete analogue of the Lipschitz constant, and $\Delta_f(x^\star)$ is the distance from $x^\star$ to its nearest counterfactuals. The previous best known query complexity was $d^{\,O(\Delta_f(x^\star))}$, achievable by brute-force local search. We further prove a lower bound of $S(f)^{\Omega(\Delta_f(x^\star))} + \Omega(\log d)$ on the query complexity of any algorithm, thereby showing that the guarantees of our algorithm are essentially optimal.
    Equivariant Hypergraph Diffusion Neural Operators. (arXiv:2207.06680v1 [cs.LG])
    Hypergraph neural networks (HNNs) using neural networks to encode hypergraphs provide a promising way to model higher-order relations in data and further solve relevant prediction tasks built upon such higher-order relations. However, higher-order relations in practice contain complex patterns and are often highly irregular. So, it is often challenging to design an HNN that suffices to express those relations while keeping computational efficiency. Inspired by hypergraph diffusion algorithms, this work proposes a new HNN architecture named ED-HNN, which provably represents any continuous equivariant hypergraph diffusion operators that can model a wide range of higher-order relations. ED-HNN can be implemented efficiently by combining star expansions of hypergraphs with standard message passing neural networks. ED-HNN further shows great superiority in processing heterophilic hypergraphs and constructing deep models. We evaluate ED-HNN for node classification on nine real-world hypergraph datasets. ED-HNN uniformly outperforms the best baselines over these nine datasets and achieves more than 2\%$\uparrow$ in prediction accuracy over four datasets therein.
    Closing the Loop: A Framework for Trustworthy Machine Learning in Power Systems. (arXiv:2203.07505v2 [eess.SY] UPDATED)
    Deep decarbonization of the energy sector will require massive penetration of stochastic renewable energy resources and an enormous amount of grid asset coordination; this represents a challenging paradigm for the power system operators who are tasked with maintaining grid stability and security in the face of such changes. With its ability to learn from complex datasets and provide predictive solutions on fast timescales, machine learning (ML) is well-posed to help overcome these challenges as power systems transform in the coming decades. In this work, we outline five key challenges (dataset generation, data pre-processing, model training, model assessment, and model embedding) associated with building trustworthy ML models which learn from physics-based simulation data. We then demonstrate how linking together individual modules, each of which overcomes a respective challenge, at sequential stages in the machine learning pipeline can help enhance the overall performance of the training process. In particular, we implement methods that connect different elements of the learning pipeline through feedback, thus "closing the loop" between model training, performance assessments, and re-training. We demonstrate the effectiveness of this framework, its constituent modules, and its feedback connections by learning the N-1 small-signal stability margin associated with a detailed model of a proposed North Sea Wind Power Hub system.
    Evaluating Multimodal Interactive Agents. (arXiv:2205.13274v2 [cs.LG] UPDATED)
    Creating agents that can interact naturally with humans is a common goal in artificial intelligence (AI) research. However, evaluating these interactions is challenging: collecting online human-agent interactions is slow and expensive, yet faster proxy metrics often do not correlate well with interactive evaluation. In this paper, we assess the merits of these existing evaluation metrics and present a novel approach to evaluation called the Standardised Test Suite (STS). The STS uses behavioural scenarios mined from real human interaction data. Agents see replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and sent to human annotators to mark as success or failure, and agents are ranked according to the proportion of continuations in which they succeed. The resulting STS is fast, controlled, interpretable, and representative of naturalistic interactions. Altogether, the STS consolidates much of what is desirable across many of our standard evaluation metrics, allowing us to accelerate research progress towards producing agents that can interact naturally with humans. A video may be found at https://youtu.be/YR1TngGORGQ.
    HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning. (arXiv:2201.04182v3 [cs.LG] UPDATED)
    In this work we propose a HyperTransformer, a Transformer-based model for supervised and semi-supervised few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Since the dependence of a small generated CNN model on a specific task is encoded by a high-capacity Transformer model, we effectively decouple the complexity of the large task space from the complexity of individual tasks. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal and better performance is attained when the information about the task can modulate all model parameters. For larger models we discover that generating the last layer alone allows us to produce competitive or better results than those obtained with state-of-the-art methods while being end-to-end differentiable.
    Learning Representations for CSI Adaptive Quantization and Feedback. (arXiv:2207.06924v1 [eess.SP])
In this work, we propose an efficient method for channel state information (CSI) adaptive quantization and feedback in frequency division duplexing (FDD) systems. Existing works mainly focus on the implementation of autoencoder (AE) neural networks (NNs) for CSI compression and consider straightforward quantization methods, e.g., uniform quantization, which are generally not optimal. With this strategy, it is hard to achieve a low reconstruction error, especially when the number of bits reserved for latent space quantization is small. To address this issue, we recommend two different methods: one based on post-training quantization, and a second in which the codebook is found during the training of the AE. Both strategies achieve better reconstruction accuracy compared to standard quantization techniques.
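The two strategies differ only in where the codebook comes from. A sketch of the post-training variant under assumed interfaces: fit a codebook to the trained AE's latent vectors (k-means here) and quantize by nearest-neighbor lookup; the second strategy instead learns the codebook jointly while training the AE.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(latents, bits):
    # latents: (n_samples, latent_dim) outputs of the trained encoder.
    return KMeans(n_clusters=2 ** bits, n_init=10).fit(latents).cluster_centers_

def quantize(z, codebook):
    # Nearest-neighbor lookup; the indices are the feedback bits.
    idx = np.argmin(((z[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    return codebook[idx], idx
```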
    Contextual Inverse Optimization: Offline and Online Learning. (arXiv:2106.14015v2 [cs.LG] UPDATED)
We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after the fact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. In the offline setting, the decision-maker has information available from past periods and needs to make one decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based on a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon.
    Identifying Orientation-specific Lipid-protein Fingerprints using Deep Learning. (arXiv:2207.06630v1 [q-bio.BM])
    Improved understanding of the relation between the behavior of RAS and RAF proteins and the local lipid environment in the cell membrane is critical for getting insights into the mechanisms underlying cancer formation. In this work, we employ deep learning (DL) to learn this relationship by predicting protein orientational states of RAS and RAS-RAF protein complexes with respect to the lipid membrane based on the lipid densities around the protein domains from coarse-grained (CG) molecular dynamics (MD) simulations. Our DL model can predict six protein states with an overall accuracy of over 80%. The findings of this work offer new insights into how the proteins modulate the lipid environment, which in turn may assist designing novel therapies to regulate such interactions in the mechanisms associated with cancer development.
    Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss. (arXiv:2204.13437v2 [cs.SD] UPDATED)
Recent deep learning Text-to-Speech (TTS) systems have achieved impressive performance by generating speech close to human parity. However, they suffer from training stability issues as well as incorrect alignment of the intermediate acoustic representation with the input text sequence. In this work, we introduce Regotron, a regularized version of Tacotron2 which aims to alleviate the training issues and at the same time produce monotonic alignments. Our method augments the vanilla Tacotron2 objective function with an additional term which penalizes non-monotonic alignments in the location-sensitive attention mechanism. By properly adjusting this regularization term we show that the loss curves become smoother, and that Regotron consistently produces monotonic alignments on unseen examples even at an early stage of training (13% of the total number of epochs), whereas the fully converged Tacotron2 fails to do so. Moreover, the proposed regularization method has no additional computational overhead, while reducing common TTS mistakes and achieving slightly improved speech naturalness according to subjective mean opinion scores (MOS) collected from 50 evaluators.
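A plausible form of such a regularizer, reconstructed from the description above rather than taken from the paper: compute the expected (soft) attended input position at each decoder step and penalize any decrease.

```python
import torch

def monotonic_alignment_loss(attn):
    # attn: (batch, dec_steps, enc_steps) location-sensitive attention weights.
    pos = torch.arange(attn.size(-1), dtype=attn.dtype)
    centroid = (attn * pos).sum(-1)   # expected input position per decoder step
    return torch.relu(centroid[:, :-1] - centroid[:, 1:]).mean()  # penalize backtracking
```

Added to the vanilla Tacotron2 loss with a small coefficient, a term of this shape is zero exactly when the soft alignment moves monotonically forward.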
    A Data-Efficient Deep Learning Framework for Segmentation and Classification of Histopathology Images. (arXiv:2207.06489v1 [eess.IV])
The current study of the cell architecture of inflammation in histopathology images, commonly performed for diagnosis and research purposes, excludes a lot of the information available on the biopsy slide. In autoimmune diseases, major outstanding research questions remain regarding which cell types participate in inflammation at the tissue level, and how they interact with each other. While these questions can be partially answered using traditional methods, artificial intelligence approaches for segmentation and classification provide a much more efficient method to understand the architecture of inflammation in autoimmune disease, holding great promise for novel insights. In this paper, we empirically develop deep learning approaches that use dermatomyositis biopsies of human tissue to detect and identify inflammatory cells. Our approach improves classification performance by 26% and segmentation performance by 5%. We also propose a novel post-processing autoencoder architecture that improves segmentation performance by an additional 3%. We have open-sourced our approach and architecture at https://github.com/pranavsinghps1/DEDL
    FOCUS: Familiar Objects in Common and Uncommon Settings. (arXiv:2110.03804v2 [cs.CV] UPDATED)
    Standard training datasets for deep learning often contain objects in common settings (e.g., "a horse on grass" or "a ship in water") since they are usually collected by randomly scraping the web. Uncommon and rare settings (e.g., "a plane on water", "a car in snowy weather") are thus severely under-represented in the training data. This can lead to an undesirable bias in model predictions towards common settings and create a false sense of accuracy. In this paper, we introduce FOCUS (Familiar Objects in Common and Uncommon Settings), a dataset for stress-testing the generalization power of deep image classifiers. By leveraging the power of modern search engines, we deliberately gather data containing objects in common and uncommon settings in a wide range of locations, weather conditions, and time of day. We present a detailed analysis of the performance of various popular image classifiers on our dataset and demonstrate a clear drop in performance when classifying images in uncommon settings. By analyzing deep features of these models, we show that such errors can be due to the use of spurious features in model predictions. We believe that our dataset will aid researchers in understanding the inability of deep models to generalize well to uncommon settings and drive future work on improving their distributional robustness.
    Have we been Naive to Select Machine Learning Models? Noisy Data are here to Stay!. (arXiv:2207.06651v1 [cs.LG])
The model selection procedure is usually a single-criterion decision in which we select the model that maximizes a specific metric on a specific set, such as validation set performance. We claim this is very naive and can yield poor selections of over-fitted models due to the over-searching phenomenon, which over-estimates performance on that specific set. Furthermore, real-world data contain noise that should not be ignored by the model selection procedure and must be taken into account when performing model selection. We also define four theoretical optimality conditions that we can pursue to better select models, and we analyze candidate models using a multi-criteria decision-making algorithm (TOPSIS) that considers proxies for the optimality conditions in order to select reasonable models.
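TOPSIS itself is a few lines, applied here with each row as a candidate model and each column as a proxy for one optimality condition (assumed higher-is-better; the exact proxies are the paper's):

```python
import numpy as np

def topsis(scores, weights):
    # scores: (n_models, n_criteria); weights: (n_criteria,) summing to 1.
    Z = scores / np.linalg.norm(scores, axis=0)  # vector-normalize each criterion
    V = Z * weights
    ideal, anti = V.max(axis=0), V.min(axis=0)   # ideal and anti-ideal points
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - anti, axis=1)
    return d_neg / (d_pos + d_neg)               # closeness; select the argmax
```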
    Open High-Resolution Satellite Imagery: The WorldStrat Dataset -- With Application to Super-Resolution. (arXiv:2207.06418v1 [eess.IV])
Analyzing the planet at scale with satellite imagery and machine learning is a dream that has been constantly hindered by the cost of difficult-to-access, highly-representative, high-resolution imagery. To remediate this, we introduce here the WorldStrat dataset, the largest and most varied such publicly available dataset, at the Airbus SPOT 6/7 satellites' high resolution of up to 1.5 m/pixel. Empowered by the European Space Agency's Phi-Lab as part of the ESA-funded QueryPlanet project, we curate nearly 10,000 sq km of unique locations to ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities. We also enrich these with locations typically under-represented in ML datasets: sites of humanitarian interest, illegal mining sites, and settlements of persons at risk. We temporally match each high-resolution image with multiple low-resolution images from the freely accessible lower-resolution Sentinel-2 satellites at 10 m/pixel. We accompany this dataset with an open-source Python package to rebuild or extend the WorldStrat dataset, train and infer baseline algorithms, and learn with abundant tutorials, all compatible with the popular EO-learn toolbox. We hereby hope to foster broad-spectrum applications of ML to satellite imagery, and possibly develop from free public low-resolution Sentinel-2 imagery the same power of analysis allowed by costly private high-resolution imagery. We illustrate this specific point by training and releasing several highly compute-efficient baselines on the task of Multi-Frame Super-Resolution. High-resolution Airbus imagery is CC BY-NC, while the labels and Sentinel-2 imagery are CC BY, and the source code and pre-trained models are under BSD. The dataset is available at https://zenodo.org/record/6810792 and the software package at https://github.com/worldstrat/worldstrat .
    Low-skilled Occupations Face the Highest Re-skilling Pressure. (arXiv:2101.11505v2 [cs.CY] UPDATED)
Substantial scholarship has estimated the susceptibility of jobs to automation, but little has examined how job contents evolve in the information age as new technologies substitute for tasks, shifting required skills rather than eliminating entire jobs. Here we explore the patterns and consequences of changes in occupational skill contents and characterize the occupations and workers subject to the greatest re-skilling pressure. Recent research suggests that high-skilled STEM and technology-intensive occupations have experienced the highest rates of skill content change. Analyzing 727 occupations across 167 million job posts covering the near-universe of the U.S. online labor market between 2010 and 2018, we find that when skill distance is accounted for, re-skilling pressure is much higher for low-skilled occupations, no matter how "low-skill" is defined, whether by skill number, pay level, or education degree. We investigate the implications of uneven occupational skill change for workers and find that those from large labor markets and large employers experienced less change, while non-white males in low-skill jobs are the most demographically vulnerable. We conclude by discussing the broad potential of our skill embedding model, which learns skill proximity from skill co-presence across job posts and represents it as distance in the high-dimensional space of complex human capital that corresponds with skilling costs for workers. This model offers a fine-grained measure of the extent to which jobs evolve, and also indicates in what direction jobs are evolving, as illustrated by the decline in demand for human-interface skills and the rise in demand for those at the machine-interface.
    Learning to Detect Slip with Barometric Tactile Sensors and a Temporal Convolutional Neural Network. (arXiv:2202.09549v2 [cs.RO] UPDATED)
The ability to perceive object slip via tactile feedback enables humans to accomplish complex manipulation tasks, including maintaining a stable grasp. Despite the utility of tactile information for many applications, tactile sensors have yet to be widely deployed in industrial robotics settings; part of the challenge lies in identifying slip and other events from the tactile data stream. In this paper, we present a learning-based method to detect slip using barometric tactile sensors. These sensors have many desirable properties, including high durability and reliability, and are built from inexpensive, off-the-shelf components. We train a temporal convolutional neural network to detect slip, achieving high detection accuracy while displaying robustness to the speed and direction of the slip motion. Further, we test our detector on two manipulation tasks involving a variety of common objects and demonstrate successful generalization to real-world scenarios not seen during training. We argue that barometric tactile sensing technology, combined with data-driven learning, is suitable for many manipulation tasks such as slip compensation.
    Auto-weighted Robust Federated Learning with Corrupted Data Sources. (arXiv:2101.05880v3 [cs.LG] UPDATED)
Federated learning provides a communication-efficient and privacy-preserving training process by enabling learning of statistical models with massive numbers of participants while keeping their data on local clients. However, standard federated learning techniques that naively minimize an average loss function are vulnerable to data corruption from outliers, systematic mislabeling, or even adversaries. In addition, service providers are often prohibited from verifying the quality of data samples due to increasing concern over user data privacy. In this paper, we address this challenge by proposing Auto-weighted Robust Federated Learning (arfl), a novel approach that jointly learns the global model and the weights of local updates to provide robustness against corrupted data sources. We prove a learning bound on the expected risk with respect to the predictor and the weights of clients, which guides the definition of the objective for robust federated learning. The weights are allocated by comparing the empirical loss of a client with the average loss of the best p clients (p-average); we can thus downweight clients with significantly high losses, thereby lowering their contributions to the global model. We show that this approach achieves robustness when the data of corrupted clients is distributed differently from that of benign ones. To optimize the objective function, we propose a communication-efficient algorithm based on the blockwise minimization paradigm. We conduct experiments on multiple benchmark datasets, including CIFAR-10, FEMNIST and Shakespeare, considering different deep neural network models. The results show that our solution is robust against different scenarios, including label shuffling, label flipping and noisy features, and outperforms the state-of-the-art methods in most scenarios.
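A schematic of the weighting rule described above, in a simplified hard-margin form; the linear decay is illustrative only, since the paper derives the exact allocation from its learning bound.

```python
import numpy as np

def client_weights(losses, p):
    # losses: empirical loss per client; downweight those far above the
    # average loss of the best p clients (the p-average).
    best_p_avg = np.mean(np.sort(losses)[:p])
    w = np.maximum(0.0, 2.0 * best_p_avg - losses)   # linear decay is illustrative
    if w.sum() == 0:
        return np.full(len(losses), 1.0 / len(losses))
    return w / w.sum()
```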
    Fixing Inventory Inaccuracies At Scale. (arXiv:2006.13126v3 [stat.ML] UPDATED)
    Inaccurate records of inventory occur frequently, and by some measures cost retailers approximately 4% in annual sales. Detecting inventory inaccuracies manually is cost-prohibitive, and existing algorithmic solutions rely almost exclusively on learning from longitudinal data, which is insufficient in the dynamic environment induced by modern retail operations. Instead, we propose a solution based on cross-sectional data over stores and SKUs, observing that detecting inventory inaccuracies can be viewed as a problem of identifying anomalies in a (low-rank) Poisson matrix. State-of-the-art approaches to anomaly detection in low-rank matrices apparently fall short. Specifically, from a theoretical perspective, recovery guarantees for these approaches require that non-anomalous entries be observed with vanishingly small noise (which is not the case in our problem, and indeed in many applications). So motivated, we propose a conceptually simple entry-wise approach to anomaly detection in low-rank Poisson matrices. Our approach accommodates a general class of probabilistic anomaly models. We show that the cost incurred by our algorithm approaches that of an optimal algorithm at a min-max optimal rate. Using synthetic data and real data from a consumer goods retailer, we show that our approach provides up to a 10x cost reduction over incumbent approaches to anomaly detection. Along the way, we build on recent work that seeks entry-wise error guarantees for matrix completion, establishing such guarantees for sub-exponential matrices, a result of independent interest.
    Low-Precision Arithmetic for Fast Gaussian Processes. (arXiv:2207.06856v1 [cs.LG])
    Low-precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory and energy requirements. However, despite its promise, low-precision arithmetic has received little attention for Gaussian processes (GPs), largely because GPs require sophisticated linear algebra routines that are unstable in low-precision. We study the different failure modes that can occur when training GPs in half precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low-precision over a wide range of settings, enabling GPs to train on $1.8$ million data points in $10$ hours on a single GPU, without any sparse approximations.
    A Meta-learning Formulation of the Autoencoder Problem. (arXiv:2207.06676v1 [cs.LG])
    A rapidly growing area of research is the use of machine learning approaches such as autoencoders for dimensionality reduction of data and models in scientific applications. We show that the canonical formulation of autoencoders suffers from several deficiencies that can hinder their performance. Using a meta-learning approach, we reformulate the autoencoder problem as a bi-level optimization procedure that explicitly solves the dimensionality reduction task. We prove that the new formulation corrects the identified deficiencies with canonical autoencoders, provide a practical way to solve it, and showcase the strength of this formulation with a simple numerical illustration.
    DropNet: Reducing Neural Network Complexity via Iterative Pruning. (arXiv:2207.06646v1 [cs.LG])
    Modern deep neural networks require a significant amount of computing time and power to train and deploy, which limits their usage on edge devices. Inspired by the iterative weight pruning in the Lottery Ticket Hypothesis, we propose DropNet, an iterative pruning method which prunes nodes/filters to reduce network complexity. DropNet iteratively removes nodes/filters with the lowest average post-activation value across all training samples. Empirically, we show that DropNet is robust across diverse scenarios, including MLPs and CNNs using the MNIST, CIFAR-10 and Tiny ImageNet datasets. We show that up to 90% of the nodes/filters can be removed without any significant loss of accuracy. The final pruned network performs well even with reinitialization of the weights and biases. DropNet also has similar accuracy to an oracle which greedily removes nodes/filters one at a time to minimise training loss, highlighting its effectiveness.
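The pruning criterion is concrete enough to sketch. Under assumed shapes, this ranks filters by their average post-activation value over training samples and masks out the lowest-scoring fraction (masking stands in for actual removal):

```python
import torch

def dropnet_mask(activations, frac):
    # activations: (num_samples, num_filters, H, W) post-ReLU feature maps.
    score = activations.mean(dim=(0, 2, 3))   # average post-activation per filter
    k = int(frac * score.numel())
    drop = torch.argsort(score)[:k]           # lowest-scoring filters
    mask = torch.ones_like(score)
    mask[drop] = 0.0
    return mask                               # multiply into the layer's output
```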
    Anomal-E: A Self-Supervised Network Intrusion Detection System based on Graph Neural Networks. (arXiv:2207.06819v1 [cs.LG])
This paper investigates the application of Graph Neural Networks (GNNs) to self-supervised network intrusion and anomaly detection. GNNs are a deep learning approach for graph-based data that incorporate graph structures into learning in order to generalise graph representations and output embeddings. As network flows are naturally graph-based, GNNs are a suitable fit for analysing and learning network behaviour. The majority of current implementations of GNN-based Network Intrusion Detection Systems (NIDSs) rely heavily on labelled network traffic, which can not only restrict the amount and structure of input traffic but also the NIDSs' potential to adapt to unseen attacks. To overcome these restrictions, we present Anomal-E, a GNN approach to intrusion and anomaly detection that leverages edge features and graph topological structure in a self-supervised process. This approach is, to the best of our knowledge, the first successful and practical approach to network intrusion detection that utilises network flows in a self-supervised, edge-leveraging GNN. Experimental results on two modern benchmark NIDS datasets not only clearly display the improvement gained by using Anomal-E embeddings rather than raw features, but also the potential Anomal-E has for detection on wild network traffic.
    Towards Adaptive Unknown Authentication for Universal Domain Adaptation by Classifier Paradox. (arXiv:2207.04494v1 [cs.CV] CROSS LISTED)
    Universal domain adaptation (UniDA) is a general unsupervised domain adaptation setting, which addresses both domain and label shifts in adaptation. Its main challenge lies in how to identify target samples in unshared or unknown classes. Previous methods commonly strive to depict sample "confidence" along with a threshold for rejecting unknowns, and align feature distributions of shared classes across domains. However, it is still hard to pre-specify a "confidence" criterion and threshold which are adaptive to various real tasks, and a mis-prediction of unknowns further incurs misalignment of features in shared classes. In this paper, we propose a new UniDA method with adaptive Unknown Authentication by Classifier Paradox (UACP), considering that samples with paradoxical predictions are probably unknowns belonging to none of the source classes. In UACP, a composite classifier is jointly designed with two types of predictors. That is, a multi-class (MC) predictor classifies samples to one of the multiple source classes, while a binary one-vs-all (OVA) predictor further verifies the prediction by MC predictor. Samples with verification failure or paradox are identified as unknowns. Further, instead of feature alignment for shared classes, implicit domain alignment is conducted in output space such that samples across domains share the same decision boundary, though with feature discrepancy. Empirical results validate UACP under both open-set and universal UDA settings.
    Deep Dictionary Learning with An Intra-class Constraint. (arXiv:2207.06841v1 [cs.LG])
    In recent years, deep dictionary learning (DDL) has attracted a great amount of attention due to its effectiveness for representation learning and visual recognition. However, most existing methods focus on unsupervised deep dictionary learning, failing to further explore the category information. To make full use of the category information of different samples, we propose a novel deep dictionary learning model with an intra-class constraint (DDLIC) for visual classification. Specifically, we design the intra-class compactness constraint on the intermediate representation at different levels to encourage the intra-class representations to be closer to each other, and eventually the learned representation becomes more discriminative. Unlike the traditional DDL methods, during the classification stage, our DDLIC performs a layer-wise greedy optimization in a similar way to the training stage. Experimental results on four image datasets show that our method is superior to the state-of-the-art methods.
    Collaborative Machine Learning-Driven Internet of Medical Things -- A Systematic Literature Review. (arXiv:2207.06416v1 [cs.LG])
    The growing adoption of IoT devices for healthcare has enabled researchers to build intelligence using all the data produced by these devices. Monitoring and diagnosing health have been the two most common scenarios where such devices have proven beneficial. Achieving high prediction accuracy was a top priority initially, but the focus has slowly shifted to efficiency and higher throughput, and processing the data from these devices in a distributed manner has proven to help achieve both. Since the field of machine learning is vast with numerous state-of-the-art algorithms in play, it has been a challenge to identify the algorithms that perform best in different scenarios. In this literature review, we explored the distributed machine learning algorithms tested by the authors of the selected studies and identified the ones that achieved the best prediction accuracy in each healthcare scenario. While no algorithm performed consistently, Random Forest performed the best in a few studies. This could serve as a good starting point for future studies on collaborative machine learning on IoMT data.
    Differentiable Logics for Neural Network Training and Verification. (arXiv:2207.06741v1 [cs.AI])
    The rising popularity of neural networks (NNs) in recent years and their increasing prevalence in real-world applications have drawn attention to the importance of their verification. While verification is known to be computationally difficult in theory, many techniques have been proposed for solving it in practice. It has been observed in the literature that by default neural networks rarely satisfy the logical constraints that we want to verify. A good course of action is to train the given NN to satisfy said constraint prior to verifying it. This idea is sometimes referred to as continuous verification, referring to the loop between training and verification. Usually training with constraints is implemented by specifying a translation from a given formal logic language into loss functions. These loss functions are then used to train neural networks. Because for training purposes these functions need to be differentiable, these translations are called differentiable logics (DL). This raises several research questions. What kind of differentiable logics are possible? What difference does a specific choice of DL make in the context of continuous verification? What are the desirable criteria for a DL viewed from the point of view of the resulting loss function? In this extended abstract we discuss and answer these questions.
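    To make the logic-to-loss translation concrete, here is one standard DL choice (the product t-norm; a generic sketch, not a construction taken from this paper): the formula A -> B is mapped to a loss that is large exactly when A holds but B does not, and gradients flow through both predicted truth values.

```python
import torch

def implies_loss(p_antecedent, p_consequent):
    """Product-t-norm translation of A -> B, with truth values as
    probabilities in [0, 1]: the loss is the degree to which A holds
    while B fails."""
    return p_antecedent * (1.0 - p_consequent)

# Example: penalize a classifier whenever it is confident the input is a
# "vehicle" (A) but not confident it is "man-made" (B).
p_vehicle = torch.tensor([0.9, 0.2], requires_grad=True)
p_manmade = torch.tensor([0.4, 0.9], requires_grad=True)
loss = implies_loss(p_vehicle, p_manmade).mean()
loss.backward()   # gradients flow into both predictions
```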
    How do tuna schools associate to dFADs? A study using echo-sounder buoys to identify global patterns. (arXiv:2207.07049v1 [stat.ML])
    Based on the data gathered by echo-sounder buoys attached to drifting Fish Aggregating Devices (dFADs) across tropical oceans, the current study applies a Machine Learning protocol to examine the temporal trends of tuna schools' association to drifting objects. Using a binary output, metrics typically used in the literature were adapted to account for the fact that the entire tuna aggregation under the dFAD was considered. The median time it took tuna to colonize the dFADs for the first time varied between 25 and 43 days, depending on the ocean, and the longest soak and colonization times were registered in the Pacific Ocean. The tuna schools' Continuous Residence Times were generally shorter than Continuous Absence Times (median values between 5 and 7 days, and 9 and 11 days, respectively), in line with the results found by previous studies. Using a regression output, two novel metrics, namely aggregation time and disaggregation time, were estimated to obtain further insight into the symmetry of the aggregation process. Across all oceans, the time it took for the tuna aggregation to depart from the dFADs was not significantly longer than the time it took for the aggregation to form. The value of these results in the context of the "ecological trap" hypothesis is discussed, and further analyses to enrich and make use of this data source are proposed.
    RSD-GAN: Regularized Sobolev Defense GAN Against Speech-to-Text Adversarial Attacks. (arXiv:2207.06858v1 [cs.SD])
    This paper introduces a new synthesis-based defense algorithm for counteracting a variety of adversarial attacks developed to challenge the performance of cutting-edge speech-to-text transcription systems. Our algorithm implements a Sobolev-based GAN and proposes a novel regularizer for effectively controlling the functionality of the entire generative model, particularly the discriminator network, during training. Our results from numerous experiments on the victim DeepSpeech, Kaldi, and Lingvo speech transcription systems corroborate the remarkable performance of our defense approach against a comprehensive range of targeted and non-targeted adversarial attacks.
    Rethinking Multidimensional Discriminator Output for Generative Adversarial Networks. (arXiv:2109.03378v3 [stat.ML] UPDATED)
    The study of multidimensional discriminator (critic) output for Generative Adversarial Networks has been underexplored in the literature. In this paper, we generalize the Wasserstein GAN framework to take advantage of multidimensional critic output and explore its properties. We also introduce a square-root velocity transformation (SRVT) block which favors training in the multidimensional setting. Proofs of properties are based on our proposed maximal p-centrality discrepancy, which is bounded above by p-Wasserstein distance and fits the Wasserstein GAN framework with multidimensional critic output n. Especially when n = 1 and p = 1, the proposed discrepancy equals 1-Wasserstein distance. Theoretical analysis and empirical evidence show that high-dimensional critic output has its advantage on distinguishing real and fake distributions, and benefits faster convergence and diversity of results.
    Virtual stain transfer in histology via cascaded deep neural networks. (arXiv:2207.06578v1 [physics.med-ph])
    Pathological diagnosis relies on the visual inspection of histologically stained thin tissue specimens, where different types of stains are applied to bring contrast to and highlight various desired histological features. However, the destructive histochemical staining procedures are usually irreversible, making it very difficult to obtain multiple stains on the same tissue section. Here, we demonstrate a virtual stain transfer framework via a cascaded deep neural network (C-DNN) to digitally transform hematoxylin and eosin (H&E) stained tissue images into other types of histological stains. Unlike a single neural network structure which only takes one stain type as input to digitally output images of another stain type, C-DNN first uses virtual staining to transform autofluorescence microscopy images into H&E and then performs stain transfer from H&E to the domain of the other stain in a cascaded manner. This cascaded structure in the training phase allows the model to directly exploit histochemically stained image data on both H&E and the target special stain of interest. This advantage alleviates the challenge of paired data acquisition and improves the image quality and color accuracy of the virtual stain transfer from H&E to another stain. We validated the superior performance of this C-DNN approach using kidney needle core biopsy tissue sections and successfully transferred the H&E-stained tissue images into virtual PAS (periodic acid-Schiff) stain. This method provides high-quality virtual images of special stains using existing, histochemically stained slides and creates new opportunities in digital pathology by performing highly accurate stain-to-stain transformations.
    Graph Neural Network Bandits. (arXiv:2207.06456v1 [cs.LG])
    We consider the bandit optimization problem with the reward function defined over graph-structured data. This problem has important applications in molecule design and drug discovery, where the reward is naturally invariant to graph permutations. The key challenges in this setting are scaling to large domains, and to graphs with many nodes. We resolve these challenges by embedding the permutation invariance into our model. In particular, we show that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the Reproducing Kernel Hilbert Space of a permutation-invariant additive kernel. By establishing a novel connection between such kernels and the graph neural tangent kernel (GNTK), we introduce the first GNN confidence bound and use it to design a phased-elimination algorithm with sublinear regret. Our regret bound depends on the GNTK's maximum information gain, which we also provide a bound for. While the reward function depends on all $N$ node features, our guarantees are independent of the number of graph nodes $N$. Empirically, our approach exhibits competitive performance and scales well on graph-structured domains.
    Leakage and the Reproducibility Crisis in ML-based Science. (arXiv:2207.07048v1 [cs.LG])
    The use of machine learning (ML) methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.
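    The most textbook of the leakage types surveyed above is preprocessing that sees the test set before the train/test split. A minimal sklearn illustration (synthetic data; scores here are meaningless by design, only the pipeline structure matters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)

# Leaky variant: the scaler is fit on ALL rows before the split, so
# test-set statistics leak into training.
X_leaky = StandardScaler().fit_transform(X)
Xtr, Xte, ytr, yte = train_test_split(X_leaky, y, random_state=0)
leaky_score = LogisticRegression().fit(Xtr, ytr).score(Xte, yte)

# Correct variant: every fitted step lives inside a pipeline, so the
# scaler only ever sees the training fold.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clean_score = make_pipeline(StandardScaler(), LogisticRegression()).fit(Xtr, ytr).score(Xte, yte)
```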
    Temporal Action Detection with Global Segmentation Mask Learning. (arXiv:2207.06580v1 [cs.CV])
    Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~ 20x faster to train and ~1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS .
    Antibody-Antigen Docking and Design via Hierarchical Equivariant Refinement. (arXiv:2207.06616v1 [q-bio.BM])
    Computational antibody design seeks to automatically create an antibody that binds to an antigen. The binding affinity is governed by the 3D binding interface where antibody residues (paratope) closely interact with antigen residues (epitope). Thus, predicting 3D paratope-epitope complex (docking) is the key to finding the best paratope. In this paper, we propose a new model called Hierarchical Equivariant Refinement Network (HERN) for paratope docking and design. During docking, HERN employs a hierarchical message passing network to predict atomic forces and use them to refine a binding complex in an iterative, equivariant manner. During generation, its autoregressive decoder progressively docks generated paratopes and builds a geometric representation of the binding interface to guide the next residue choice. Our results show that HERN significantly outperforms prior state-of-the-art on paratope docking and design benchmarks.
    Subgraph Frequency Distribution Estimation using Graph Neural Networks. (arXiv:2207.06684v1 [cs.LG])
    Small subgraphs (graphlets) are important features to describe fundamental units of a large network. The calculation of the subgraph frequency distributions has a wide application in multiple domains including biology and engineering. Unfortunately, due to the inherent complexity of this task, most of the existing methods are computationally intensive and inefficient. In this work, we propose GNNS, a novel representational learning framework that utilizes graph neural networks to sample subgraphs efficiently for estimating their frequency distribution. Our framework includes an inference model and a generative model that learns hierarchical embeddings of nodes, subgraphs, and graph types. With the learned model and embeddings, subgraphs are sampled in a highly scalable and parallel way and the frequency distribution estimation is then performed based on these sampled subgraphs. Our method achieves comparable accuracy and a significant speedup of three orders of magnitude over existing methods.
    In-memory Realization of In-situ Few-shot Continual Learning with a Dynamically Evolving Explicit Memory. (arXiv:2207.06810v1 [cs.LG])
    Continually learning new classes from a few training examples without forgetting previous old classes demands a flexible architecture with an inevitably growing portion of storage, in which new examples and classes can be incrementally stored and efficiently retrieved. One viable architectural solution is to tightly couple a stationary deep neural network to a dynamically evolving explicit memory (EM). As the centerpiece of this architecture, we propose an EM unit that leverages energy-efficient in-memory compute (IMC) cores during the course of continual learning operations. We demonstrate for the first time how the EM unit can physically superpose multiple training examples, expand to accommodate unseen classes, and perform similarity search during inference, using operations on an IMC core based on phase-change memory (PCM). Specifically, the physical superposition of a few encoded training examples is realized via in-situ progressive crystallization of PCM devices. The classification accuracy achieved on the IMC core remains within a range of 1.28%--2.5% compared to that of the state-of-the-art full-precision baseline software model on both the CIFAR-100 and miniImageNet datasets when continually learning 40 novel classes (from only five examples per class) on top of 60 old classes.
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v1 [stat.ML])
    Clustering is an unsupervised machine learning methodology in which unlabeled elements/objects are grouped together with the aim of constructing well-established clusters whose elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. Dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies, accompanied by useful presentations concerning suitable parameter selection and initializations. Simultaneously, this article not only represents a review highlighting the major elements of the examined clustering techniques but also compares these algorithms' clustering efficiency on 3 datasets, revealing their existing weaknesses and capabilities through accuracy and complexity, during the confrontation of discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.
    Proceedings of the ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts. (arXiv:2207.06958v1 [cs.SD])
    This is the Proceedings of the ICML Expressive Vocalization (ExVo) Competition. The ExVo competition focuses on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 included three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts.
    A Novel Implementation of Machine Learning for the Efficient, Explainable Diagnosis of COVID-19 from Chest CT. (arXiv:2207.07117v1 [eess.IV])
    In a worldwide health crisis as exigent as COVID-19, there has become a pressing need for rapid, reliable diagnostics. Currently, popular testing methods such as reverse transcription polymerase chain reaction (RT-PCR) can have high false negative rates. Consequently, COVID-19 patients are not accurately identified nor treated quickly enough to prevent transmission of the virus. However, the recent rise of medical CT data has presented promising avenues, since CT manifestations contain key characteristics indicative of COVID-19. This study aimed to take a novel approach in the machine learning-based detection of COVID-19 from chest CT scans. First, the dataset utilized in this study was derived from three major sources, comprising a total of 17,698 chest CT slices across 923 patient cases. Image preprocessing algorithms were then developed to reduce noise by excluding irrelevant features. Transfer learning was also implemented with the EfficientNetB7 pre-trained model to provide a backbone architecture and save computational resources. Lastly, several explainability techniques were leveraged to qualitatively validate model performance by localizing infected regions and highlighting fine-grained pixel details. The proposed model attained an overall accuracy of 0.927 and a sensitivity of 0.958. Explainability measures showed that the model correctly distinguished between relevant, critical features pertaining to COVID-19 chest CT images and normal controls. Deep learning frameworks provide efficient, human-interpretable COVID-19 diagnostics that could complement radiologist decisions or serve as an alternative screening tool. Future endeavors may provide insight into infection severity, patient risk stratification, and prognosis.
    Robot Program Parameter Inference via Differentiable Shadow Program Inversion. (arXiv:2103.14452v2 [cs.RO] UPDATED)
    Challenging manipulation tasks can be solved effectively by combining individual robot skills, which must be parameterized for the concrete physical environment and task at hand. This is time-consuming and difficult for human programmers, particularly for force-controlled skills. To this end, we present Shadow Program Inversion (SPI), a novel approach to infer optimal skill parameters directly from data. SPI leverages unsupervised learning to train an auxiliary differentiable program representation ("shadow program") and realizes parameter inference via gradient-based model inversion. Our method enables the use of efficient first-order optimizers to infer optimal parameters for originally non-differentiable skills, including many skill variants currently used in production. SPI zero-shot generalizes across task objectives, meaning that shadow programs do not need to be retrained to infer parameters for different task variants. We evaluate our methods on three different robots and skill frameworks in industrial and household scenarios. Code and examples are available at https://innolab.artiminds.com/icra2021.
    AutoML-Based Drought Forecast with Meteorological Variables. (arXiv:2207.07012v1 [cs.LG])
    A precise forecast for droughts is of considerable value to scientific research, agriculture, and water resource management. With emerging developments of data-driven approaches for hydro-climate modeling, this paper investigates an AutoML-based framework to forecast droughts in the U.S. Compared with commonly-used temporal deep learning models, the AutoML model can achieve comparable performance with less training data and time. As deep learning models are becoming popular for Earth system modeling, this paper aims to bring more efforts to AutoML-based methods, and the use of them as benchmark baselines for more complex deep learning models.
    The Free Energy Principle for Perception and Action: A Deep Learning Perspective. (arXiv:2207.06415v1 [cs.LG])
    The free energy principle, and its corollary active inference, constitute a bio-inspired theory that assumes biological agents act to remain in a restricted set of preferred states of the world, i.e., they minimize their free energy. Under this principle, biological agents learn a generative model of the world and plan actions in the future that will maintain the agent in a homeostatic state that satisfies its preferences. This framework lends itself to being realized in silico, as it comprehends important aspects that make it computationally affordable, such as variational inference and amortized planning. In this work, we investigate how deep learning can be used to design and realize artificial agents based on active inference, presenting a deep-learning-oriented account of the free energy principle, surveying works that are relevant in both the machine learning and active inference areas, and discussing the design choices involved in the implementation process. This manuscript probes newer perspectives for the active inference framework, grounding its theoretical aspects into more pragmatic affairs, offering a practical guide to active inference newcomers and a starting point for deep learning practitioners who would like to investigate implementations of the free energy principle.
    Pose-based Tremor Classification for Parkinson's Disease Diagnosis from Video. (arXiv:2207.06828v1 [cs.CV])
    Parkinson's disease (PD) is a progressive neurodegenerative disorder that results in a variety of motor dysfunction symptoms, including tremors, bradykinesia, rigidity and postural instability. The diagnosis of PD mainly relies on clinical experience rather than a definite medical test, and the diagnostic accuracy is only about 73-84% since it is challenged by the subjective opinions or experiences of different medical experts. Therefore, an efficient and interpretable automatic PD diagnosis system is valuable for supporting clinicians with more robust diagnostic decision-making. To this end, we propose to classify Parkinson's tremor since it is one of the most predominant symptoms of PD with strong generalizability. Different from other time- and resource-consuming computer-aided Parkinson's Tremor (PT) classification systems that rely on wearable sensors, we propose SPAPNet, which only requires consumer-grade non-intrusive video recording of camera-facing human movements as input to provide undiagnosed patients with low-cost PT classification results as a PD warning sign. For the first time, we propose to use a novel attention module with a lightweight pyramidal channel-squeezing-fusion architecture to extract relevant PT information and filter the noise efficiently. This design aids in improving both classification performance and system interpretability. Experimental results show that our system outperforms the state-of-the-art by achieving a balanced accuracy of 90.9% and an F1-score of 90.6% in classifying PT with the non-PT class.
    Spatiotemporal Propagation Learning for Network-Wide Flight Delay Prediction. (arXiv:2207.06959v1 [cs.LG])
    Demystifying the delay propagation mechanisms among multiple airports is fundamental to precise and interpretable delay prediction, which is crucial during decision-making for all aviation industry stakeholders. The principal challenge lies in effectively leveraging the spatiotemporal dependencies and exogenous factors related to the delay propagation. However, previous works only consider limited spatiotemporal patterns with few factors. To promote more comprehensive propagation modeling for delay prediction, we propose SpatioTemporal Propagation Network (STPN), a space-time separable graph convolutional network, which is novel in spatiotemporal dependency capturing. From the aspect of spatial relation modeling, we propose a multi-graph convolution model considering both geographic proximity and airline schedule. From the aspect of temporal dependency capturing, we propose a multi-head self-attentional mechanism that can be learned end-to-end and explicitly reason multiple kinds of temporal dependency of delay time series. We show that the joint spatial and temporal learning models yield a sum of the Kronecker product, which factors the spatiotemporal dependence into the sum of several spatial and temporal adjacency matrices. By this means, STPN allows cross-talk of spatial and temporal factors for modeling delay propagation. Furthermore, a squeeze and excitation module is added to each layer of STPN to boost meaningful spatiotemporal features. To this end, we apply STPN to multi-step ahead arrival and departure delay prediction in large-scale airport networks. To validate the effectiveness of our model, we experiment with two real-world delay datasets, including U.S. and China flight delays, and we show that STPN outperforms state-of-the-art methods. In addition, counterfactuals produced by STPN show that it learns explainable delay propagation patterns.
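    In symbols (our reading of the factorization claim above, not notation taken from the paper): with $A_s^{(k)}$ a spatial and $A_t^{(k)}$ a temporal adjacency matrix, the space-time separable convolution couples them as

    $$A_{st} = \sum_{k=1}^{K} A_s^{(k)} \otimes A_t^{(k)},$$

    so each Kronecker term pairs one spatial propagation pattern with one temporal pattern, and the sum lets several such pairings act together.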
    Language Modelling with Pixels. (arXiv:2207.06991v1 [cs.CL])
    Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches, instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
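    The first step of the idea is simply rasterizing text. A toy sketch of that step (ours, not the authors' rendering pipeline; PIL's default bitmap font only covers basic Latin, whereas PIXEL uses proper text rendering for arbitrary scripts):

```python
from PIL import Image, ImageDraw, ImageFont

def render_text(text, height=16, width=256):
    """Rasterize a string into a grayscale image strip."""
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((1, 2), text, fill=0, font=ImageFont.load_default())
    return img

img = render_text("Language modelling with pixels works for any script.")
# Slice the strip into fixed-size patches, as a vision transformer would.
patches = [img.crop((x, 0, x + 16, 16)) for x in range(0, img.width, 16)]
# A PIXEL-style model masks some of these patches and learns to
# reconstruct their pixels, instead of predicting token IDs.
```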
    DRIBO: Robust Deep Reinforcement Learning via Multi-View Information Bottleneck. (arXiv:2102.13268v4 [cs.AI] UPDATED)
    Deep reinforcement learning (DRL) agents are often sensitive to visual changes that were unseen in their training environments. To address this problem, we leverage the sequential nature of RL to learn robust representations that encode only task-relevant information from observations based on the unsupervised multi-view setting. Specifically, we introduce a novel contrastive version of the Multi-View Information Bottleneck (MIB) objective for temporal data. We train RL agents from pixels with this auxiliary objective to learn robust representations that can compress away task-irrelevant information and are predictive of task-relevant dynamics. This approach enables us to train high-performance policies that are robust to visual distractions and can generalize well to unseen environments. We demonstrate that our approach can achieve SOTA performance on a diverse set of visual control tasks in the DeepMind Control Suite when the background is replaced with natural videos. In addition, we show that our approach outperforms well-established baselines for generalization to unseen environments on the Procgen benchmark. Our code is open-sourced and available at https://github.com/BU-DEPEND-Lab/DRIBO.
    RobustAnalog: Fast Variation-Aware Analog Circuit Design Via Multi-task RL. (arXiv:2207.06412v1 [cs.ET])
    Analog/mixed-signal circuit design is one of the most complex and time-consuming stages in the whole chip design process. Due to various process, voltage, and temperature (PVT) variations from chip manufacturing, analog circuits inevitably suffer from performance degradation. Although there has been plenty of work on automating analog circuit design under the typical condition, limited research has been done on exploring robust designs under real and unpredictable silicon variations. Automatic analog design against variations requires prohibitive computation and time costs. To address the challenge, we present RobustAnalog, a robust circuit design framework that involves the variation information in the optimization process. Specifically, circuit optimizations under different variations are considered as a set of tasks. Similarities among tasks are leveraged and competitions are alleviated to realize a sample-efficient multi-task training. Moreover, RobustAnalog prunes the task space according to the current performance in each iteration, leading to a further simulation cost reduction. In this way, RobustAnalog can rapidly produce a set of circuit parameters that satisfies diverse constraints (e.g. gain, bandwidth, noise...) across variations. We compare RobustAnalog with Bayesian optimization, Evolutionary algorithm, and Deep Deterministic Policy Gradient (DDPG) and demonstrate that RobustAnalog can significantly reduce required optimization time by 14-30 times. Therefore, our study provides a feasible method to handle various real silicon conditions.
    A Robustly Optimized Long Text to Math Models for Numerical Reasoning On FinQA. (arXiv:2207.06490v1 [cs.CL])
    Numerical reasoning is required when solving most problems in our lives, but it has been neglected in previous artificial intelligence research. The FinQA challenge has been organized to strengthen the study of numerical reasoning, where participants are asked to predict the numerical reasoning program that solves a financial question. The results of FinQA are evaluated by both execution accuracy and program accuracy. In this paper, we present our approach to tackle the task objective by developing models with different specialized capabilities and fusing their strengths. Overall, our approach achieves 1st place in the FinQA challenge, with 71.93% execution accuracy and 67.03% program accuracy.
    Deep Learning Discovery of Demographic Biomarkers in Echocardiography. (arXiv:2207.06421v1 [cs.LG])
    Deep learning has been shown to accurately assess 'hidden' phenotypes and predict biomarkers from medical imaging beyond traditional clinician interpretation of medical imaging. Given the black box nature of artificial intelligence (AI) models, caution should be exercised in applying models to healthcare as prediction tasks might be short-cut by differences in demographics across disease and patient populations. Using large echocardiography datasets from two healthcare systems, we test whether it is possible to predict age, race, and sex from cardiac ultrasound images using deep learning algorithms and assess the impact of varying confounding variables. We trained video-based convolutional neural networks to predict age, sex, and race. We found that deep learning models were able to identify age and sex, while unable to reliably predict race. Without considering confounding differences between categories, the AI model predicted sex with an AUC of 0.85 (95% CI 0.84 - 0.86), age with a mean absolute error of 9.12 years (95% CI 9.00 - 9.25), and race with AUCs ranging from 0.63 - 0.71. When predicting race, we show that tuning the proportion of a confounding variable (sex) in the training data significantly impacts model AUC (ranging from 0.57 to 0.84), while in training a sex prediction model, tuning a confounder (race) did not substantially change AUC (0.81 - 0.83). This suggests a significant proportion of the model's performance on predicting race could come from confounding features being detected by AI. Further work remains to identify the particular imaging features that associate with demographic information and to better understand the risks of demographic identification in medical AI as it pertains to potentially perpetuating bias and disparities.
    In Defense of Core-set: A Density-aware Core-set Selection for Active Learning. (arXiv:2206.04838v3 [cs.LG] UPDATED)
    Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. The core-set approach is a promising diversity-based method that selects diverse samples based on the distance between samples. However, the approach performs poorly compared to uncertainty-based approaches that select the most difficult samples on which neural models reveal low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to have more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density-awareness and propose a density-aware core-set (DACS). The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottleneck of estimating the density, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS in both classification and regression tasks and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS is weakly dependent on neural architectures, we present a simple yet effective combination method to show that the existing methods can be beneficially combined with DACS.
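    A simplified sketch of the selection strategy (ours; it substitutes a plain k-nearest-neighbour density estimate for the paper's locality-sensitive-hashing approximation, and all names are illustrative): density is low where the mean distance to the k nearest neighbours is large, and those sparse samples are queried first.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_aware_selection(features, budget, k=10):
    """Pick `budget` unlabeled samples, preferring locally sparse regions."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dists, _ = nn.kneighbors(features)          # column 0 is the point itself
    sparsity = dists[:, 1:].mean(axis=1)        # large value = sparse region
    return np.argsort(-sparsity)[:budget]       # indices of samples to label

X = np.random.randn(1000, 32)                   # stand-in for model features
picked = density_aware_selection(X, budget=50)
```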
    Volatility Based Kernels and Moving Average Means for Accurate Forecasting with Gaussian Processes. (arXiv:2207.06544v1 [cs.LG])
    A broad class of stochastic volatility models are defined by systems of stochastic differential equations. While these models have seen widespread success in domains such as finance and statistical climatology, they typically lack an ability to condition on historical data to produce a true posterior distribution. To address this fundamental limitation, we show how to re-cast a class of stochastic volatility models as a hierarchical Gaussian process (GP) model with specialized covariance functions. This GP model retains the inductive biases of the stochastic volatility model while providing the posterior predictive distribution given by GP inference. Within this framework, we take inspiration from well studied domains to introduce a new class of models, Volt and Magpie, that significantly outperform baselines in stock and wind speed forecasting, and naturally extend to the multitask setting.
    Online Bayesian Meta-Learning for Cognitive Tracking Radar. (arXiv:2207.06917v1 [cs.IT])
    A key component of cognitive radar is the ability to generalize, or achieve consistent performance across a broad class of sensing environments, since aspects of the physical scene may vary over time. This presents a challenge for learning-based waveform selection approaches, since transmission policies which are effective in one scene may be highly suboptimal in another. One way to address this problem is to bias a learning algorithm strategically by exploiting high-level structure across tracking instances, referred to as meta-learning. In this work, we develop an online meta-learning approach for waveform-agile tracking. This approach uses information gained from previous target tracks to speed up and enhance learning in new tracking instances. This results in sample-efficient learning across a class of finite state target channels by exploiting inherent similarity across tracking scenes, attributed to common physical elements such as target type or clutter. We formulate the online waveform selection problem in the framework of Bayesian learning, and provide prior-dependent performance bounds for the meta-learning problem using PAC-Bayes theory. We present a computationally feasible posterior sampling algorithm and study the performance in a simulation study consisting of diverse scenes. Finally, we examine the potential performance benefits and practical challenges associated with online meta-learning for waveform-agile tracking.
    Improving Meta-learning for Low-resource Text Classification and Generation via Memory Imitation. (arXiv:2203.11670v2 [cs.CL] UPDATED)
    Building models of natural language processing (NLP) is challenging in low-resource scenarios where only limited data are available. Optimization-based meta-learning algorithms achieve promising results in low-resource scenarios by adapting a well-generalized model initialization to handle new tasks. Nonetheless, these approaches suffer from the memorization overfitting issue, where the model tends to memorize the meta-training tasks while ignoring support sets when adapting to new tasks. To address this issue, we propose a memory imitation meta-learning (MemIML) method that enhances the model's reliance on support sets for task adaptation. Specifically, we introduce a task-specific memory module to store support set information and construct an imitation module to force query sets to imitate the behaviors of some representative support-set samples stored in the memory. A theoretical analysis is provided to prove the effectiveness of our method, and empirical results also demonstrate that our method outperforms competitive baselines on both text classification and generation tasks.
    Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey. (arXiv:2207.07068v1 [cs.LG])
    This paper provides a comprehensive survey of bias mitigation methods for achieving fairness in Machine Learning (ML) models. We collect a total of 234 publications concerning bias mitigation for ML classifiers. These methods can be distinguished based on their intervention procedure (i.e., pre-processing, in-processing, post-processing) and the technology they apply. We investigate how existing bias mitigation methods are evaluated in the literature, considering in particular datasets, metrics, and benchmarking, and gathering insights such as which fairness metrics are most popular and how many datasets are used to evaluate bias mitigation methods. We hope to support practitioners in making informed choices when developing and evaluating new bias mitigation methods.
    Self-Play PSRO: Toward Optimal Populations in Two-Player Zero-Sum Games. (arXiv:2207.06541v1 [cs.GT])
    In competitive two-agent environments, deep reinforcement learning (RL) methods based on the \emph{Double Oracle (DO)} algorithm, such as \emph{Policy Space Response Oracles (PSRO)} and \emph{Anytime PSRO (APSRO)}, iteratively add RL best response policies to a population. Eventually, an optimal mixture of these population policies will approximate a Nash equilibrium. However, these methods might need to add all deterministic policies before converging. In this work, we introduce \emph{Self-Play PSRO (SP-PSRO)}, a method that adds an approximately optimal stochastic policy to the population in each iteration. Instead of adding only deterministic best responses to the opponent's least exploitable population mixture, SP-PSRO also learns an approximately optimal stochastic policy and adds it to the population as well. As a result, SP-PSRO empirically tends to converge much faster than APSRO and in many games converges in just a few iterations.
    Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets. (arXiv:2207.06920v1 [cs.SD])
    We propose a novel 2-stage sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model. For the 1st-stage, we adapt a recently proposed quantization technique using a non-linear transformation with tanh(.) on dense layer weights. In the 2nd-stage, we use linear quantization methods on the rest of the network, including other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data (evaluating on 4,000 hours of data). We organize our results in two embedded chipset settings: a) with commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, 8-bit) and 8-bit quantization of rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), while presenting accuracy results, we project reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating point model's operating point on a detection error tradeoff (DET) curve in terms of false detection rate (FDR) at false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3 times improvement in CPU consumption and more than 4 times improvement in memory consumption.
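    A minimal sketch of the first-stage flavour described above (ours, under the assumption that "non-linear transformation with tanh" means squashing the dense-layer weights before uniform quantization; the paper's actual scheme is trained with quantization awareness rather than applied post hoc):

```python
import torch

def tanh_quantize(w, bits=4):
    """Squash weights with tanh, normalize into [-1, 1], then uniformly
    quantize to `bits` bits (2**(bits-1) - 1 positive levels)."""
    t = torch.tanh(w)
    t = t / t.abs().max()
    levels = 2 ** (bits - 1) - 1
    return torch.round(t * levels) / levels

w = torch.randn(250, 100)          # stand-in for a dense layer's weights
w_q = tanh_quantize(w, bits=4)
print(w_q.unique().numel(), "distinct weight values")
```

    The tanh squashing concentrates resolution around small weight magnitudes, where most trained weights live, which is one reason a non-linear transform can outperform plain uniform quantization at very low bit widths.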
    Hypergraphon Mean Field Games. (arXiv:2203.16223v2 [cs.GT] UPDATED)
    We propose an approach to modelling large-scale multi-agent dynamical systems allowing interactions among more than just pairs of agents using the theory of mean-field games and the notion of hypergraphons, which are obtained as limits of large hypergraphs. To the best of our knowledge, ours is the first work on mean field games on hypergraphs. Together with an extension to a multi-layer setup, we obtain limiting descriptions for large systems of non-linear, weakly-interacting dynamical agents. On the theoretical side, we prove the well-foundedness of the resulting hypergraphon mean field game, showing both existence and approximate Nash properties. On the applied side, we extend numerical and learning algorithms to compute the hypergraphon mean field equilibria. To verify our approach empirically, we consider an epidemic control problem and a social rumor spreading model, where we give agents intrinsic motivation to spread rumors to unaware agents.
    Dynamically handling task disruptions by composing together behavior modules. (arXiv:2207.06482v1 [cs.LG])
    Biological neural networks operate in the presence of task disruptions as they guide organisms toward goals. A familiar stream of stimulus-response causations can be disrupted by subtask streams imposed by the environment. For example, taking a familiar path to a foraging area might be disrupted by the presence of a predator, necessitating a "detour" to the area. The detour can be a known alternative path that must be dynamically composed with the original path to accomplish the overall task. In this project, overarching base paths are disrupted by independently learned path modules in the form of insertion, substitution, and deletion modifications to the base paths such that the resulting modified paths are novel to the network. The network's performance is then tested on these paths that have been learned in piecemeal fashion. In sum, the network must compose a new task on the fly. Several network architectures are tested: Time delay neural network (TDNN), Long short-term memory (LSTM), Temporal convolutional network (TCN), and Morphognosis, a hierarchical neural network. LSTM and Morphognosis perform significantly better for this task.
    Speech-enhanced and Noise-aware Networks for Robust Speech Recognition. (arXiv:2203.13696v2 [cs.SD] UPDATED)
    Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.
    Deep Unlearning via Randomized Conditionally Independent Hessians. (arXiv:2204.07655v2 [cs.CV] UPDATED)
    Recent legislation has led to interest in machine unlearning, i.e., removing specific training samples from a predictive model as if they never existed in the training dataset. Unlearning may also be required due to corrupted/adversarial data or simply a user's updated privacy requirement. For models which require no training (k-NN), simply deleting the closest original sample can be effective. But this idea is inapplicable to models which learn richer representations. Recent ideas leveraging optimization-based updates scale poorly with the model dimension d, due to inverting the Hessian of the loss function. We use a variant of a new conditional independence coefficient, L-CODEC, to identify a subset of the model parameters with the most semantic overlap on an individual sample level. Our approach completely avoids the need to invert a (possibly) huge matrix. By utilizing a Markov blanket selection, we premise that L-CODEC is also suitable for deep unlearning, as well as other applications in vision. Compared to alternatives, L-CODEC makes approximate unlearning possible in settings that would otherwise be infeasible, including vision models used for face recognition, person re-identification and NLP models that may require unlearning samples identified for exclusion. Code can be found at https://github.com/vsingh-group/LCODEC-deep-unlearning/
    Protein 3D structure-based neural networks highly improve the accuracy in compound-protein binding affinity prediction. (arXiv:2204.12586v2 [q-bio.BM] UPDATED)
    Theoretically, the accuracy of computational models in compound-protein binding affinities (CPAs) could be improved by the introduction of protein 3D structure information. However, most of these models still suffer from low accuracy due to the lack of an efficient approach to encode informative protein features. The major challenge is how to combine the multi-modal information such as the residue sequence of the protein, residue atom coordinates and the torsion angles. To tackle this problem, we develop Fast Evolutional Attention and Thoroughgoing-graph Neural Networks (FeatNN) to facilitate the application of protein 3D structure information for predicting CPAs. Specifically, we established a novel end-to-end architecture to jointly embed torsion matrix, discrete distance matrix, and sequence information of protein and extract compound features with deep graph convolution layers. In addition, a new pairwise mapping attention mechanism is introduced to comprehensively learn potential interaction information between proteins and compounds. FeatNN considerably outperforms various state-of-the-art baselines in CPA prediction with the R2 coefficient elevated by about 21.33%. Thus, FeatNN provides an outstanding method for highly accurate CPA prediction and facilitates high-throughput virtual screening of drug candidates.
    Improved Binary Forward Exploration: Learning Rate Scheduling Method for Stochastic Optimization. (arXiv:2207.04198v2 [cs.LG] UPDATED)
    A new gradient-based optimization approach that automatically schedules the learning rate has been proposed recently, called Binary Forward Exploration (BFE); an adaptive version of BFE has also been discussed. In this paper, improved algorithms based on them are investigated, in order to optimize the efficiency and robustness of the new methodology. This improved approach provides a new perspective on scheduling the update of the learning rate and is compared with stochastic gradient descent (SGD) with momentum or Nesterov momentum and the most successful adaptive learning rate algorithms, e.g. Adam. The goal of this method is not to beat others but to provide a different viewpoint on optimizing the gradient descent process. This approach combines the advantages of first-order and second-order optimization in the aspects of speed and efficiency.
    Interference-Limited Ultra-Reliable and Low-Latency Communications: Graph Neural Networks or Stochastic Geometry?. (arXiv:2207.06918v1 [eess.SP])
    In this paper, we aim to improve the Quality-of-Service (QoS) of Ultra-Reliable and Low-Latency Communications (URLLC) in interference-limited wireless networks. To obtain time diversity within the channel coherence time, we first put forward a random repetition scheme that randomizes the interference power. Then, we optimize the number of reserved slots and the number of repetitions for each packet to minimize the QoS violation probability, defined as the percentage of users that cannot achieve URLLC. We build a cascaded Random Edge Graph Neural Network (REGNN) to represent the repetition scheme and develop a model-free unsupervised learning method to train it. We analyze the QoS violation probability using stochastic geometry in a symmetric scenario and apply a model-based Exhaustive Search (ES) method to find the optimal solution. Simulation results show that in the symmetric scenario, the QoS violation probabilities achieved by the model-free learning method and the model-based ES method are nearly the same. In more general scenarios, the cascaded REGNN generalizes very well in wireless networks with different scales, network topologies, cell densities, and frequency reuse factors. It outperforms the model-based ES method in the presence of model mismatch.
    Improving self-supervised pretraining models for epileptic seizure detection from EEG data. (arXiv:2207.06911v1 [eess.SP])
    There is abundant medical data on the internet, most of which is unlabeled. Traditional supervised learning algorithms are often limited by the amount of labeled data, especially in the medical domain, where labeling is costly in terms of human processing and the specialized experts needed. Labels are also prone to human error and bias, as a select few expert annotators produce them. These issues are mitigated by self-supervision, where we generate pseudo-labels from unlabelled data by seeing the data itself. This paper presents various self-supervision strategies to enhance the performance of a time-series based Diffusion Convolutional Recurrent Neural Network (DCRNN) model, an RNN with graph diffusion convolutions, which models the spatiotemporal dependencies present in EEG signals. The learned weights from the self-supervision pretraining phase can be transferred to the supervised training phase to boost the model's prediction capability. When the learned weights from the pretraining stage are transferred to a DCRNN model to determine whether an EEG time window has a characteristic seizure signal associated with it, our method yields an AUROC score $1.56\%$ higher than the current state-of-the-art models on the TUH EEG seizure corpus.
    Soil Erosion in the United States. Present and Future (2020-2050). (arXiv:2207.06579v1 [physics.ao-ph])
    Soil erosion is a significant threat to the environment and long-term land management around the world. Accelerated soil erosion by human activities inflicts extreme changes in terrestrial and aquatic ecosystems, which is not fully surveyed/predicted for the present and probable future at field-scales (30-m). Here, we estimate/predict soil erosion rates by water erosion (sheet and rill erosion), using three alternative (2.6, 4.5, and 8.5) Shared Socioeconomic Pathway and Representative Concentration Pathway (SSP-RCP) scenarios across the contiguous United States. Field Scale Soil Erosion Model (FSSLM) estimations rely on a high resolution (30-m) G2 erosion model integrated by satellite- and imagery-based estimations of land use and land cover (LULC), gauge observations of long-term precipitation, and scenarios of the Coupled Model Intercomparison Project Phase 6 (CMIP6). The baseline model (2020) estimates soil erosion rates of 2.32 Mg ha$^{-1}$ yr$^{-1}$ with current agricultural conservation practices (CPs). Future scenarios with current CPs indicate an increase of between 8% and 21% under different combinations of SSP-RCP scenarios of climate and LULC changes. The soil erosion forecast for 2050 suggests that all the climate and LULC scenarios indicate either an increase in extreme events or a change in the spatial location of extremes, largely from the southern to the eastern and northeastern regions of the United States.
    Strongly Augmented Contrastive Clustering. (arXiv:2206.00380v2 [cs.LG] UPDATED)
    Deep clustering has attracted increasing attention in recent years due to its capability of joint representation learning and clustering via deep neural networks. In its latest developments, contrastive learning has emerged as an effective technique to substantially enhance deep clustering performance. However, existing contrastive learning based deep clustering algorithms mostly focus on carefully-designed augmentations (often with limited transformations to preserve the structure), referred to as weak augmentations, and cannot go beyond them to explore the greater opportunities offered by stronger augmentations (with more aggressive transformations or even severe distortions). In this paper, we present an end-to-end deep clustering approach termed Strongly Augmented Contrastive Clustering (SACC), which extends the conventional two-augmentation-view paradigm to multiple views and jointly leverages strong and weak augmentations for strengthened deep clustering. Particularly, we utilize a backbone network with triply-shared weights, where a strongly augmented view and two weakly augmented views are incorporated. Based on the representations produced by the backbone, the weak-weak view pair and the strong-weak view pairs are simultaneously exploited for instance-level contrastive learning (via an instance projector) and cluster-level contrastive learning (via a cluster projector), which, together with the backbone, can be jointly optimized in a purely unsupervised manner. Experimental results on five challenging image datasets show the superiority of our SACC approach over the state-of-the-art. The code is available at https://github.com/dengxiaozhi/SACC.
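    As a rough illustration of the strong/weak multi-view idea, the instance-level objective can be assembled from pairwise InfoNCE terms. The PyTorch sketch below is a minimal reconstruction, not the authors' implementation; `backbone` and `projector` are placeholder modules.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.5):
    """NT-Xent-style instance-level contrastive loss between two views."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                  # (N, N) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

def sacc_instance_loss(backbone, projector, x_strong, x_weak1, x_weak2):
    # One strongly and two weakly augmented views share the same backbone.
    h_s, h_w1, h_w2 = (projector(backbone(x)) for x in (x_strong, x_weak1, x_weak2))
    # Weak-weak pair plus the two strong-weak pairs, as described in the abstract.
    return info_nce(h_w1, h_w2) + info_nce(h_s, h_w1) + info_nce(h_s, h_w2)
```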
    T-RECX: Tiny-Resource Efficient Convolutional Neural Networks with Early-Exit. (arXiv:2207.06613v1 [cs.LG])
    Deploying machine learning (ML) on milliwatt-scale edge devices (tinyML) is gaining popularity due to recent breakthroughs in ML and IoT. However, the capabilities of tinyML are restricted by strict power and compute constraints. The majority of contemporary research in tinyML focuses on model compression techniques such as model pruning and quantization to fit ML models on low-end devices. Nevertheless, the improvements in energy consumption and inference time obtained by existing techniques are limited because aggressive compression quickly shrinks model capacity and accuracy. Another approach to improve inference time and/or reduce power while preserving model capacity is through early-exit networks. These networks place intermediate classifiers along a baseline neural network that facilitate early exit from neural network computation if an intermediate classifier exhibits sufficient confidence in its prediction. Previous work on early-exit networks has focused on large networks, beyond what would typically be used for tinyML applications. In this paper, we discuss the challenges of adding early-exits to state-of-the-art tiny-CNNs and devise an early-exit architecture, T-RECX, that addresses these challenges. In addition, we develop a method to alleviate the effect of network overthinking at the final exit by leveraging the high-level representations learned by the early-exit. We evaluate T-RECX on three CNNs from the MLPerf tiny benchmark suite for image classification, keyword spotting, and visual wake word detection tasks. Our results demonstrate that T-RECX improves the accuracy of the baseline network and significantly reduces the average inference time of tiny-CNNs. T-RECX achieves a 32.58% average reduction in FLOPS in exchange for 1% accuracy across all evaluated models. Also, our techniques increase the accuracy of the baseline network in two out of the three models we evaluate.
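    The early-exit mechanism itself is simple to sketch. Below is a generic, hypothetical PyTorch illustration of the idea, not T-RECX's actual architecture; per-sample batching at the exit decision is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitCNN(nn.Module):
    def __init__(self, num_classes=10, threshold=0.9):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.exit1 = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.final = nn.Linear(32, num_classes)
        self.threshold = threshold

    def forward(self, x):
        h = self.block1(x)
        early = self.exit1(h)                     # intermediate classifier
        if not self.training:
            conf = F.softmax(early, dim=1).max()  # confidence of the early exit
            if conf >= self.threshold:
                return early                      # skip the remaining layers
        return self.final(self.block2(h))
```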
    Attention mechanisms for physiological signal deep learning: which attention should we take?. (arXiv:2207.06904v1 [eess.SP])
    Attention mechanisms are widely used to dramatically improve deep learning model performance in various fields. However, their ability to improve the performance of physiological-signal deep learning models has not been systematically established. In this study, we experimentally analyze four attention mechanisms (squeeze-and-excitation, non-local, convolutional block attention module, and multi-head self-attention) and three convolutional neural network (CNN) architectures (VGG, ResNet, and Inception) for two representative physiological signal prediction tasks: classification for predicting hypotension and regression for predicting cardiac output (CO). We evaluated the performance and convergence of multiple combinations of attention mechanisms and architectures. The CNN models with the spatial attention mechanism showed the best performance on the classification problem, whereas the channel attention mechanism achieved the lowest error on the regression problem. Moreover, the performance and convergence of the CNN models with attention mechanisms were better than those of stand-alone self-attention models in both problems. Hence, we verified that convolutional operations and attention mechanisms are complementary and provide faster convergence, despite the stand-alone self-attention models requiring fewer parameters.
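    For reference, squeeze-and-excitation, one of the channel-attention variants evaluated, can be sketched in a few lines. This is a 1-D variant suited to physiological signals and a minimal sketch, not the study's exact configuration.

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Channel attention for (batch, channels, length) signal tensors."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = x.mean(dim=-1)            # squeeze: global average per channel
        w = self.fc(w).unsqueeze(-1)  # excite: per-channel gate in (0, 1)
        return x * w                  # reweight the input channels
```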
    A Bayesian Lasso based Sparse Learning Model. (arXiv:1908.07220v3 [stat.ML] UPDATED)
    The Bayesian Lasso is constructed in the linear regression framework and applies Gibbs sampling to estimate the regression parameters. This paper develops a new sparse learning model, named the Bayesian Lasso Sparse (BLS) model, that takes the hierarchical model formulation of the Bayesian Lasso. The main difference from the original Bayesian Lasso lies in the estimation procedure; the BLS method uses a learning algorithm based on the type-II maximum likelihood procedure. As opposed to the Bayesian Lasso, the BLS provides sparse estimates of the regression parameters. The BLS method is also derived for nonlinear supervised learning problems by introducing kernel functions. We compare the BLS model to the well-known Relevance Vector Machine, the Fast Laplace method, the Bayesian Lasso, and the Lasso, on both simulated and real data. The numerical results show that the BLS is sparse and precise, especially when dealing with noisy and irregular datasets.
    Insurgency as Complex Network: Image Co-Appearance and Hierarchy in the PKK. (arXiv:2207.06946v1 [cs.SI])
    Despite a growing recognition of the influence of insurgent group structure on conflict outcomes, there is very little empirical research on the subject. Though this problem is rooted in the inaccessibility of data on militant group structure, insurgents frequently publish large volumes of image data on the internet. In this paper, I develop a new methodology that leverages this abundant but underutilized source of data by automating the creation of a social network graph based on co-appearance in photographs using deep learning. Using a trove of 19,115 obituary images published online by the PKK, a Kurdish militant group in Turkey, I demonstrate that an individual's centrality in the resulting co-appearance network is closely correlated with their rank in the insurgent group.
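    A hypothetical sketch of how such a co-appearance graph could be assembled and scored, assuming per-photo identity sets have already been extracted by a face-recognition model; the toy data and the choice of eigenvector centrality are illustrative only.

```python
import itertools
import networkx as nx

photos = [                 # toy stand-in for per-photo identity sets
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "D"},
]

G = nx.Graph()
for people in photos:
    for u, v in itertools.combinations(sorted(people), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1     # count repeated co-appearances
        else:
            G.add_edge(u, v, weight=1)

# Centrality in the co-appearance network as a proxy for rank
print(nx.eigenvector_centrality(G, weight="weight"))
```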
    Detecting People Interested in Non-Suicidal Self-Injury on Social Media. (arXiv:2207.07014v1 [cs.SI])
    We propose a supervised learning approach to detect people interested in Non-Suicidal Self-Injury (NSSI). We treat the task as a binary classification problem and build classifiers based upon features extracted from people's self-declared interests. Experimental evaluation on a real-world dataset from the LiveJournal social blogging platform demonstrates the effectiveness of our proposed model.
    Combating Distribution Shift for Accurate Time Series Forecasting via Hypernetworks. (arXiv:2202.10808v2 [cs.LG] UPDATED)
    Time series forecasting has widespread applications in urban life, ranging from air quality monitoring to traffic analysis. However, accurate time series forecasting is challenging because real-world time series suffer from the distribution shift problem, where their statistical properties change over time. Despite extensive solutions to distribution shifts in domain adaptation or generalization, they fail to function effectively under the unknown, constantly-changing distribution shifts that are common in time series. In this paper, we propose Hyper Time-Series Forecasting (HTSF), a hypernetwork-based framework for accurate time series forecasting under distribution shift. HTSF jointly learns the time-varying distributions and the corresponding forecasting models in an end-to-end fashion. Specifically, HTSF exploits the hyper layers to learn the best characterization of the distribution shifts, generating the model parameters for the main layers to make accurate predictions. We implement HTSF as an extensible framework that can incorporate diverse time series forecasting models such as RNNs and Transformers. Extensive experiments on 9 benchmarks demonstrate that HTSF achieves state-of-the-art performance.
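    The core hypernetwork idea, a hyper layer emitting the main layer's weights from a context summary of the recent window, can be sketched as follows. This is a minimal PyTorch illustration under assumed shapes, not the HTSF implementation.

```python
import torch
import torch.nn as nn

class HyperLinear(nn.Module):
    """A hyper layer that emits the weights of the main linear layer
    from a context vector summarizing the current input window."""
    def __init__(self, in_dim, out_dim, ctx_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.hyper = nn.Linear(ctx_dim, in_dim * out_dim + out_dim)

    def forward(self, x, ctx):
        params = self.hyper(ctx)                 # generated W and b
        W = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim :]
        return x @ W.t() + b

# ctx might be simple window statistics (e.g., mean/std per feature)
layer = HyperLinear(in_dim=24, out_dim=1, ctx_dim=48)
y = layer(torch.randn(8, 24), ctx=torch.randn(48))
```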
    Neural Networks for Encoding Dynamic Security-Constrained Optimal Power Flow. (arXiv:2003.07939v5 [eess.SY] UPDATED)
    This paper introduces a framework to capture previously intractable optimization constraints and transform them into a mixed-integer linear program, through the use of neural networks. We encode the feasible space of optimization problems characterized by both tractable and intractable constraints, e.g., differential equations, in a neural network. Leveraging an exact mixed-integer reformulation of neural networks, we solve mixed-integer linear programs that accurately approximate solutions to the originally intractable non-linear optimization problem. We apply our methods to the AC optimal power flow problem (AC-OPF), where directly including dynamic security constraints renders the AC-OPF intractable. Our proposed approach has the potential to be significantly more scalable than traditional approaches. We demonstrate our approach for power system operation considering N-1 security and small-signal stability, showing how it can efficiently obtain cost-optimal solutions which at the same time satisfy both static and dynamic security constraints.
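    For context, the standard exact mixed-integer reformulation of a ReLU unit $y = \max(0, x)$ with known pre-activation bounds $l \le x \le u$ uses one binary variable per neuron. This is the textbook big-M sketch; the paper's exact encoding may differ.

```latex
y \ge x, \qquad y \ge 0, \qquad
y \le x - l\,(1 - z), \qquad y \le u\,z, \qquad z \in \{0, 1\}
```

    When $z = 1$ the constraints force $y = x$ (active unit); when $z = 0$ they force $y = 0$. Stacking these constraints over all neurons turns the trained network into linear constraints inside the MILP.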
    Ranking and Tuning Pre-trained Models: A New Paradigm for Exploiting Model Hubs. (arXiv:2110.10545v4 [cs.LG] UPDATED)
    Model hubs with many pre-trained models (PTMs) have become a cornerstone of deep learning. Although built at a high cost, they remain \emph{under-exploited} -- practitioners usually pick one PTM from the provided model hub by popularity and then fine-tune the PTM to solve the target task. This na\"ive but common practice poses two obstacles to full exploitation of pre-trained model hubs: first, the PTM selection by popularity has no optimality guarantee, and second, only one PTM is used while the remaining PTMs are ignored. An alternative might be to consider all possible combinations of PTMs and extensively fine-tune each combination, but this would not only be prohibitive computationally but may also lead to statistical over-fitting. In this paper, we propose a new paradigm for exploiting model hubs that is intermediate between these extremes. The paradigm is characterized by two aspects: (1) We use an evidence maximization procedure to estimate the maximum value of label evidence given features extracted by pre-trained models. This procedure can rank all the PTMs in a model hub for various types of PTMs and tasks \emph{before fine-tuning}. (2) The best ranked PTM can either be fine-tuned and deployed if we have no preference for the model's architecture or the target PTM can be tuned by the top $K$ ranked PTMs via a Bayesian procedure that we propose. This procedure, which we refer to as \emph{B-Tuning}, not only improves upon specialized methods designed for tuning homogeneous PTMs, but also applies to the challenging problem of tuning heterogeneous PTMs where it yields a new level of benchmark performance.
    Learning to Parallelize in a Shared-Memory Environment with Transformers. (arXiv:2204.12835v4 [cs.DC] UPDATED)
    In recent years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes to software applications. OpenMP is the most comprehensive API that implements such schemes, characterized by a readable interface. Nevertheless, introducing OpenMP into code is challenging due to pervasive pitfalls in the management of parallel shared memory. To facilitate this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in ML techniques, specifically in natural language processing (NLP), to replace S2S compilers altogether. We create a database (corpus), Open-OMP, specifically for this goal. Open-OMP contains over 28,000 code snippets, half of which contain OpenMP directives while the other half, with high probability, do not need parallelization at all. We use the corpus to train systems to automatically classify code segments in need of parallelization, as well as to suggest individual OpenMP clauses. We train several transformer models, named PragFormer, for these tasks, and show that they outperform statistically-trained baselines and automatic S2S parallelization compilers both in classifying the overall need for an OpenMP directive and in the introduction of private and reduction clauses. Our source code and database are available at: https://github.com/Scientific-Computing-Lab-NRCN/PragFormer.
    Feature robustness and sex differences in medical imaging: a case study in MRI-based Alzheimer's disease detection. (arXiv:2204.01737v3 [eess.IV] UPDATED)
    Convolutional neural networks have enabled significant improvements in medical image-based diagnosis. It is, however, increasingly clear that these models are susceptible to performance degradation when facing spurious correlations and dataset shift, leading, e.g., to underperformance on underrepresented patient groups. In this paper, we compare two classification schemes on the ADNI MRI dataset: a simple logistic regression model using manually selected volumetric features, and a convolutional neural network trained on 3D MRI data. We assess the robustness of the trained models in the face of varying dataset splits, training set sex composition, and stage of disease. In contrast to earlier work in other imaging modalities, we do not observe a clear pattern of improved model performance for the majority group in the training dataset. Instead, while logistic regression is fully robust to dataset composition, we find that CNN performance is generally improved for both male and female subjects when including more female subjects in the training dataset. We hypothesize that this might be due to inherent differences in the pathology of the two sexes. Moreover, in our analysis, the logistic regression model outperforms the 3D CNN, emphasizing the utility of manual feature specification based on prior knowledge, and the need for more robust automatic feature selection.
    One Model to Unite Them All: Personalized Federated Learning of Multi-Contrast MRI Synthesis. (arXiv:2207.06509v1 [eess.IV])
    Learning-based MRI translation involves a synthesis model that maps a source-contrast onto a target-contrast image. Multi-institutional collaborations are key to training synthesis models across broad datasets, yet centralized training involves privacy risks. Federated learning (FL) is a collaboration framework that instead adopts decentralized training to avoid sharing imaging data and mitigate privacy concerns. However, FL-trained models can be impaired by the inherent heterogeneity in the distribution of imaging data. On the one hand, implicit shifts in image distribution are evident across sites, even for a common translation task with a fixed source-target configuration. On the other hand, explicit shifts arise within and across sites when diverse translation tasks with varying source-target configurations are prescribed. To improve reliability against domain shifts, here we introduce the first personalized FL method for MRI Synthesis (pFLSynth). pFLSynth is based on an adversarial model equipped with a mapper that produces latents specific to individual sites and source-target contrasts. It leverages novel personalization blocks that adaptively tune the statistics and weighting of feature maps across the generator based on these latents. To further promote site-specificity, partial model aggregation is employed over downstream layers of the generator while upstream layers are retained locally. As such, pFLSynth enables training of a unified synthesis model that can reliably generalize across multiple sites and translation tasks. Comprehensive experiments on multi-site datasets clearly demonstrate the enhanced performance of pFLSynth against prior federated methods in multi-contrast MRI synthesis.
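    The partial-aggregation step can be sketched independently of the synthesis model. Below is a hypothetical helper, not the paper's code; the `down.` prefix merely stands in for whatever naming scheme marks the shared downstream layers.

```python
import torch

def partial_fedavg(site_states, shared_prefixes=("down.",)):
    """Average only parameters whose names mark shared downstream layers;
    upstream (site-specific) parameters are left untouched."""
    merged = [dict(s) for s in site_states]          # copy per-site states
    for key in site_states[0]:
        if any(key.startswith(p) for p in shared_prefixes):
            avg = torch.stack([s[key].float() for s in site_states]).mean(dim=0)
            for m in merged:
                m[key] = avg.clone()
    return merged
```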
    Learning to Prove Trigonometric Identities. (arXiv:2207.06679v1 [cs.LG])
    Automatic theorem proving with deep learning methods has attracted attention recently. In this paper, we construct an automatic proof system for trigonometric identities. We define the normalized form of trigonometric identities, design a set of rules for the proof, and put forward a method which can generate theoretically infinite trigonometric identities. Our goal is not only to complete the proof, but to complete it in as few steps as possible. For this reason, we design a model to learn from proof data generated by random BFS (rBFS), and it is proved theoretically and experimentally that the model can outperform rBFS after simple imitation learning. After further improvement through reinforcement learning, we get AutoTrig, which can give proof steps for identities in almost as few steps as BFS (the theoretically shortest method), at a time cost of only one-thousandth. In addition, AutoTrig also beats SymPy, MATLAB, and humans on the synthetic dataset, and performs well in many generalization tasks.
    Every Preference Changes Differently: Neural Multi-Interest Preference Model with Temporal Dynamics for Recommendation. (arXiv:2207.06652v1 [cs.IR])
    User embeddings (vectorized representations of a user) are essential in recommendation systems. Numerous approaches have been proposed to construct a representation for the user in order to find similar items for retrieval tasks, and they have been proven effective in industrial recommendation systems as well. Recently, people have discovered the power of using multiple embeddings to represent a user, with the hope that each embedding represents the user's interest in a certain topic. With multi-interest representation, it is important to model the user's preference over the different topics and how those preferences change over time. However, existing approaches either fail to estimate the user's affinity to each interest or unreasonably assume that every interest of every user fades at an equal rate over time, thus hurting the recall of candidate retrieval. In this paper, we propose the Multi-Interest Preference (MIP) model, an approach that not only produces multiple interests for users by using their sequential engagement more effectively, but also automatically learns a set of weights representing the preference over each embedding so that candidates can be retrieved from each interest proportionally. Extensive experiments have been conducted on various industrial-scale datasets to demonstrate the effectiveness of our approach.
    Adaptive Attitude Estimation Using a Hybrid Model-Learning Approach. (arXiv:2207.06903v1 [eess.SP])
    Attitude determination using a smartphone's inertial sensors poses a major challenge due to the low performance grade of the sensors and the variable nature of the walking pedestrian. In this paper, data-driven techniques are employed to address that challenge. To that end, a hybrid deep learning and model-based solution for attitude estimation is proposed. Here, classical model-based equations are applied to form an adaptive complementary filter structure. Instead of using constant or model-based adaptive weights, the accelerometer weights in each axis are determined by a dedicated neural network. The performance of the proposed hybrid approach is evaluated against popular model-based approaches using experimental data.
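    A minimal sketch of the underlying structure: a complementary filter whose accelerometer weight is supplied per step rather than fixed. In the paper the weights would come from a neural network; here they are just an input array, and the single-axis formulation is a simplification.

```python
import numpy as np

def complementary_filter(gyro_delta, accel_angle, weights):
    """Fuse integrated gyro increments with accelerometer angles; the
    per-step weight w would come from the learned network in the paper."""
    angle = accel_angle[0]
    estimates = []
    for g, a, w in zip(gyro_delta, accel_angle, weights):
        angle = (1.0 - w) * (angle + g) + w * a   # trust gyro vs. accel
        estimates.append(angle)
    return np.array(estimates)

# With a constant weight this reduces to the classical model-based filter.
est = complementary_filter(np.zeros(100), np.ones(100), np.full(100, 0.02))
```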
    Semi-Supervised Temporal Action Detection with Proposal-Free Masking. (arXiv:2207.07059v1 [cs.CV])
    Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD, and consequently much under-studied. Prior SS-TAD methods directly combine an existing proposal-based TAD method and an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT) with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/SPOT
    MorphoActivation: Generalizing ReLU activation function by mathematical morphology. (arXiv:2207.06413v1 [cs.LG])
    This paper analyses both nonlinear activation functions and spatial max-pooling for Deep Convolutional Neural Networks (DCNNs) by means of the algebraic basis of mathematical morphology. Additionally, a general family of activation functions is proposed by considering both max-pooling and nonlinear operators in the context of morphological representations. The experimental section validates the effectiveness of our approach on classical supervised learning benchmarks for DCNNs.
    Frequency-Encoded Deep Learning with Speed-of-Light Dominated Latency. (arXiv:2207.06883v1 [cs.ET])
    The ability of deep neural networks to perform complex tasks more accurately than manually-crafted solutions has created a substantial demand for more complex models processing larger amounts of data. However, the traditional computing architecture has reached a bottleneck in processing performance due to data movement between memory and compute. Considerable efforts have been made towards custom hardware acceleration, among which are optical neural networks (ONNs). These excel at energy-efficient linear operations but struggle with scalability and the integration of linear and nonlinear functions. Here, we introduce our multiplicative analog frequency transform optical neural network (MAFT-ONN) that encodes the data in the frequency domain to compute matrix-vector products in a single shot using a single photoelectric multiplication, and then implements the nonlinear activation for all neurons using a single electro-optic modulator. We experimentally demonstrate a 3-layer DNN with our architecture using a simple hardware setup assembled with commercial components. Additionally, this is the first DNN hardware accelerator suitable for analog inference of temporal waveforms like voice or radio signals, achieving bandwidth-limited throughput and speed-of-light limited latency. Our results demonstrate a highly scalable ONN with a straightforward path to surpassing the current computing bottleneck, in addition to enabling new possibilities for high-performance analog deep learning of temporal waveforms.
    Mirror Learning: A Unifying Framework of Policy Optimisation. (arXiv:2201.02373v10 [cs.LG] UPDATED)
    Modern deep reinforcement learning (RL) algorithms are motivated by either the generalised policy iteration (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g. due to required regularisation or other heuristics. The current explanation of their empirical success is essentially "by analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of aforementioned approximate analogies. Mirror learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
    Strain-Minimizing Hyperbolic Network Embeddings with Landmarks. (arXiv:2207.06775v1 [stat.CO])
    We introduce L-hydra (landmarked hyperbolic distance recovery and approximation), a method for embedding network- or distance-based data into hyperbolic space, which requires only the distance measurements to a few 'landmark nodes'. This landmark heuristic makes L-hydra applicable to large-scale graphs and improves upon previously introduced methods. As a mathematical justification, we show that a point configuration in d-dimensional hyperbolic space can be perfectly recovered (up to isometry) from distance measurements to just d+1 landmarks. We also show that L-hydra solves a two-stage strain-minimization problem, similar to our previous (unlandmarked) method 'hydra'. Testing on real network data, we show that L-hydra is an order of magnitude faster than existing hyperbolic embedding methods and scales linearly in the number of nodes. While the embedding error of L-hydra is higher than the error of existing methods, we introduce an extension, L-hydra+, which outperforms existing methods in both runtime and embedding quality.
    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. (arXiv:2207.06569v1 [cs.LG])
    The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied $\textit{benign overfitting}$, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks $\textit{do not fit benignly}$: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime $\textit{tempered overfitting}$, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with powerlaw spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
    Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach. (arXiv:2109.06332v3 [cs.LG] UPDATED)
    Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as constrained Markov decision process (CMDP). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve $\epsilon$-optimal cumulative reward with $\epsilon$ feasible policies. An $\epsilon$-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve $\epsilon$-optimal cumulative reward with zero constraint violations or not. To achieve that, we advocate the use of randomized primal-dual approach to solve the CMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA) which is shown to exhibit $\tilde{\mathcal{O}}\left(1/\epsilon^2\right)$ sample complexity to achieve $\epsilon$-optimal cumulative reward with zero constraint violations. In the prior works, the best available sample complexity for the $\epsilon$-optimal policy with zero constraint violation is $\tilde{\mathcal{O}}\left(1/\epsilon^5\right)$. Hence, the proposed algorithm provides a significant improvement as compared to the state of the art.
    Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers. (arXiv:2207.07087v1 [cs.CL])
    Prompt tuning updates only a few task-specific parameters in pre-trained models. It has achieved performance comparable to fine-tuning of the full parameter set on both language understanding and generation tasks. In this work, we study the problem of prompt tuning for neural text retrievers. We introduce parameter-efficient prompt tuning for text retrieval across in-domain, cross-domain, and cross-topic settings. Through an extensive analysis, we show that the strategy can mitigate the two issues -- parameter-inefficiency and weak generalizability -- faced by fine-tuning based retrieval methods. Notably, it can significantly improve the out-of-domain zero-shot generalization of the retrieval models. By updating only 0.1% of the model parameters, the prompt tuning strategy can help retrieval models achieve better generalization performance than traditional methods in which all parameters are updated. Finally, to facilitate research on retrievers' cross-topic generalizability, we curate and release an academic retrieval dataset with 18K query-result pairs in 87 topics, making it the largest topic-specific dataset of its kind to date.
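    Mechanically, prompt tuning amounts to freezing the retriever and learning a short prefix of embeddings. A generic PyTorch sketch follows, not the paper's code; `backbone` is any frozen encoder operating on already-embedded tokens.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    def __init__(self, backbone, embed_dim, n_prompts=16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                 # freeze the retriever
        # Only these vectors (a tiny fraction of all parameters) are trained.
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds):                # (batch, seq, embed_dim)
        prefix = self.prompts.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return self.backbone(torch.cat([prefix, token_embeds], dim=1))
```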
    Forming Trees with Treeformers. (arXiv:2207.06960v1 [cs.CL])
    Popular models such as Transformers and LSTMs use tokens as their unit of information. That is, each token is encoded into a vector representation, and those vectors are used directly in computation. However, humans frequently consider spans of tokens (i.e., phrases) instead of their constituent tokens. In this paper we introduce Treeformer, an architecture inspired by the CKY algorithm and the Transformer, which learns a composition operator and pooling function to construct hierarchical encodings for phrases and sentences. Our extensive experiments demonstrate the benefits of incorporating a hierarchical structure into the Transformer, and show significant improvements over a baseline Transformer in machine translation, abstractive summarization, and various natural language understanding tasks.
    A Generalized Framework for Microstructural Optimization using Neural Networks. (arXiv:2207.06512v1 [cond-mat.mtrl-sci])
    Microstructures, i.e., architected materials, are designed today, typically, by maximizing an objective, such as bulk modulus, subject to a volume constraint. However, in many applications, it is often more appropriate to impose constraints on other physical quantities of interest. In this paper, we consider such generalized microstructural optimization problems where any of the microstructural quantities, namely, bulk, shear, Poisson ratio, or volume, can serve as the objective, while the remaining can serve as constraints. In particular, we propose here a neural-network (NN) framework to solve such problems. The framework relies on the classic density formulation of microstructural optimization, but the density field is represented through the NN's weights and biases. The main characteristics of the proposed NN framework are: (1) it supports automatic differentiation, eliminating the need for manual sensitivity derivations, (2) smoothing filters are not required due to implicit filtering, (3) the framework can be easily extended to multiple-materials, and (4) a high-resolution microstructural topology can be recovered through a simple post-processing step. The framework is illustrated through a variety of microstructural optimization problems.
    MDEAW: A Multimodal Dataset for Emotion Analysis through EDA and PPG signals from wireless wearable low-cost off-the-shelf Devices. (arXiv:2207.06410v1 [cs.HC])
    We present MDEAW, a multimodal database consisting of Electrodermal Activity (EDA) and Photoplethysmography (PPG) signals recorded during course exams at Eurecat Academy, Sabadell, Barcelona, in order to elicit emotional reactions from students in a classroom scenario. Signals from 10 students were recorded along with the students' self-assessment of their affective state after each stimulus, in terms of 6 basic emotion states. All the signals were captured using portable, wearable, wireless, low-cost, and off-the-shelf equipment that has the potential to allow the use of affective computing methods in everyday applications. A baseline for student-wise affect recognition using EDA- and PPG-based features, as well as their fusion, was established through ReMECS, Fed-ReMECS, and Fed-ReMECS-U. These results indicate the prospects of using low-cost devices for affective state recognition applications. The proposed database will be made publicly available in order to allow researchers to achieve a more thorough evaluation of the suitability of these capturing devices for emotion state recognition applications.
    Modeling Long-term Dependencies and Short-term Correlations in Patient Journey Data with Temporal Attention Networks for Health Prediction. (arXiv:2207.06414v1 [cs.LG])
    Building models for health prediction based on Electronic Health Records (EHR) has become an active research area. EHR patient journey data consists of patients' time-ordered clinical events/visits. Most existing studies focus on modeling long-term dependencies between visits, without explicitly taking short-term correlations between consecutive visits into account; irregular time intervals, incorporated as auxiliary information, are fed into health prediction models to capture latent progressive patterns of patient journeys. We present a novel deep neural network with four modules that accounts for the contributions of various variables to health prediction: i) the Stacked Attention module strengthens the deep semantics in clinical events within each patient journey and generates visit embeddings, ii) the Short-Term Temporal Attention module models short-term correlations between consecutive visit embeddings while capturing the impact of time intervals within those visit embeddings, iii) the Long-Term Temporal Attention module models long-term dependencies between visit embeddings while capturing the impact of time intervals within those visit embeddings, iv) and finally, the Coupled Attention module adaptively aggregates the outputs of the Short-Term and Long-Term Temporal Attention modules to make health predictions. Experimental results on MIMIC-III demonstrate the superior predictive accuracy of our model compared to existing state-of-the-art methods, as well as the interpretability and robustness of this approach. Furthermore, we found that modeling short-term correlations contributes to local prior generation, leading to improved predictive modeling of patient journeys.
    Estimating Instance-dependent Bayes-label Transition Matrix using a Deep Neural Network. (arXiv:2105.13001v3 [cs.LG] UPDATED)
    In label-noise learning, estimating the transition matrix is a hot topic, as the matrix plays an important role in building statistically consistent classifiers. Traditionally, the transition from clean labels to noisy labels (i.e., the clean-label transition matrix (CLTM)) has been widely exploited to learn a clean label classifier by employing the noisy data. Motivated by the fact that classifiers mostly output Bayes optimal labels for prediction, in this paper, we study to directly model the transition from Bayes optimal labels to noisy labels (i.e., the Bayes-label transition matrix (BLTM)) and learn a classifier to predict Bayes optimal labels. Note that given only noisy data, it is ill-posed to estimate either the CLTM or the BLTM. But favorably, Bayes optimal labels have less uncertainty compared with clean labels, i.e., the class posteriors of Bayes optimal labels are one-hot vectors while those of clean labels are not. This enables two advantages in estimating the BLTM: (a) a set of examples with theoretically guaranteed Bayes optimal labels can be collected out of noisy data; (b) the feasible solution space is much smaller. By exploiting these advantages, we estimate the BLTM parametrically by employing a deep neural network, leading to better generalization and superior classification performance.
    Changepoint Detection for Real-Time Spectrum Sharing Radar. (arXiv:2207.06409v1 [eess.SY])
    Radar must adapt to changing environments, and we propose changepoint detection as a method to do so. In a world of increasingly congested radio frequencies, radars must adapt to avoid interference. Many radar systems employ the prediction-action cycle to proactively determine transmission mode while spectrum sharing. This method constructs and implements a model of the environment to predict unused frequencies, and then transmits in this predicted availability. For these selection strategies, performance is directly reliant on the quality of the underlying environmental models. In order to keep up with a changing environment, these models can employ changepoint detection. Changepoint detection is the identification of sudden changes, or changepoints, in the distribution from which data is drawn. This information allows the models to discard "garbage" data from a previous distribution, which has no relation to the current state of the environment. In this work, Bayesian online changepoint detection (BOCD) is applied to the sense-and-predict algorithm to increase the accuracy of its models and improve its performance. In the context of spectrum sharing, these changepoints represent interferers leaving and entering the spectral environment. The addition of changepoint detection allows for dynamic and robust spectrum sharing even as interference patterns change dramatically. BOCD is especially advantageous because it enables online changepoint detection, allowing models to be updated continuously as data are collected. This strategy can also be applied to many other predictive algorithms that build models in a changing environment.
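    For reference, the BOCD recursion of Adams and MacKay can be sketched for the simplest case of Gaussian data with known observation variance. This is a minimal illustration with assumed hyperparameters, not the radar system's implementation.

```python
import numpy as np
from scipy.stats import norm

def bocd(data, hazard=1 / 100, mu0=0.0, var0=1.0, var_x=1.0):
    """Run-length posterior for Gaussian data with known noise variance."""
    T = len(data)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    mu, var = np.array([mu0]), np.array([var0])   # per-run-length posterior on the mean
    for t, x in enumerate(data):
        pred = norm.pdf(x, mu, np.sqrt(var + var_x))               # predictive density
        R[t + 1, 1 : t + 2] = R[t, : t + 1] * pred * (1 - hazard)  # run continues
        R[t + 1, 0] = np.sum(R[t, : t + 1] * pred * hazard)        # changepoint resets
        R[t + 1] /= R[t + 1].sum()
        new_var = 1.0 / (1.0 / var + 1.0 / var_x)                  # conjugate update
        new_mu = new_var * (mu / var + x / var_x)
        mu = np.concatenate([[mu0], new_mu])
        var = np.concatenate([[var0], new_var])
    return R   # R[t, r]: probability that the run length is r after t observations
```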
    ECG beat classification using machine learning and pre-trained convolutional neural networks. (arXiv:2207.06408v1 [eess.SP])
    The electrocardiogram (ECG) is routinely used in hospitals to analyze the cardiovascular status and health of an individual. Abnormal heart rhythms can be a precursor to more serious conditions, including sudden cardiac death. Classifying abnormal rhythms is a laborious process prone to error, so tools that perform automated classification with high accuracy are highly desirable. The work presented here classifies five different types of ECG arrhythmia based on the AAMI EC57 standard, using the MIT-BIH data set: non-ectopic (normal), supraventricular, ventricular, fusion, and unknown beats. By transforming pre-processed ECG waveforms into a rich feature space, applying appropriate post-processing, and utilizing deep convolutional neural networks after fine-tuning and hyperparameter selection, we show that highly accurate classification of the five waveform types can be obtained. Performance on the test set indicated higher overall accuracy (98.62%), as well as better performance in classifying each of the five waveforms, than hitherto reported in the literature.
    GrabQC: Graph based Query Contextualization for automated ICD coding. (arXiv:2207.06802v1 [cs.LG])
    Automated medical coding is the process of codifying clinical notes to appropriate diagnosis and procedure codes automatically from standard taxonomies such as ICD (International Classification of Diseases) and CPT (Current Procedure Terminology). The manual coding process involves the identification of entities from the clinical notes, followed by querying a commercial or non-commercial medical codes Information Retrieval (IR) system that follows the Centre for Medicare and Medicaid Services (CMS) guidelines. We propose to automate this manual process by constructing a query for the IR system from the entities auto-extracted from the clinical notes. We propose \textbf{GrabQC}, a \textbf{Gra}ph \textbf{b}ased \textbf{Q}uery \textbf{C}ontextualization method that automatically extracts queries from the clinical text, contextualizes the queries using a Graph Neural Network (GNN) model, and obtains the ICD codes using an external IR system. We also propose a method for labelling the dataset for training the model. We perform experiments on two datasets of clinical text in three different setups to assert the effectiveness of our approach. The experimental results show that our proposed method is better than the compared baselines in all three settings.
    Semi-supervised cross-lingual speech emotion recognition. (arXiv:2207.06767v1 [cs.SD])
    Speech emotion recognition (SER) on a single language has achieved remarkable results through deep learning approaches over the last decade. However, cross-lingual SER remains a challenge in real-world applications due to (i) a large difference between the source and target domain distributions, and (ii) the availability of few labeled and many unlabeled utterances for the new language. Taking these aspects into account, we propose a Semi-Supervised Learning (SSL) method for cross-lingual emotion recognition when a few labels from the new language are available. Based on a Convolutional Neural Network (CNN), our method adapts to a new language by exploiting a pseudo-labeling strategy for the unlabeled utterances. In particular, the use of hard and soft pseudo-labels is investigated. We thoroughly evaluate the performance of the method in a speaker-independent setup on both the source and the new language, and show its robustness across five languages belonging to different linguistic strains.
    From Shapley back to Pearson: Hypothesis Testing via the Shapley Value. (arXiv:2207.07038v1 [cs.LG])
    Machine learning models, in particular artificial neural networks, are increasingly used to inform decision making in high-stakes scenarios across a variety of fields--from financial services, to public safety, and healthcare. While neural networks have achieved remarkable performance in many settings, their complex nature raises concerns on their reliability, trustworthiness, and fairness in real-world scenarios. As a result, several a-posteriori explanation methods have been proposed to highlight the features that influence a model's prediction. Notably, the Shapley value--a game theoretic quantity that satisfies several desirable properties--has gained popularity in the machine learning explainability literature. More traditionally, however, feature importance in statistical learning has been formalized by conditional independence, and a standard way to test for it is via Conditional Randomization Tests (CRTs). So far, these two perspectives on interpretability and feature importance have been considered distinct and separate. In this work, we show that Shapley-based explanation methods and conditional independence testing for feature importance are closely related. More precisely, we prove that evaluating a Shapley coefficient amounts to performing a specific set of conditional independence tests, as implemented by a procedure similar to the CRT but for a different null hypothesis. Furthermore, the obtained game-theoretic values upper bound the $p$-values of such tests. As a result, we grant large Shapley coefficients with a precise statistical sense of importance with controlled type I error.
    Musical Instrument Classification via Low-Dimensional Feature Vectors. (arXiv:1909.08444v2 [cs.SD] UPDATED)
    Music is a mysterious language that conveys feelings and thoughts via different tones and timbres. For a better understanding of timbre in music, we chose music data of 6 representative instruments, analysed their timbre features, and classified them. Instead of the current trend of neural networks for black-box classification, our project is based on a combination of MFCC and LPC features, augmented with a 6-dimensional feature vector designed by ourselves through observation and experimentation. In our white-box model, we observed significant patterns of sound that distinguish different timbres, and discovered some connections between objective data and subjective senses. With a 32-dimensional feature vector in total and a naive all-pairs SVM, we achieved improved classification accuracy compared to a single tool. We also attempted to analyze music pieces downloaded from the Internet, found varying performance across instruments, explored the reasons, and suggested possible ways to improve the performance.
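    A hedged sketch of such a pipeline using common open-source tools (librosa for MFCCs and scikit-learn's SVC, which trains one-vs-one "all-pairs" classifiers internally); the spectral-centroid feature is merely a stand-in for the hand-designed 6-dimensional descriptor.

```python
import librosa
import numpy as np
from sklearn.svm import SVC

def timbre_features(path):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    return np.concatenate([mfcc, [centroid]])   # stand-in feature vector

# X = np.stack([timbre_features(p) for p in paths]); y = instrument_labels
# clf = SVC(kernel="rbf").fit(X, y)   # SVC is one-vs-one ("all-pairs") internally
```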
    Estimating Classification Confidence Using Kernel Densities. (arXiv:2207.06529v1 [stat.ML])
    This paper investigates the post-hoc calibration of confidence for "exploratory" machine learning classification problems. The difficulty in these problems stems from the continuing desire to push the boundaries of which categories have enough examples to generalize from when curating datasets, and from confusion regarding the validity of those categories. We argue that for such problems the "one-versus-all" approach (top-label calibration) must be used rather than the "calibrate-the-full-response-matrix" approach advocated elsewhere in the literature. We introduce and test four new algorithms designed to handle the idiosyncrasies of category-specific confidence estimation. Chief among these methods is the use of kernel density ratios for confidence calibration, including a novel, bulletproof algorithm for choosing the bandwidth. We test our claims and explore the limits of calibration on a bioinformatics application (PhANNs) as well as the classic MNIST benchmark. Finally, our analysis argues that post-hoc calibration should always be performed, should be based only on the test dataset, and should be sanity-checked visually.
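    The kernel-density-ratio idea for one-versus-all calibration can be sketched as follows; bandwidth selection, the paper's "bulletproof" algorithm, is replaced here by SciPy's default rule. A minimal illustration, not the authors' method.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_confidence(scores_pos, scores_neg, prior_pos, query_scores):
    """Calibrated P(class | score) from class-conditional score densities."""
    f_pos = gaussian_kde(scores_pos)   # density of scores for true members
    f_neg = gaussian_kde(scores_neg)   # density of scores for everything else
    num = prior_pos * f_pos(query_scores)
    return num / (num + (1.0 - prior_pos) * f_neg(query_scores))
```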
    Wakeword Detection under Distribution Shifts. (arXiv:2207.06423v1 [cs.SD])
    We propose a novel approach for semi-supervised learning (SSL) designed to overcome distribution shifts between training and real-world data arising in the keyword spotting (KWS) task. Shifts from the training data distribution are a key challenge for real-world KWS tasks: when a new model is deployed on device, the gating of the accepted data undergoes a shift in distribution, making the problem of timely updates via subsequent deployments hard. Despite the shift, we assume that the marginal distributions on labels do not change. We utilize a modified teacher/student training framework, where labeled training data is augmented with unlabeled data. Note that the teacher does not have access to the new distribution either. To train effectively with a mix of human and teacher labeled data, we develop a teacher labeling strategy based on confidence heuristics to reduce entropy on the label distribution from the teacher model; the data is then sampled to match the marginal distribution on the labels. Large scale experimental results show that a convolutional neural network (CNN) trained on far-field audio, and evaluated on far-field audio drawn from a different distribution, obtains a 14.3% relative improvement in false discovery rate (FDR) at equal false reject rate (FRR), while yielding a 5% improvement in FDR under no distribution shift. Under a more severe distribution shift from far-field to near-field audio with a smaller fully connected network (FCN), our approach achieves a 52% relative improvement in FDR at equal FRR, while yielding a 20% relative improvement in FDR on the original distribution.
    An Investigation on Non-Invasive Brain-Computer Interfaces: Emotiv Epoc+ Neuroheadset and Its Effectiveness. (arXiv:2207.06914v1 [eess.SP])
    In this study, we illustrate the progress of BCI research and survey a range of contemporary approaches. First, we explore a decoding natural speech approach, introduced by Facebook Reality Lab and the University of California, San Francisco, that is designed to decode human speech directly from the human brain onto a digital screen. Then, we study a recently presented visionary project to control the human brain using the Brain-Machine Interface (BMI) approach. We also investigate the well-known electroencephalography (EEG) based Emotiv Epoc+ Neuroheadset to identify six emotional parameters (engagement, excitement, focus, stress, relaxation, and interest) using brain signals, by testing the neuroheadset on three human subjects, where we utilize two supervised learning classifiers, Naive Bayes and Linear Regression, to show the accuracy and competency of the Epoc+ device and its associated applications in neurotechnological research. We present experimental studies, and the demonstration indicates 69% and 62% improved accuracy for the aforementioned classifiers, respectively, in reading the performance metrics of the participants. We envision that non-invasive, insertable, and low-cost BCI approaches shall be a focal point, not only as an alternative for patients with physical paralysis but also for understanding the brain, paving the way to access and control memories in the very near future.
    Pediatric Sleep Scoring In-the-wild from Millions of Multi-channel EEG Signals. (arXiv:2207.06921v1 [eess.SP])
    Sleep is critical to the health and development of infants, children, and adolescents, but pediatric sleep is severely under-researched compared to adult sleep in the context of machine learning for health and well-being. Here, we present the first automated pediatric sleep scoring results on a recent large-scale sleep study dataset that was collected during standard clinical care. We develop a transformer-based deep neural network model that learns to classify five sleep stages from millions of multi-channel electroencephalogram (EEG) signals with 78% overall accuracy. Further, we conduct an in-depth analysis of the model performance based on patient demographics and EEG channels.
    Early Detection of Ovarian Cancer by Wavelet Analysis of Protein Mass Spectra. (arXiv:2207.07028v1 [cs.LG])
    Accurate and efficient detection of ovarian cancer at early stages is critical to ensure proper treatments for patients. Among the first-line modalities investigated in studies of early diagnosis are features distilled from protein mass spectra. This method, however, considers only a specific subset of spectral responses and ignores the interplay among protein expression levels, which can also contain diagnostic information. We propose a new modality that automatically searches protein mass spectra for discriminatory features by considering the self-similar nature of the spectra. Self-similarity is assessed by taking a wavelet decomposition of protein mass spectra and estimating the rate of level-wise decay in the energies of the resulting wavelet coefficients. Level-wise energies are estimated in a robust manner using distance variance, and rates are estimated locally via a rolling window approach. This results in a collection of rates that can be used to characterize the interplay among proteins, which can be indicative of cancer presence. Discriminatory descriptors are then selected from these evolutionary rates and used as classifying features. The proposed wavelet-based features are used in conjunction with features proposed in the existing literature for early stage diagnosis of ovarian cancer using two datasets published by the American National Cancer Institute. Including the wavelet-based features from the new modality results in improvements in diagnostic performance for early-stage ovarian cancer detection. This demonstrates the ability of the proposed modality to characterize new ovarian cancer diagnostic information.
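    The decay-rate feature can be sketched with PyWavelets: decompose, compute level-wise detail energies, and regress their logarithm on the level index. Plain mean energies and a global fit substitute here for the paper's robust distance-variance estimates and rolling windows.

```python
import numpy as np
import pywt

def wavelet_decay_rate(signal, wavelet="db4", levels=6):
    coeffs = pywt.wavedec(signal, wavelet, level=levels)
    energies = [np.mean(c ** 2) for c in coeffs[1:]]    # detail levels only
    x = np.arange(1, len(energies) + 1)
    slope, _ = np.polyfit(x, np.log(energies), 1)
    return slope   # level-wise energy decay rate as a candidate feature

rate = wavelet_decay_rate(np.random.randn(4096))
```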
    PASHA: Efficient HPO with Progressive Resource Allocation. (arXiv:2207.06940v1 [cs.LG])
    Hyperparameter optimization (HPO) and neural architecture search (NAS) are methods of choice to obtain the best-in-class machine learning models, but in practice they can be costly to run. When models are trained on large datasets, tuning them with HPO or NAS rapidly becomes prohibitively expensive for practitioners, even when efficient multi-fidelity methods are employed. We propose an approach to tackle the challenge of tuning machine learning models trained on large datasets with limited computational resources. Our approach, named PASHA, is able to dynamically allocate maximum resources for the tuning procedure depending on the need. The experimental comparison shows that PASHA identifies well-performing hyperparameter configurations and architectures while consuming significantly fewer computational resources than solutions like ASHA.
    Perfectly Balanced: Improving Transfer and Robustness of Supervised Contrastive Learning. (arXiv:2204.07596v2 [stat.ML] UPDATED)
    An ideal learned representation should display transferability and robustness. Supervised contrastive learning (SupCon) is a promising method for training accurate models, but produces representations that do not capture these properties due to class collapse -- when all points in a class map to the same representation. Recent work suggests that "spreading out" these representations improves them, but the precise mechanism is poorly understood. We argue that creating spread alone is insufficient for better representations, since spread is invariant to permutations within classes. Instead, both the correct degree of spread and a mechanism for breaking this invariance are necessary. We first prove that adding a weighted class-conditional InfoNCE loss to SupCon controls the degree of spread. Next, we study three mechanisms to break permutation invariance: using a constrained encoder, adding a class-conditional autoencoder, and using data augmentation. We show that the latter two encourage clustering of latent subclasses under more realistic conditions than the former. Using these insights, we show that adding a properly-weighted class-conditional InfoNCE loss and a class-conditional autoencoder to SupCon achieves 11.1 points of lift on coarse-to-fine transfer across 5 standard datasets and 4.7 points on worst-group robustness on 3 datasets, setting state-of-the-art on CelebA by 11.5 points.
    Comparing the latent space of generative models. (arXiv:2207.06812v1 [cs.LG])
    Different encodings of datapoints in the latent space of latent-vector generative models may result in more or less effective and disentangled characterizations of the different explanatory factors of variation behind the data. Many works have recently been devoted to the exploration of the latent space of specific models, mostly focused on the study of how features are disentangled and of how trajectories producing desired alterations of data in the visible space can be found. In this work we address the more general problem of comparing the latent spaces of different models, looking for transformations between them. We confined the investigation to the familiar and largely investigated case of generative models for the data manifold of human faces. The surprising, preliminary result reported in this article is that (provided models have not been taught or explicitly conceived to act differently) a simple linear mapping is enough to pass from one latent space to another while preserving most of the information.  ( 2 min )
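    The headline claim is easy to probe on synthetic data: generate paired codes that differ by an unknown linear map plus noise, fit the map by least squares, and check the residual. Everything below is illustrative, not the authors' experimental setup.

    ```python
    # Fit a linear map between two "latent spaces" from paired codes.
    import numpy as np

    rng = np.random.default_rng(0)
    z_a = rng.normal(size=(1000, 512))                  # codes from model A
    W_true = rng.normal(size=(512, 512)) / 512 ** 0.5
    z_b = z_a @ W_true + 0.01 * rng.normal(size=(1000, 512))  # codes from model B

    W, *_ = np.linalg.lstsq(z_a, z_b, rcond=None)       # least-squares linear map
    rel_err = np.linalg.norm(z_a @ W - z_b) / np.linalg.norm(z_b)
    print(f"relative mapping error: {rel_err:.3f}")     # small => linear map suffices
    ```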
    Differentially Private Graph Learning via Sensitivity-Bounded Personalized PageRank. (arXiv:2207.06944v1 [cs.CR])
    Personalized PageRank (PPR) is a fundamental tool in unsupervised learning of graph representations such as node ranking, labeling, and graph embedding. However, while data privacy is one of the most important recent concerns, existing PPR algorithms are not designed to protect user privacy. PPR is highly sensitive to the input graph edges: the difference of only one edge may cause a big change in the PPR vector, potentially leaking private user data. In this work, we propose an algorithm which outputs an approximate PPR and has provably bounded sensitivity to input edges. In addition, we prove that our algorithm achieves similar accuracy to non-private algorithms when the input graph has large degrees. Our sensitivity-bounded PPR directly implies private algorithms for several tools of graph learning, such as differentially private (DP) PPR ranking, DP node classification, and DP node embedding. To complement our theoretical analysis, we also empirically verify the practical performance of our algorithms.  ( 2 min )
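    For orientation, here is plain (non-private) personalized PageRank by power iteration, with naive output noise bolted on. This is explicitly not the paper's sensitivity-bounded mechanism; noise calibrated to PPR's worst-case edge sensitivity would be far larger, which is exactly the problem the paper addresses. The graph and noise scale are illustrative.

    ```python
    # Plain PPR via power iteration; dangling rows are left as zero rows.
    import numpy as np

    def ppr(adj, source, alpha=0.15, iters=50):
        n = adj.shape[0]
        P = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1)  # row-stochastic
        e = np.zeros(n)
        e[source] = 1.0
        pi = e.copy()
        for _ in range(iters):
            pi = alpha * e + (1 - alpha) * (P.T @ pi)            # restart + walk
        return pi

    adj = (np.random.rand(100, 100) < 0.05).astype(float)
    noisy_pi = ppr(adj, source=0) + np.random.laplace(scale=0.01, size=100)
    ```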
    Deep Learning Methods for Protein Family Classification on PDB Sequencing Data. (arXiv:2207.06678v1 [q-bio.QM])
    Proteins are a class of macromolecules composed of amino acid chains whose composition influences how they fold, thus dictating their function and features; they play a central role in major biological processes and are required for the structure, function, and regulation of the body's tissues. Understanding protein functions is vital to the development of therapeutics and precision medicine, so the ability to classify proteins and their functions based on measurable features is crucial. Indeed, the automatic inference of a protein's properties from its sequence of amino acids, known as its primary structure, remains an important open problem within the field of bioinformatics, especially given recent advancements in sequencing technologies and the extensive number of known but uncategorized proteins with unknown properties. In this work, we demonstrate and compare the performance of several deep learning frameworks, including novel bi-directional LSTM and convolutional models, on widely available sequencing data from the Protein Data Bank (PDB) of the Research Collaboratory for Structural Bioinformatics (RCSB), and benchmark this performance against classical machine learning approaches, including k-nearest neighbors and multinomial regression classifiers, trained on experimental data. Our results show that our deep learning models deliver superior performance to classical machine learning methods, with the convolutional architecture providing the most impressive inference performance.  ( 2 min )
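    As a concrete reference point, a bidirectional LSTM classifier over integer-encoded amino-acid sequences takes only a few lines in PyTorch; the dimensions, vocabulary, and family count below are illustrative rather than the paper's settings.

    ```python
    # Minimal bidirectional-LSTM protein family classifier sketch.
    import torch
    import torch.nn as nn

    class ProteinBiLSTM(nn.Module):
        def __init__(self, n_families, vocab=21, emb=32, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb, padding_idx=0)  # 20 residues + pad
            self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_families)

        def forward(self, seqs):                     # seqs: (batch, seq_len) int64
            _, (h, _) = self.lstm(self.embed(seqs))  # h: (2, batch, hidden)
            return self.head(torch.cat([h[0], h[1]], dim=-1))

    logits = ProteinBiLSTM(n_families=500)(torch.randint(1, 21, (8, 200)))
    ```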
    Iterative training of robust k-space interpolation networks for improved image reconstruction with limited scan specific training samples. (arXiv:2201.03560v2 [eess.IV] UPDATED)
    Purpose: To evaluate an iterative learning approach for enhanced performance of Robust Artificial-neural-networks for K-space Interpolation (RAKI), when only a limited amount of training data (auto-calibration signals, ACS) is available for accelerated standard 2D imaging. Methods: In a first step, the RAKI model was optimized for the case of a strongly limited training data amount. In the iterative learning approach (termed iterative RAKI), the optimized RAKI model is initially trained using original and augmented ACS obtained from a linear parallel imaging reconstruction. Subsequently, the RAKI convolution filters are refined iteratively using original and augmented ACS extracted from the previous RAKI reconstruction. Evaluation was carried out on 200 retrospectively undersampled in-vivo datasets from the fastMRI neuro database with different contrast settings. Results: For limited training data (18 and 22 ACS lines for R=4 and R=5, respectively), iterative RAKI outperforms standard RAKI by reducing residual artefacts and yields strong noise suppression when compared to standard parallel imaging, underlined by quantitative reconstruction quality metrics. In combination with a phase constraint, further reconstruction improvements can be achieved. Additionally, iterative RAKI shows better performance than both GRAPPA and RAKI in the case of pre-scan calibration with varying contrast between training and undersampled data. Conclusion: The iterative learning approach with RAKI benefits from standard RAKI's well-known noise suppression feature but requires less original training data for the accurate reconstruction of standard 2D images, thereby improving net acceleration.  ( 3 min )
  • Open

    Contextual Inverse Optimization: Offline and Online Learning. (arXiv:2106.14015v2 [cs.LG] UPDATED)
    We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after-the-fact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. In the offline setting, the decision-maker has information available from past periods and needs to make one decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based on a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon.
    Adversarial Sign-Corrupted Isotonic Regression. (arXiv:2207.07075v1 [math.ST])
    Classical univariate isotonic regression involves nonparametric estimation under a monotonicity constraint of the true signal. We consider a variation of this generating process, which we term adversarial sign-corrupted isotonic (\texttt{ASCI}) regression. Under this \texttt{ASCI} setting, the adversary has full access to the true isotonic responses, and is free to sign-corrupt them. Estimating the true monotonic signal given these sign-corrupted responses is a highly challenging task. Notably, the sign-corruptions are designed to violate monotonicity, and possibly induce heavy dependence between the corrupted response terms. In this sense, \texttt{ASCI} regression may be viewed as an adversarial stress test for isotonic regression. Our motivation is driven by understanding whether efficient robust estimation of the monotone signal is feasible under this adversarial setting. We develop \texttt{ASCIFIT}, a three-step estimation procedure under the \texttt{ASCI} setting. The \texttt{ASCIFIT} procedure is conceptually simple, easy to implement with existing software, and consists of applying the \texttt{PAVA} with crucial pre- and post-processing corrections. We formalize this procedure, and demonstrate its theoretical guarantees in the form of sharp high probability upper bounds and minimax lower bounds. We illustrate our findings with detailed simulations.
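    A crude stand-in for the core observation, with heavy caveats: sign-corruption destroys signs but not magnitudes, so one can run PAVA on |y|. The paper's crucial pre- and post-processing corrections, which handle the bias that |theta + noise| introduces near zero, are omitted here, so this is only an illustration of the setting.

    ```python
    # Sign-corrupted isotonic toy: fit PAVA to absolute responses.
    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    n = 500
    theta = np.linspace(0.0, 3.0, n)           # true nonnegative monotone signal
    signs = np.sign(np.random.rand(n) - 0.5)   # adversarial stand-in: random signs
    y = signs * (theta + 0.3 * np.random.randn(n))

    fit = IsotonicRegression(increasing=True).fit_transform(np.arange(n), np.abs(y))
    print("MSE vs. truth:", np.mean((fit - theta) ** 2))
    ```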
    Bayesian Inference with Nonlinear Generative Models: Comments on Secure Learning. (arXiv:2201.09986v3 [cs.IT] UPDATED)
    Unlike the classical linear model, nonlinear generative models have been addressed sparsely in the statistical learning literature. This work aims to bring attention to these models and their secrecy potential. To this end, we invoke the replica method to derive the asymptotic normalized cross entropy in an inverse probability problem whose generative model is described by a Gaussian random field with a generic covariance function. Our derivations further demonstrate the asymptotic statistical decoupling of the Bayesian estimator and specify the decoupled setting for a given nonlinear model. The replica solution shows that strictly nonlinear models establish an all-or-nothing phase transition: there exists a critical load at which the optimal Bayesian inference changes from perfect learning to uncorrelated learning. Based on this finding, we design a new secure coding scheme which achieves the secrecy capacity of the wiretap channel. This interesting result implies that strictly nonlinear generative models are perfectly secured without any secure coding. We justify this latter statement through the analysis of an illustrative model for perfectly secure and reliable inference.
    Fixing Inventory Inaccuracies At Scale. (arXiv:2006.13126v3 [stat.ML] UPDATED)
    Inaccurate records of inventory occur frequently, and by some measures cost retailers approximately 4% in annual sales. Detecting inventory inaccuracies manually is cost-prohibitive, and existing algorithmic solutions rely almost exclusively on learning from longitudinal data, which is insufficient in the dynamic environment induced by modern retail operations. Instead, we propose a solution based on cross-sectional data over stores and SKUs, observing that detecting inventory inaccuracies can be viewed as a problem of identifying anomalies in a (low-rank) Poisson matrix. State-of-the-art approaches to anomaly detection in low-rank matrices apparently fall short. Specifically, from a theoretical perspective, recovery guarantees for these approaches require that non-anomalous entries be observed with vanishingly small noise (which is not the case in our problem, and indeed in many applications). So motivated, we propose a conceptually simple entry-wise approach to anomaly detection in low-rank Poisson matrices. Our approach accommodates a general class of probabilistic anomaly models. We show that the cost incurred by our algorithm approaches that of an optimal algorithm at a min-max optimal rate. Using synthetic data and real data from a consumer goods retailer, we show that our approach provides up to a 10x cost reduction over incumbent approaches to anomaly detection. Along the way, we build on recent work that seeks entry-wise error guarantees for matrix completion, establishing such guarantees for sub-exponential matrices, a result of independent interest.
    Discovery of New Multi-Level Features for Domain Generalization via Knowledge Corruption. (arXiv:2109.04320v2 [cs.LG] UPDATED)
    Machine learning models that can generalize to unseen domains are essential when applied in real-world scenarios involving strong domain shifts. We address the challenging domain generalization (DG) problem, where a model trained on a set of source domains is expected to generalize well in unseen domains without any exposure to their data. The main challenge of DG is that the features learned from the source domains are not necessarily present in the unseen target domains, leading to performance deterioration. We assume that learning a richer set of features is crucial to improve the transfer to a wider set of unknown domains. For this reason, we propose COLUMBUS, a method that enforces new feature discovery via a targeted corruption of the most relevant input and multi-level representations of the data. We conduct an extensive empirical evaluation to demonstrate the effectiveness of the proposed approach which achieves new state-of-the-art results by outperforming 18 DG algorithms on multiple DG benchmark datasets in the DomainBed framework.
    Subgraph Frequency Distribution Estimation using Graph Neural Networks. (arXiv:2207.06684v1 [cs.LG])
    Small subgraphs (graphlets) are important features for describing the fundamental units of a large network. The calculation of subgraph frequency distributions has wide application in multiple domains including biology and engineering. Unfortunately, due to the inherent complexity of this task, most existing methods are computationally intensive and inefficient. In this work, we propose GNNS, a novel representational learning framework that utilizes graph neural networks to sample subgraphs efficiently for estimating their frequency distribution. Our framework includes an inference model and a generative model that learns hierarchical embeddings of nodes, subgraphs, and graph types. With the learned model and embeddings, subgraphs are sampled in a highly scalable and parallel way, and the frequency distribution estimation is then performed based on these sampled subgraphs. Our method achieves comparable accuracy and a significant speedup of three orders of magnitude over existing methods.
    Estimating Classification Confidence Using Kernel Densities. (arXiv:2207.06529v1 [stat.ML])
    This paper investigates the post-hoc calibration of confidence for "exploratory" machine learning classification problems. The difficulty in these problems stems from the continuing desire to push the boundaries of which categories have enough examples to generalize from when curating datasets, and confusion regarding the validity of those categories. We argue that for such problems the "one-versus-all" approach (top-label calibration) must be used rather than the "calibrate-the-full-response-matrix" approach advocated elsewhere in the literature. We introduce and test four new algorithms designed to handle the idiosyncrasies of category-specific confidence estimation. Chief among these methods is the use of kernel density ratios for confidence calibration, including a novel, bulletproof algorithm for choosing the bandwidth. We test our claims and explore the limits of calibration on a bioinformatics application (PhANNs) as well as the classic MNIST benchmark. Finally, our analysis argues that post-hoc calibration should always be performed, should be based only on the test dataset, and should be sanity-checked visually.
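    The density-ratio idea for one category ("one-versus-all") fits in a dozen lines: estimate score densities for correct and incorrect top-label predictions separately, then convert a new score into a posterior probability of correctness. The sketch below uses scipy's default bandwidth, not the paper's bulletproof bandwidth selector, and all numbers are illustrative.

    ```python
    # Kernel-density-ratio confidence calibration for a single category.
    import numpy as np
    from scipy.stats import gaussian_kde

    def calibrate(scores_correct, scores_incorrect, new_scores):
        kde_c = gaussian_kde(scores_correct)          # p(score | correct)
        kde_i = gaussian_kde(scores_incorrect)        # p(score | incorrect)
        prior = len(scores_correct) / (len(scores_correct) + len(scores_incorrect))
        pc = prior * kde_c(new_scores)
        pi = (1 - prior) * kde_i(new_scores)
        return pc / (pc + pi)                         # P(correct | score)

    conf = calibrate(np.random.beta(8, 2, 500),       # toy score distributions
                     np.random.beta(3, 4, 200),
                     np.array([0.5, 0.8, 0.95]))
    ```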
    Volatility Based Kernels and Moving Average Means for Accurate Forecasting with Gaussian Processes. (arXiv:2207.06544v1 [cs.LG])
    A broad class of stochastic volatility models are defined by systems of stochastic differential equations. While these models have seen widespread success in domains such as finance and statistical climatology, they typically lack an ability to condition on historical data to produce a true posterior distribution. To address this fundamental limitation, we show how to re-cast a class of stochastic volatility models as a hierarchical Gaussian process (GP) model with specialized covariance functions. This GP model retains the inductive biases of the stochastic volatility model while providing the posterior predictive distribution given by GP inference. Within this framework, we take inspiration from well studied domains to introduce a new class of models, Volt and Magpie, that significantly outperform baselines in stock and wind speed forecasting, and naturally extend to the multitask setting.
    Blurs Behave Like Ensembles: Spatial Smoothings to Improve Accuracy, Uncertainty, and Robustness. (arXiv:2105.12639v4 [cs.LG] UPDATED)
    Neural network ensembles, such as Bayesian neural networks (BNNs), have shown success in the areas of uncertainty estimation and robustness. However, a crucial challenge prohibits their use in practice: BNNs require a large number of predictions to produce reliable results, leading to a significant increase in computational cost. To alleviate this issue, we propose spatial smoothing, a method that spatially ensembles neighboring feature map points of convolutional neural networks. By simply adding a few blur layers to the models, we empirically show that spatial smoothing improves accuracy, uncertainty estimation, and robustness of BNNs across a whole range of ensemble sizes. In particular, BNNs incorporating spatial smoothing achieve high predictive performance with merely a handful of ensembles. Moreover, this method can also be applied to canonical deterministic neural networks to improve performance. Several lines of evidence suggest that the improvements can be attributed to the stabilized feature maps and the smoothing of the loss landscape. In addition, we provide a fundamental explanation for prior works - namely, global average pooling, pre-activation, and ReLU6 - by addressing them as special cases of spatial smoothing. These not only enhance accuracy, but also improve uncertainty estimation and robustness by making the loss landscape smoother in the same manner as spatial smoothing. The code is available at https://github.com/xxxnell/spatial-smoothing.
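    The intervention itself is tiny: a frozen depthwise convolution that box-blurs feature maps, dropped after a stage boundary of an existing CNN. A minimal PyTorch sketch (kernel size and placement are choices, not prescriptions from the paper):

    ```python
    # A "blur layer": fixed depthwise box filter that spatially smooths features.
    import torch
    import torch.nn as nn

    class Blur(nn.Module):
        def __init__(self, channels, k=3):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, k, padding=k // 2,
                                  groups=channels, bias=False)
            self.conv.weight.data.fill_(1.0 / (k * k))   # uniform box blur
            self.conv.weight.requires_grad_(False)       # fixed, never trained

        def forward(self, x):
            return self.conv(x)

    x = torch.randn(4, 64, 32, 32)
    smoothed = Blur(64)(x)           # same shape, spatially ensembled features
    ```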
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v1 [stat.ML])
    Clustering is an unsupervised machine learning methodology in which unlabeled elements/objects are grouped together with the aim of constructing well-established clusters whose elements are classified according to their similarity. The goal of this process is to provide a useful aid to researchers, helping them identify patterns in the data. When dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides a deep description of the most widely used clustering methodologies, accompanied by useful presentations concerning suitable parameter selection and initialization. It not only reviews the major elements of the examined clustering techniques but also compares their clustering efficiency on 3 datasets, revealing their weaknesses and capabilities in terms of accuracy and complexity when confronted with discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in relation to the dataset's size.
    An Asymmetric Contrastive Loss for Handling Imbalanced Datasets. (arXiv:2207.07080v1 [cs.LG])
    Contrastive learning is a representation learning method performed by contrasting a sample to other similar samples so that they are brought closely together, forming clusters in the feature space. The learning process is typically conducted using a two-stage training architecture, and it utilizes the contrastive loss (CL) for its feature learning. Contrastive learning has been shown to be quite successful in handling imbalanced datasets, in which some classes are overrepresented while some others are underrepresented. However, previous studies have not specifically modified CL for imbalanced datasets. In this work, we introduce an asymmetric version of CL, referred to as ACL, in order to directly address the problem of class imbalance. In addition, we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL). Results on the FMNIST and ISIC 2018 imbalanced datasets show that AFCL is capable of outperforming CL and FCL in terms of both weighted and unweighted classification accuracies. In the appendix, we provide a full axiomatic treatment on entropy, along with complete proofs.
    Analysis of Catastrophic Forgetting for Random Orthogonal Transformation Tasks in the Overparameterized Regime. (arXiv:2207.06475v1 [cs.LG])
    Overparameterization is known to permit strong generalization performance in neural networks. In this work, we provide an initial theoretical analysis of its effect on catastrophic forgetting in a continual learning setup. We show experimentally that in permuted MNIST image classification tasks, the generalization performance of multilayer perceptrons trained by vanilla stochastic gradient descent can be improved by overparameterization, and the extent of the performance increase achieved by overparameterization is comparable to that of state-of-the-art continual learning algorithms. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem, where each task is related by a random orthogonal transformation. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small if the model is sufficiently overparameterized.
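    The linear two-task setup can be reproduced numerically in a few lines: with underdetermined least squares, gradient descent started from the task-1 weights converges to the task-2 solution closest to them, which has a closed form, so the task-1 risk after learning task 2 can be measured directly as the width grows. Sizes below are illustrative.

    ```python
    # Toy two-task linear regression with a random orthogonal task relation.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    for d in [60, 200, 1000]:                          # increasing overparameterization
        w_star = rng.normal(size=d) / d ** 0.5
        Q = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthogonal transform
        X1 = rng.normal(size=(n, d))
        X2 = rng.normal(size=(n, d)) @ Q
        y1, y2 = X1 @ w_star, X2 @ w_star
        w1 = np.linalg.pinv(X1) @ y1                   # min-norm solution of task 1
        w2 = w1 + np.linalg.pinv(X2) @ (y2 - X2 @ w1)  # GD limit on task 2 from w1
        print(f"d={d:5d}  task-1 risk after task 2: {np.mean((X1 @ w2 - y1) ** 2):.4f}")
    ```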
    A survey on domain adaptation theory: learning bounds and theoretical guarantees. (arXiv:2004.11829v6 [cs.LG] UPDATED)
    The best-known machine learning algorithms, comprising both supervised and semi-supervised methods, work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be reconstructed from newly collected data, which for some applications can be costly or impossible to obtain. It has therefore become necessary to develop approaches that reduce the need and the effort to obtain new labeled samples, by exploiting data that are available in related areas and using them further across similar fields. This has given rise to a new machine learning framework known as transfer learning: a learning setting inspired by the capability of a human being to extrapolate knowledge across tasks to learn more efficiently. Despite the large number of different transfer learning scenarios, the main objective of this survey is to provide an overview of the state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning, called domain adaptation. In this sub-field, the data distribution is assumed to change across the training and the test data, while the learning task remains the same. We provide a first up-to-date description of existing results on the domain adaptation problem, covering learning bounds based on different statistical learning frameworks.
    How do tuna schools associate to dFADs? A study using echo-sounder buoys to identify global patterns. (arXiv:2207.07049v1 [stat.ML])
    Based on the data gathered by echo-sounder buoys attached to drifting Fish Aggregating Devices (dFADs) across tropical oceans, the current study applies a Machine Learning protocol to examine the temporal trends of tuna schools' association to drifting objects. Using a binary output, metrics typically used in the literature were adapted to account for the fact that the entire tuna aggregation under the dFAD was considered. The median time it took tuna to colonize the dFADs for the first time varied between 25 and 43 days, depending on the ocean, and the longest soak and colonization times were registered in the Pacific Ocean. The tuna schools' Continuous Residence Times were generally shorter than Continuous Absence Times (median values between 5 and 7 days, and 9 and 11 days, respectively), in line with the results found by previous studies. Using a regression output, two novel metrics, namely aggregation time and disaggregation time, were estimated to obtain further insight into the symmetry of the aggregation process. Across all oceans, the time it took for the tuna aggregation to depart from the dFADs was not significantly longer than the time it took for the aggregation to form. The value of these results in the context of the "ecological trap" hypothesis is discussed, and further analyses to enrich and make use of this data source are proposed.
    A Bayesian Lasso based Sparse Learning Model. (arXiv:1908.07220v3 [stat.ML] UPDATED)
    The Bayesian Lasso is constructed in the linear regression framework and applies Gibbs sampling to estimate the regression parameters. This paper develops a new sparse learning model, named the Bayesian Lasso Sparse (BLS) model, that uses the hierarchical model formulation of the Bayesian Lasso. The main difference from the original Bayesian Lasso lies in the estimation procedure: the BLS method uses a learning algorithm based on the type-II maximum likelihood procedure. As opposed to the Bayesian Lasso, the BLS provides sparse estimates of the regression parameters. The BLS method is also derived for nonlinear supervised learning problems by introducing kernel functions. We compare the BLS model to the well-known Relevance Vector Machine, the Fast Laplace method, the Bayesian Lasso, and the Lasso, on both simulated and real data. The numerical results show that the BLS is sparse and precise, especially when dealing with noisy and irregular datasets.
    Several Approximation Algorithms for Sparse Best Rank-1 Approximation to Higher-Order Tensors. (arXiv:2012.03092v2 [math.NA] UPDATED)
    Sparse tensor best rank-1 approximation (BR1Approx), which is a sparsity generalization of the dense tensor BR1Approx, and is a higher-order extension of the sparse matrix BR1Approx, is one of the most important problems in sparse tensor decomposition and related problems arising from statistics and machine learning. By exploiting the multilinearity as well as the sparsity structure of the problem, four approximation algorithms are proposed, which are easily implemented, of low computational complexity, and can serve as initial procedures for iterative algorithms. In addition, theoretically guaranteed worst-case approximation lower bounds are proved for all the algorithms. We provide numerical experiments on synthetic and real data to illustrate the effectiveness of the proposed algorithms.
    A Spectral Representation of Kernel Stein Discrepancy with Application to Goodness-of-Fit Tests for Measures on Infinite Dimensional Hilbert Spaces. (arXiv:2206.04552v2 [math.ST] UPDATED)
    Kernel Stein discrepancy (KSD) is a widely used kernel-based measure of discrepancy between probability measures. It is often employed in the scenario where a user has a collection of samples from a candidate probability measure and wishes to compare them against a specified target probability measure. A useful property of KSD is that it may be calculated with samples from only the candidate measure and without knowledge of the normalising constant of the target measure. KSD has been employed in a range of settings including goodness-of-fit testing, parametric inference, MCMC output assessment and generative modelling. Two main issues with current KSD methodology are (i) the lack of applicability beyond the finite dimensional Euclidean setting and (ii) a lack of clarity on what influences KSD performance. This paper provides a novel spectral representation of KSD which remedies both of these, making KSD applicable to Hilbert-valued data and revealing the impact of kernel and Stein operator choice on the KSD. We demonstrate the efficacy of the proposed methodology by performing goodness-of-fit tests for various Gaussian and non-Gaussian functional models in a number of synthetic data experiments.
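    In the familiar finite-dimensional setting, KSD against a standard normal target with an RBF kernel is a short computation from the Langevin Stein kernel; this sketch is 1-D with illustrative bandwidth and sample sizes, and does not implement the paper's spectral representation.

    ```python
    # V-statistic estimate of KSD^2 against N(0,1) with an RBF kernel (1-D).
    import numpy as np

    def ksd_rbf_gaussian(x, h=1.0):
        d = x[:, None] - x[None, :]
        k = np.exp(-d ** 2 / (2 * h ** 2))
        s = -x                                          # score of N(0,1): d/dx log p
        u = (s[:, None] * s[None, :] * k                # Langevin Stein kernel terms
             + s[:, None] * (d / h ** 2) * k            # s(x) * dk/dy
             + s[None, :] * (-d / h ** 2) * k           # s(y) * dk/dx
             + (1 / h ** 2 - d ** 2 / h ** 4) * k)      # d2k/dxdy
        return u.mean()

    print(ksd_rbf_gaussian(np.random.randn(500)))        # near 0 for true samples
    print(ksd_rbf_gaussian(np.random.randn(500) + 1.0))  # larger for shifted samples
    ```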
    Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data. (arXiv:2112.07602v4 [stat.ME] UPDATED)
    A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance. In this paper, we propose a novel cross-validation-like methodology to address this challenge. The key insight of our procedure is that the noisy (but unbiased) difference-of-means estimate can be used as a ground truth "label" on a portion of the RCT, to test the performance of an estimator trained on the other portion. We combine this insight with an aggregation scheme, which borrows statistical strength across a large collection of RCTs, to present an end-to-end methodology for judging an estimator's ability to recover the underlying treatment effect as well as produce an optimal treatment "roll out" policy. We evaluate our methodology across 699 RCTs implemented in the Amazon supply chain. In this heavy-tailed setting, our methodology suggests that procedures that aggressively downweight or truncate large values, while introducing bias, lower the variance enough to ensure that the treatment effect is more accurately estimated.
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v1 [stat.ML])
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.
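    For readers who want to try a low-order fANOVA model directly, the EBM baseline the paper compares against is available off the shelf. A minimal sketch, assuming the `interpret` package (GAMI-Tree itself is the paper's new algorithm and is not part of that package):

    ```python
    # Fit an Explainable Boosting Machine: main effects plus pairwise interactions.
    from interpret.glassbox import ExplainableBoostingClassifier
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)
    ebm = ExplainableBoostingClassifier(interactions=10)  # budget of 10 pair terms
    ebm.fit(X, y)
    print(ebm.predict_proba(X[:3]))                       # sklearn-style interface
    ```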
    Graph Neural Network Bandits. (arXiv:2207.06456v1 [cs.LG])
    We consider the bandit optimization problem with the reward function defined over graph-structured data. This problem has important applications in molecule design and drug discovery, where the reward is naturally invariant to graph permutations. The key challenges in this setting are scaling to large domains, and to graphs with many nodes. We resolve these challenges by embedding the permutation invariance into our model. In particular, we show that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the Reproducing Kernel Hilbert Space of a permutation-invariant additive kernel. By establishing a novel connection between such kernels and the graph neural tangent kernel (GNTK), we introduce the first GNN confidence bound and use it to design a phased-elimination algorithm with sublinear regret. Our regret bound depends on the GNTK's maximum information gain, which we also provide a bound for. While the reward function depends on all $N$ node features, our guarantees are independent of the number of graph nodes $N$. Empirically, our approach exhibits competitive performance and scales well on graph-structured domains.
    Improving the Accuracy of Marginal Approximations in Likelihood-Free Inference via Localisation. (arXiv:2207.06655v1 [stat.ME])
    Likelihood-free methods are an essential tool for performing inference for implicit models which can be simulated from, but for which the corresponding likelihood is intractable. However, common likelihood-free methods do not scale well to a large number of model parameters. A promising approach to high-dimensional likelihood-free inference involves estimating low-dimensional marginal posteriors by conditioning only on summary statistics believed to be informative for the low-dimensional component, and then combining the low-dimensional approximations in some way. In this paper, we demonstrate that such low-dimensional approximations can be surprisingly poor in practice for seemingly intuitive summary statistic choices. We describe an idealized low-dimensional summary statistic that is, in principle, suitable for marginal estimation. However, a direct approximation of the idealized choice is difficult in practice. We thus suggest an alternative approach to marginal estimation which is easier to implement and automate. Given an initial choice of low-dimensional summary statistic that might only be informative about a marginal posterior location, the new method improves performance by first crudely localising the posterior approximation using all the summary statistics to ensure global identifiability, followed by a second step that hones in on an accurate low-dimensional approximation using the low-dimensional summary statistic. We show that the posterior this approach targets can be represented as a logarithmic pool of posterior distributions based on the low-dimensional and full summary statistics, respectively. The good performance of our method is illustrated in several examples.
    Rethinking Multidimensional Discriminator Output for Generative Adversarial Networks. (arXiv:2109.03378v3 [stat.ML] UPDATED)
    The study of multidimensional discriminator (critic) output for Generative Adversarial Networks has been underexplored in the literature. In this paper, we generalize the Wasserstein GAN framework to take advantage of multidimensional critic output and explore its properties. We also introduce a square-root velocity transformation (SRVT) block which favors training in the multidimensional setting. Proofs of properties are based on our proposed maximal p-centrality discrepancy, which is bounded above by p-Wasserstein distance and fits the Wasserstein GAN framework with multidimensional critic output n. Especially when n = 1 and p = 1, the proposed discrepancy equals 1-Wasserstein distance. Theoretical analysis and empirical evidence show that high-dimensional critic output has its advantage on distinguishing real and fake distributions, and benefits faster convergence and diversity of results.
    Continuous-time Analysis for Variational Inequalities: An Overview and Desiderata. (arXiv:2207.07105v1 [stat.ML])
    Algorithms that solve zero-sum games, multi-objective agent objectives, or, more generally, variational inequality (VI) problems are notoriously unstable on general problems. Owing to the increasing need for solving such problems in machine learning, this instability has been highlighted in recent years as a significant research challenge. In this paper, we provide an overview of recent progress in the use of continuous-time perspectives in the analysis and design of methods targeting the broad VI problem class. Our presentation draws parallels between single-objective problems and multi-objective problems, highlighting the challenges of the latter. We also formulate various desiderata for algorithms that apply to general VIs and we argue that achieving these desiderata may profit from an understanding of the associated continuous-time dynamics.
    Likelihood Training of Schr\"odinger Bridge using Forward-Backward SDEs Theory. (arXiv:2110.11291v4 [stat.ML] UPDATED)
    Schr\"odinger Bridge (SB) is an entropy-regularized optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Scored-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing log-likelihood objectives.This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory - a mathematical methodology appeared in stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalizes the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing applications of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10. Our code is available at https://github.com/ghliu/SB-FBSDE.
    Randomly pivoted Cholesky: Practical approximation of a kernel matrix with few entry evaluations. (arXiv:2207.06503v1 [math.NA])
    Randomly pivoted Cholesky (RPCholesky) is a natural algorithm for computing a rank-k approximation of an N x N positive semidefinite (psd) matrix. RPCholesky can be implemented with just a few lines of code. It requires only (k+1)N entry evaluations and O(k^2 N) additional arithmetic operations. This paper offers the first serious investigation of its experimental and theoretical behavior. Empirically, RPCholesky matches or improves on the performance of alternative algorithms for low-rank psd approximation. Furthermore, RPCholesky provably achieves near-optimal approximation guarantees. The simplicity, effectiveness, and robustness of this algorithm strongly support its use in scientific computing and machine learning applications.
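    The "few lines of code" claim is literal: sample a pivot with probability proportional to the residual diagonal, append the corresponding factor column, and downdate the diagonal. The sketch below holds the matrix in memory for clarity; the point of the algorithm is that only about (k+1)N entries of A are ever evaluated.

    ```python
    # RPCholesky: rank-k psd approximation A ~ F @ F.T with random pivoting.
    import numpy as np

    def rp_cholesky(A, k, rng=np.random.default_rng()):
        n = A.shape[0]
        F = np.zeros((n, k))
        d = np.diag(A).copy()                       # residual diagonal
        for i in range(k):
            s = rng.choice(n, p=d / d.sum())        # randomly pivoted selection
            g = A[:, s] - F[:, :i] @ F[s, :i]       # residual of the pivot column
            F[:, i] = g / np.sqrt(g[s])
            d = np.maximum(d - F[:, i] ** 2, 0.0)   # downdate; clip rounding error
        return F

    X = np.random.randn(300, 5)
    A = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))  # RBF kernel matrix
    F = rp_cholesky(A, k=30)
    print(np.linalg.norm(A - F @ F.T) / np.linalg.norm(A))        # small residual
    ```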
    Fully Decentralized Model-based Policy Optimization for Networked Systems. (arXiv:2207.06559v1 [cs.LG])
    Reinforcement learning algorithms require a large number of samples; this often limits their real-world application to even simple tasks. The challenge is more pronounced in multi-agent tasks, as each step of operation is more costly, requiring communication or the shifting of resources. This work aims to improve the data efficiency of multi-agent control through model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors, and propose the decentralized model-based policy optimization framework (DMPO). In our method, each agent learns a dynamic model to predict future states and broadcasts its predictions by communication, and the policies are then trained under the model rollouts. To alleviate the bias of model-generated data, we restrict model usage to generating myopic rollouts, thus reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient is a close approximation to the true policy gradient. We evaluate our algorithm on several benchmarks for intelligent transportation systems: connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods using true models.
    Uncertainty quantification for predictions of atomistic neural networks. (arXiv:2207.06916v1 [physics.chem-ph])
    The value of uncertainty quantification on predictions for trained neural networks (NNs) on quantum chemical reference data is quantitatively explored. For this, the architecture of the PhysNet NN was suitably modified and the resulting model was evaluated with different metrics to quantify calibration, quality of predictions, and whether prediction error and the predicted uncertainty can be correlated. The results from training on the QM9 database and evaluating data from the test set within and outside the distribution indicate that error and uncertainty are not linearly related. The results clarify that noise and redundancy complicate property prediction for molecules even in cases for which changes - e.g. double bond migration in two otherwise identical molecules - are small. The model was then applied to a real database of tautomerization reactions. Analysis of the distance between members in feature space combined with other parameters shows that redundant information in the training dataset can lead to large variances and small errors whereas the presence of similar but unspecific information returns large errors but small variances. This was, e.g., observed for nitro-containing aliphatic chains for which predictions were difficult although the training set contained several examples for nitro groups bound to aromatic molecules. This underlines the importance of the composition of the training data and provides chemical insight into how this affects the prediction capabilities of a ML model. Finally, the approach put forward can be used for information-based improvement of chemical databases for target applications through active learning optimization.
    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. (arXiv:2207.06569v1 [cs.LG])
    The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied $\textit{benign overfitting}$, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks $\textit{do not fit benignly}$: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime $\textit{tempered overfitting}$, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with powerlaw spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.

  • Open

    Cosplayer Face Generator using Style GAN 2
    submitted by /u/rubikvn2100 [link] [comments]  ( 86 min )
    Heavenly Hell AI Concept
    AI Art Credit: https://discord.gg/x3s9Ye2h2A submitted by /u/Old-Pumpkin4899 [link] [comments]  ( 85 min )
    Hey guys, I started a new podcast where I interview guests from different subreddits and was wondering if anyone wanted to come on to talk about artificial intelligence. Message me if you want to come on and you have knowledge on ai.
    submitted by /u/Money_Push [link] [comments]  ( 86 min )
    Abandoned Dream
    AI Art Credit: https://discord.gg/x3s9Ye2h2A submitted by /u/Old-Pumpkin4899 [link] [comments]  ( 85 min )
    Disco Diffusion 5.6 update
    Very impressed with the portrait generator in the new Disco Diffusion 5.6 update! Here are some images I made with it. I have also included all the prompts in a video on my YouTube page where I demo it: https://www.youtube.com/watch?v=1Gp5l9EUX9I submitted by /u/prfitofthesngularity [link] [comments]  ( 86 min )
    1300+ personal dall-e 2 image dump
    Image dump 1 submitted by /u/OneFinding1429 [link] [comments]  ( 85 min )
    Google AI Introduces ‘Mood Board Search’: A Web-Based Tool That Lets You Train A Computer To Recognize Visual Concepts Using Mood Boards And Machine Learning
    Google recently launched Mood Board Search, a new ML-powered research tool that leverages mood boards as a query over image collections. With the help of this tool, users can independently define and evoke visual notions. A mood board search can be used for ambiguous inquiries, such as “peaceful,” or for words and specific images that might not be exact enough to yield beneficial results in a regular search. These subjective questions primarily concern abstract information that is frequently ignored in pictures. The team is still in the developing phase of the research tool. ✅ Open-Source Code Release | Built with Tensorflow. ✅ A playful way to explore and analyze image collections using mood boards as your search query ✅ Mood Board Search takes advantage of pre-trained computer vision models, such as GoogLeNet and MobileNet, and a machine learning approach called Concept Activation Vectors (CAVs). Continue reading | Check out the code and tool. submitted by /u/ai-lover [link] [comments]  ( 87 min )
    I've been using OpenAI's Dall-E 2 to generate webcomic panels which I then add my own captions to
    submitted by /u/PerryJ [link] [comments]  ( 86 min )
    Top 5 Artificial Intelligence Stocks to Watch in 2022
    submitted by /u/Brilliant_Scratch_63 [link] [comments]  ( 86 min )
    Built a hologram assistant with machine learning
    submitted by /u/RedRainHoloAI [link] [comments]  ( 86 min )
    A Dog in a Fez
    submitted by /u/uupstairs [link] [comments]  ( 86 min )
    New Google DeepMind PLATO Learns Physics With Computer Vision | Blackrock Brain Computer Interface Lets Quadriplegic Man Control 2 Robot Arms
    submitted by /u/getrich_or_diemining [link] [comments]  ( 86 min )
    Colossal-AI Seamlessly Accelerates Large Models at Low Costs with Hugging Face
    Forbes News, the world's leading voice, recently declared large AI models as one of six AI trends to watch for in 2022. As large-scale AI models continue their superior performances across different domains, trends emerge, leading to distinguished and efficient AI applications that have never been seen in the industry. For example, Microsoft-owned GitHub and OpenAI partnered to launch Copilot recently. Copilot plays the role of an AI pair programmer, offering suggestions for code and entire functions in real time. Such developments continue to make coding easier than before. Another example released by OpenAI, DALL-E 2, is a powerful tool which creates original and realistic images as well as art from only simple text. One month later, Google a…  ( 98 min )
    The Interpretable Natural Language Processing (INLP) AGI-22 Workshop will be held August 19–22 in Seattle, Washington and in cyberspace.
    submitted by /u/akolonin [link] [comments]  ( 86 min )
    Are these types of videos summarized and recapped with AI or are they manually recapped by a human? Example: https://youtu.be/TK76DFJskPs
    submitted by /u/ElonJuniorMusk [link] [comments]  ( 86 min )
    Any characters like Jarvis from Iron Man?
    I'm doing a project about AI assistants and need some examples. I was thinking about maybe Cortana from Halo but haven't played the games, so I don't really know if it fits the same purpose. Any ideas? submitted by /u/AsafL910 [link] [comments]  ( 88 min )
    How I Used Midjourney to Create an Original Scene
    I used Midjourney AI to create this epic scene in Blender. I took the generated images as concept art and using Blender and ZBrush I came to this result. Midjourney uses an AI to generate images. As an artist, I thought it would be cool to compare myself to AI-generated art. And find out what the usefulness of this tool can be in our workflow. It turns out I really like this for quick idea generation and concepting. So I documented my dive into Midjourney's AI generation process and you can find the video here or click this link: https://youtu.be/0JgWL3_CWbc submitted by /u/mvartz [link] [comments]  ( 86 min )
    Why AI is already self-aware - a thought experiment
    submitted by /u/PolymorphismPrince [link] [comments]  ( 87 min )
    How to build a model for detecting "intents" (tags based on input text as Watson assistant) in text
    submitted by /u/Independent-Tear-619 [link] [comments]  ( 86 min )
    Is there any AI sound generator that is not voice?
    Like Dall-E but for sounds instead of for images. All I find are voice generators but I'm thinking more sound effects of all kinds. Is there something like this yet? submitted by /u/Background_Ad_7821 [link] [comments]  ( 88 min )
    Use OpenAI's Clip to rate your images
    You can use this Clip powered website I built to rate your images: https://tom-doerr-ai-photo-rater-streamlit-app-f924gb.streamlitapp.com/ What do you think? submitted by /u/tomd_96 [link] [comments]  ( 86 min )
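    A rater like this typically boils down to scoring an image against text prompts with CLIP and softmaxing the image-text logits. A minimal sketch assuming the Hugging Face `transformers` CLIP checkpoint; the prompts and file name are illustrative and not necessarily what the linked app uses.

    ```python
    # Score a photo by its CLIP affinity to a positive vs. negative prompt.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")                 # hypothetical input file
    prompts = ["a stunning, well-composed photograph", "a dull, blurry snapshot"]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    print(f"quality score: {probs[0, 0].item():.2f}")  # weight on the positive prompt
    ```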
  • Open

    "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action", Shah et al 2022 (SayCan-like w/CLIP+GPT-3+ViNG for outdoors robotics)
    submitted by /u/gwern [link] [comments]  ( 86 min )
    "Prompting Decision Transformer for Few-Shot Policy Generalization", Xu et al 2022
    submitted by /u/gwern [link] [comments]  ( 86 min )
    "Effective Mutation Rate Adaptation through Group Elite Selection", Kumar et al 2022
    submitted by /u/gwern [link] [comments]  ( 86 min )
    "Transformer Neural Processes: Uncertainty-Aware Meta Learning Via Sequence Modeling", Nguyen & Grover 2022
    submitted by /u/gwern [link] [comments]  ( 86 min )
    DDPG outputs multiple actions simultaneously, e.g. a vehicle has three actions: throttle, brake and steering angle. Is there a solid solution? Thank you
    DDPG outputs multiple actions simultaneously. For example, a vehicle has three actions: throttle, brake, and steering angle. Is there a solid solution? Thank you submitted by /u/Ecstatic_Leg9476 [link] [comments]  ( 87 min )
    Successful uses of Value-based method in competitive games?
    It seems like policy gradient methods work extremely well in a competitive game scenario (use of policy gradients in Go and Dota 2 for instance). I'm wondering if any pure value-based methods have seen any level of success? submitted by /u/Spiritual_Dinner9232 [link] [comments]  ( 86 min )
  • Open

    Achieve enterprise-grade monitoring for your Amazon SageMaker models using Fiddler
    This is a guest blog post by Danny Brock, Rajeev Govindan and Krishnaram Kenthapadi at Fiddler AI. Your Amazon SageMaker models are live. They’re handling millions of inferences each day and driving better business outcomes for your company. They’re performing exactly as well as the day they were launched. Er, wait. Are they? Maybe. Maybe […]  ( 7 min )
    Track your ML experiments end to end with Data Version Control and Amazon SageMaker Experiments
    Data scientists often work towards understanding the effects of various data preprocessing and feature engineering strategies in combination with different model architectures and hyperparameters. Doing so requires you to cover large parameter spaces iteratively, and it can be overwhelming to keep track of previously run configurations and results while keeping experiments reproducible. This post walks […]  ( 13 min )
    Build a predictive maintenance solution with Amazon Kinesis, AWS Glue, and Amazon SageMaker
    Organizations are increasingly building and using machine learning (ML)-powered solutions for a variety of use cases and problems, including predictive maintenance of machine parts, product recommendations based on customer preferences, credit profiling, content moderation, fraud detection, and more. In many of these scenarios, the effectiveness and benefits derived from these ML-powered solutions can be further […]  ( 13 min )
  • Open

    High-Fidelity Synthetic Data for Data Engineers and Data Scientists Alike
    Sponsored Post If you’re a data engineer or data scientist, you know how hard it is to generate and maintain realistic data at scale. And to guarantee data privacy protection, in addition to all your day-to-day responsibilities? OOF. Talk about a heavy lift. But in today’s world, efficient data de-identification is no longer optional for […] The post High-Fidelity Synthetic Data for Data Engineers and Data Scientists Alike appeared first on Machine Learning Mastery.  ( 10 min )
  • Open

    [Discussion] Code editor for transforming data/building ML pipelines
    Check out our new open source code editor for transforming data and building ML pipelines: https://github.com/mage-ai/mage-ai If you’re available, I’d love to hop on a quick Zoom to help you get set up. In the meantime, here is the install guide: https://github.com/mage-ai/mage-ai#using-pip and a short tutorial: https://github.com/mage-ai/mage-ai/blob/master/docs/tutorials/train_titanic_model/README.md I’d love to get your feedback on whether this is useful to you or not. Thank you so much! submitted by /u/ollie_wollie_rocks [link] [comments]  ( 87 min )
    [D] Best way to increase LSTM/GRU capacity
    LSTM and GRU layers have a fixed set of weights that depends only on the input size and the number of units. But what if I have the feeling that the model's parameters are not enough to capture and process the data correctly? In other words, how do I increase the capacity of these models? Some ideas that came to me (see the sketch below):
    - Preprocess each vector of the sequence with another model, then feed the output vectors to the LSTM/GRU
    - Just use a larger number of units in the LSTM/GRU (however, this might create a big mismatch between the input and output size)
    - Develop an LSTM/GRU that uses more than one layer in each step (e.g., a k-layer neural network instead of a weight matrix)
    What do you think is best? Do you know any other methods? submitted by /u/fedetask [link] [comments]  ( 88 min )
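    To make the first two options concrete, a minimal PyTorch sketch (layer sizes are illustrative): a per-step encoder widens each input vector before a stacked GRU, and an output projection removes the mismatch between hidden and output sizes:

        import torch.nn as nn

        class HighCapacityGRU(nn.Module):
            def __init__(self, in_dim: int, hidden: int = 512, out_dim: int = 64):
                super().__init__()
                # option 1: preprocess each time step with another model (here a small MLP)
                self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
                # option 2: more units plus stacked layers in the recurrent core
                self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
                # the projection fixes the hidden-size / output-size mismatch
                self.proj = nn.Linear(hidden, out_dim)

            def forward(self, x):               # x: (batch, time, in_dim)
                h, _ = self.gru(self.encoder(x))
                return self.proj(h)             # (batch, time, out_dim)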
    [P] The technology behind BLOOM training
    Last Tuesday, BigScience released BLOOM, the world's largest open multilingual language model. Stas Bekman from the BigScience & Hugging Face team just published a blog post about the technology and engineering behind training the 176 billion parameter model, both in terms of hardware (384 80GB A100 GPUs) and software (Megatron-DeepSpeed). submitted by /u/feconroses [link] [comments]  ( 87 min )
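    The full 176B checkpoint needs a multi-node GPU cluster, but the model family is easy to poke at locally. A minimal sketch using a smaller released BLOOM variant via transformers (the variant choice is ours, not part of the post):

        from transformers import AutoModelForCausalLM, AutoTokenizer

        name = "bigscience/bloom-560m"  # small sibling of the 176B model, runnable on one machine
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)

        inputs = tokenizer("BigScience released BLOOM, a model that", return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=20)
        print(tokenizer.decode(out[0], skip_special_tokens=True))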
    [R] LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action - Google 2022
    Paper: https://arxiv.org/abs/2207.04429 Project page: https://sites.google.com/view/lmnav Github: https://github.com/blazejosinski/lm_nav Summary Video: https://www.youtube.com/watch?v=wkVbuZQb_5g Abstract: Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions. For videos of our experiments, code release, and an interactive Colab notebook that runs in your browser, please check out the project page linked above. submitted by /u/Singularian2501 [link] [comments]  ( 88 min )
    [R] Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
    submitted by /u/GratisSlagroom [link] [comments]  ( 89 min )
    [D] "No language left behind" A 200 language translation model by Meta AI
    Just discovered this new model by Meta AI when browsing huggingface Paper: https://ai.facebook.com/research/publications/no-language-left-behind-scaling-human-centered-machine-translation/ Model on HuggingFace: https://huggingface.co/facebook/nllb-200-3.3B Code: https://github.com/facebookresearch/fairseq/tree/nllb The largest Mixture-of-Experts model seems really interesting in its capabilities. What do you guys think ? submitted by /u/Emergency_Apricot_77 [link] [comments]  ( 87 min )
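    A quick way to try it, as a hedged sketch assuming the Hugging Face transformers API (the distilled 600M checkpoint stands in for the 3.3B one linked above; language codes follow NLLB's FLORES-200 convention):

        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

        name = "facebook/nllb-200-distilled-600M"  # smaller sibling of nllb-200-3.3B
        tokenizer = AutoTokenizer.from_pretrained(name, src_lang="eng_Latn")
        model = AutoModelForSeq2SeqLM.from_pretrained(name)

        inputs = tokenizer("Machine translation for 200 languages.", return_tensors="pt")
        out = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),  # target language
            max_new_tokens=40,
        )
        print(tokenizer.decode(out[0], skip_special_tokens=True))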
    [D] Are there any rejected papers that ended up having significant impact in the long run?
    There seems to be a general consensus that getting a paper accepted can be difficult due to various problems with our current peer-review system. That makes me wonder, are there any notable papers that had a difficult time getting accepted but ended up significantly impacting the field or ended up laying the foundation for more high impact publications? submitted by /u/TheSurvivingHalf [link] [comments]  ( 94 min )
    [D] Is sampling distractors from the same mini batch during training a good idea?
    Hello, I have an NLP Transformer model, and for my use case I want to add a binary classifier as an auxiliary task. I will give a random response and the ground-truth response to the classifier and expect it to distinguish them. Instead of modifying the dataset, is it a good idea to just shift the current mini-batch during training in order to generate distractors? For example, say the batch size is 4. We will have four response sequences in our batch: [[1...], [2...], [3...], [4...]], so I can copy and shift them (by 2), for example: [[3...], [4...], [1...], [2...]]. Then I can stack the ground truth and the shifted batch to get [ [[1...], [3...]], [[2...], [4...]], [[3...], [1...]], [[4...], [2...]] ] and feed that to the classifier, where the labels are [ [1, 0], [1, 0], [1, 0], [1, 0] ]. Furthermore, I can randomize the order of (truth, distractor) pairs in each batch, so sometimes the labels will be [1, 0] and other times [0, 1]. Finally, if there's a concern that, because of the dataloader order, a batch may contain related responses rather than completely random ones - I would say this is actually an advantage, because a classifier that can distinguish the right response from a related one is a stronger classifier. Do you think this makes sense, and what are the possible drawbacks? submitted by /u/IllustriousCicada603 [link] [comments]  ( 89 min )
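    Not an authoritative answer, but the shifting scheme itself is only a few lines; a PyTorch sketch of the idea described above (the function name and shapes are ours):

        import torch

        def distractor_pairs(responses: torch.Tensor, shift: int = 2):
            """responses: (B, T) token ids. Returns pairs (B, 2, T) and labels (B,),
            where labels[i] is the position of the ground-truth response in pairs[i]."""
            distractors = torch.roll(responses, shifts=shift, dims=0)  # shift within the batch
            pairs = torch.stack([responses, distractors], dim=1)       # ground truth first
            flip = torch.rand(responses.size(0)) < 0.5                 # randomize pair order
            pairs[flip] = pairs[flip].flip(dims=[1])
            return pairs, flip.long()  # label 0: truth first, 1: truth second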
    [D] LSTM RNN: Slice data along time axis during training?
    I’m building an LSTM network where the input data is high dimensional, both along the time axis and at each time step. I am of course using a TensorFlow Dataset to batch the input data. Here’s my question: is there a way to provide slices of data along the time axis to the RNN? Say my data is x[n, p, …], with n samples and p time points, and I use batch size = 1. Then can I provide data in the following sequence for the first batch (i.e., n=0)? x[0, 0, …], then x[0, 1, …], then x[0, 2, …]. The RNN only cares about calculating hidden states at single time points, so presumably training-wise it should make no difference if data slices from single time points are loaded into memory and removed after use. It seems like Keras/TensorFlow are designed to accept a “batch” as a unit; is there a way to further split the data into smaller chunks? Thank you, I appreciate any suggestions and advice from the community! submitted by /u/besse [link] [comments]  ( 88 min )
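    One way to get per-slice updates in Keras is a stateful LSTM: the hidden state carries across consecutive time slices of the same sample, though gradients are truncated at slice boundaries. A hedged sketch with placeholder sizes and random data:

        import numpy as np
        import tensorflow as tf

        feat_dim, chunk_len, n_chunks = 16, 32, 4  # placeholder dimensions

        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(128, stateful=True, return_sequences=True,
                                 batch_input_shape=(1, chunk_len, feat_dim)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

        # one sample (n=0) split into consecutive slices along the time axis
        x = np.random.randn(1, n_chunks * chunk_len, feat_dim).astype("float32")
        y = np.random.randn(1, n_chunks * chunk_len, 1).astype("float32")
        for i in range(n_chunks):
            s = slice(i * chunk_len, (i + 1) * chunk_len)
            model.train_on_batch(x[:, s], y[:, s])  # hidden state persists across slices
        model.reset_states()                        # clear the state before the next sample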
  • Open

    Towards Reliability in Deep Learning Systems
    Posted by Dustin Tran and Balaji Lakshminarayanan, Research Scientists, Google Research Deep learning models have made impressive progress in vision, language, and other modalities, particularly with the rise of large-scale pre-training. Such models are most accurate when applied to test data drawn from the same distribution as their training set. However, in practice, the data confronting models in real-world settings rarely match the training distribution. In addition, the models may not be well-suited for applications where predictive performance is only part of the equation. For models to be reliable in deployment, they must be able to accommodate shifts in data distribution and make useful decisions in a broad array of scenarios. In “Plex: Towards Reliability Using Pre-trained Larg…  ( 25 min )
  • Open

    DALL·E 2: Extending Creativity
    As part of our DALL·E 2 research preview, more than 3,000 artists from more than 118 countries have incorporated DALL·E into their creative workflows. The artists in our early access group have helped us discover new uses for DALL·E and have served as  ( 6 min )
  • Open

    New Google DeepMind PLATO Learns Physics With Computer Vision
    submitted by /u/getrich_or_diemining [link] [comments]  ( 86 min )
  • Open

    Future Prospects for Computer Vision Applications in Agriculture
    Computer vision technology has recently attracted a lot of interest in precision agriculture. Computer vision, at the heart of robotics and…  ( 10 min )
  • Open

    Action on Repeat: GFN Thursday Brings Loopmancer With RTX ON to the Cloud
    Investigate the ultimate truth this GFN Thursday with Loopmancer, now streaming to all members on GeForce NOW. Stuck in a death loop, RTX 3080 and Priority members can search for the truth with RTX ON — including NVIDIA DLSS and ray-traced reflections. Plus, players can enjoy the latest Genshin Impact event with the “Summer Fantasia” Read article > The post Action on Repeat: GFN Thursday Brings Loopmancer With RTX ON to the Cloud appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    DSC Weekly 12 July 2022: The Emergence of the Modern Studio Model
    Announcements Achieving endpoint visibility to ward off the threat of a breach has never been more important than it is in the age of data proliferation and hybrid workplaces. Multiple endpoints and locations heighten that risk, making it essential for CISOs and IT security teams to overcome common challenges. Find out how organizations can reach… Read More »DSC Weekly 12 July 2022: The Emergence of the Modern Studio Model The post DSC Weekly 12 July 2022: The Emergence of the Modern Studio Model appeared first on Data Science Central.  ( 22 min )
  • Open

    Teaching AI to ask clinical questions
    Researchers have made strides toward machine-learning models that can help doctors more efficiently find information in a patient’s health record.  ( 7 min )
  • Open

    On the existence of global minima and convergence analyses for gradient descent methods in the training of deep neural networks. (arXiv:2112.09684v2 [math.OC] UPDATED)
    In this article we study fully-connected feedforward deep ReLU ANNs with an arbitrarily large number of hidden layers, and we prove convergence of the risk of the GD optimization method with random initializations in the training of such ANNs under three assumptions: (i) that the unnormalized probability density function of the distribution of the input data of the considered supervised learning problem is piecewise polynomial, (ii) that the target function (describing the relationship between input data and output data) is piecewise polynomial, and (iii) that the risk function of the considered supervised learning problem admits at least one regular global minimum. In addition, in the special situation of shallow ANNs with just one hidden layer and one-dimensional input, we verify the last assumption by proving that, in the training of such shallow ANNs, for every Lipschitz continuous target function there exists a global minimum in the risk landscape. Finally, in the training of deep ANNs with ReLU activation we also study solutions of gradient flow (GF) differential equations, and we prove that every non-divergent GF trajectory converges with a polynomial rate of convergence to a critical point (in the sense of limiting Fr\'echet subdifferentiability). Our mathematical convergence analysis builds on ideas from our previous article (Eberle et al.), on tools from real algebraic geometry such as the concept of semi-algebraic functions and generalized Kurdyka-Lojasiewicz inequalities, on tools from functional analysis such as the Arzel\`a-Ascoli theorem, on tools from nonsmooth analysis such as the concept of limiting Fr\'echet subgradients, as well as on the fact, revealed by Petersen et al., that the set of realization functions of shallow ReLU ANNs with fixed architecture forms a closed subset of the set of continuous functions.
    Rotting Infinitely Many-armed Bandits. (arXiv:2201.12975v2 [cs.LG] UPDATED)
    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
    Multi-Atlas Segmentation and Spatial Alignment of the Human Embryo in First Trimester 3D Ultrasound. (arXiv:2202.06599v2 [eess.IV] UPDATED)
    Segmentation and spatial alignment of ultrasound (US) imaging data acquired in the first trimester are crucial for monitoring human embryonic growth and development throughout this critical period of life. Current approaches are either manual or semi-automatic and are therefore very time-consuming and prone to errors. To automate these tasks, we propose a multi-atlas framework for automatic segmentation and spatial alignment of the embryo using deep learning with minimal supervision. Our framework learns to register the embryo to an atlas, which consists of US images acquired at a range of gestational ages (GA), segmented and spatially aligned to a predefined standard orientation. From this, we can derive the segmentation of the embryo and put the embryo in standard orientation. US images acquired from 8+0 to 12+6 weeks GA were used, and eight subjects were selected as atlases. We evaluated different fusion strategies to incorporate multiple atlases: 1) training the framework using atlas images from a single subject, 2) training the framework with data from all available atlases, and 3) ensembling the frameworks trained per subject. To evaluate performance, we calculated the Dice score over the test set. We found that training the framework using all available atlases outperformed ensembling and gave results similar to the best of the frameworks trained on a single subject. Furthermore, we found that selecting images from the four atlases closest in GA out of all available atlases, regardless of individual quality, gave the best results, with a median Dice score of 0.72. We conclude that our framework can accurately segment and spatially align the embryo in first-trimester 3D US images and is robust to the variation in quality that existed in the available atlases. Our code is publicly available at: https://github.com/wapbastiaansen/multi-atlas-seg-reg.
    Robust Counterfactual Explanations on Graph Neural Networks. (arXiv:2107.04086v3 [cs.LG] UPDATED)
    Massive deployment of Graph Neural Networks (GNNs) in high-stake applications generates a strong demand for explanations that are robust to noise and align well with human intuition. Most existing methods generate explanations by identifying a subgraph of an input graph that has a strong correlation with the prediction. These explanations are not robust to noise because independently optimizing the correlation for a single input can easily overfit noise. Moreover, they do not align well with human intuition because removing an identified subgraph from an input graph does not necessarily change the prediction result. In this paper, we propose a novel method to generate robust counterfactual explanations on GNNs by explicitly modelling the common decision logic of GNNs on similar input graphs. Our explanations are naturally robust to noise because they are produced from the common decision boundaries of a GNN that govern the predictions of many similar input graphs. The explanations also align well with human intuition because removing the set of edges identified by an explanation from the input graph changes the prediction significantly. Exhaustive experiments on many public datasets demonstrate the superior performance of our method.
    Iterative Linear Quadratic Optimization for Nonlinear Control: Differentiable Programming Algorithmic Templates. (arXiv:2207.06362v1 [math.OC])
    We present the implementation of nonlinear control algorithms based on linear and quadratic approximations of the objective from a functional viewpoint. We present a gradient descent, a Gauss-Newton method, a Newton method, differential dynamic programming approaches with linear quadratic or quadratic approximations, various line-search strategies, and regularized variants of these algorithms. We derive the computational complexities of all algorithms in a differentiable programming framework and present sufficient optimality conditions. We compare the algorithms on several benchmarks, such as autonomous car racing using a bicycle model of a car. The algorithms are coded in a differentiable programming language in a publicly available package.
    Driving Style Recognition Using Interval Type-2 Fuzzy Inference System and Multiple Experts Decision Making. (arXiv:2110.13805v2 [cs.RO] UPDATED)
    Driving styles summarize different driving behaviors that are reflected in the movements of vehicles. These behaviors may indicate a tendency to perform riskier maneuvers, consume more fuel or energy, break traffic rules, or drive carefully. Therefore, this paper presents a driving style recognition system using an Interval Type-2 Fuzzy Inference System with Multiple Experts Decision-Making for classifying drivers as calm, moderate, or aggressive. This system receives as input features the longitudinal and lateral kinematic parameters of the vehicle's motion. Type-2 fuzzy sets are more robust than type-1 fuzzy sets when handling noisy data, because their membership functions are themselves fuzzy sets. In addition, a multiple-experts approach can reduce bias and imprecision when building the fuzzy rulebase, which stores the knowledge of the fuzzy system. The proposed approach was evaluated using descriptive statistical analysis and compared with clustering algorithms and a type-1 fuzzy inference system. The results show a tendency to associate lower kinematic profiles with the driving styles classified by the type-2 fuzzy inference system when compared to the other algorithms, which is in line with the more conservative approach adopted in aggregating the experts' opinions.
    Evaluating the Adversarial Robustness of Adaptive Test-time Defenses. (arXiv:2202.13711v2 [cs.LG] UPDATED)
    Adaptive defenses, which optimize at test time, promise to improve adversarial robustness. We categorize such adaptive test-time defenses, explain their potential benefits and drawbacks, and evaluate a representative variety of the latest adaptive defenses for image classification. Unfortunately, none significantly improve upon static defenses when subjected to our careful case study evaluation. Some even weaken the underlying static model while simultaneously increasing inference computation. While these results are disappointing, we still believe that adaptive test-time defenses are a promising avenue of research and, as such, we provide recommendations for their thorough evaluation. We extend the checklist of Carlini et al. (2019) by providing concrete steps specific to adaptive defenses.
    Model-Based Offline Meta-Reinforcement Learning with Regularization. (arXiv:2202.02929v2 [cs.LG] UPDATED)
    Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL could be outperformed by offline single-task RL methods on tasks with good quality of datasets, indicating that a right balance has to be delicately calibrated between "exploring" the out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we explore model-based offline Meta-RL with regularized Policy Optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using conservative policy evaluation and regularized policy improvement; and the intrinsic tradeoff therein is achieved via striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learnt policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring the performance improvement on new tasks via offline Meta-RL. Experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods.
    How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. (arXiv:2102.08921v2 [cs.LG] UPDATED)
    Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner.
    How to Train Your Wide Neural Network Without Backprop: An Input-Weight Alignment Perspective. (arXiv:2106.08453v2 [cs.LG] UPDATED)
    Recent works have examined theoretical and empirical properties of wide neural networks trained in the Neural Tangent Kernel (NTK) regime. Given that biological neural networks are much wider than their artificial counterparts, we consider NTK regime wide neural networks as a possible model of biological neural networks. Leveraging NTK theory, we show theoretically that gradient descent drives layerwise weight updates that are aligned with their input activity correlations weighted by error, and demonstrate empirically that the result also holds in finite-width wide networks. The alignment result allows us to formulate a family of biologically-motivated, backpropagation-free learning rules that are theoretically equivalent to backpropagation in infinite-width networks. We test these learning rules on benchmark problems in feedforward and recurrent neural networks and demonstrate, in wide networks, comparable performance to backpropagation. The proposed rules are particularly effective in low data regimes, which are common in biological learning settings.
    Sound and Complete Neural Network Repair with Minimality and Locality Guarantees. (arXiv:2110.07682v2 [cs.LG] UPDATED)
    We present a novel methodology for repairing neural networks that use ReLU activation functions. Unlike existing methods that rely on modifying the weights of a neural network which can induce a global change in the function space, our approach applies only a localized change in the function space while still guaranteeing the removal of the buggy behavior. By leveraging the piecewise linear nature of ReLU networks, our approach can efficiently construct a patch network tailored to the linear region where the buggy input resides, which when combined with the original network, provably corrects the behavior on the buggy input. Our method is both sound and complete -- the repaired network is guaranteed to fix the buggy input, and a patch is guaranteed to be found for any buggy input. Moreover, our approach preserves the continuous piecewise linear nature of ReLU networks, automatically generalizes the repair to all the points including other undetected buggy inputs inside the repair region, is minimal in terms of changes in the function space, and guarantees that outputs on inputs away from the repair region are unaltered. On several benchmarks, we show that our approach significantly outperforms existing methods in terms of locality and limiting negative side effects.
    Majorization-minimization for Sparse Nonnegative Matrix Factorization with the $\beta$-divergence. (arXiv:2207.06316v1 [cs.LG])
    This article introduces new multiplicative updates for nonnegative matrix factorization with the $\beta$-divergence and sparse regularization of one of the two factors (say, the activation matrix). It is well known that the norm of the other factor (the dictionary matrix) needs to be controlled in order to avoid an ill-posed formulation. Standard practice consists in constraining the columns of the dictionary to have unit norm, which leads to a nontrivial optimization problem. Our approach leverages a reparametrization of the original problem into the optimization of an equivalent scale-invariant objective function. From there, we derive block-descent majorization-minimization algorithms that result in simple multiplicative updates for either $\ell_{1}$-regularization or the more "aggressive" log-regularization. In contrast with other state-of-the-art methods, our algorithms are universal in the sense that they can be applied to any $\beta$-divergence (i.e., any value of $\beta$) and that they come with convergence guarantees. We report numerical comparisons with existing heuristic and Lagrangian methods using various datasets: face images, an audio spectrogram, hyperspectral data, and song play counts. We show that our methods obtain solutions of similar quality at convergence (similar objective values) but with significantly reduced CPU times.
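    For orientation, the classical multiplicative updates this paper generalizes look as follows in the Euclidean ($\beta = 2$) special case with $\ell_1$ regularization on the activations. This is the standard heuristic, including the naive column renormalization the abstract calls nontrivial, and not the paper's new algorithm:

        import numpy as np

        def sparse_nmf(V, rank, n_iter=200, lam=0.1, eps=1e-9):
            """Multiplicative updates for ||V - W H||_F^2 + lam * ||H||_1 (beta = 2 case)."""
            m, n = V.shape
            rng = np.random.default_rng(0)
            W = rng.random((m, rank)) + eps
            H = rng.random((rank, n)) + eps
            for _ in range(n_iter):
                H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
                W *= (V @ H.T) / (W @ H @ H.T + eps)
                # heuristic: rescale dictionary columns to unit norm, compensating in H
                norms = np.linalg.norm(W, axis=0, keepdims=True) + eps
                W /= norms
                H *= norms.T
            return W, H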
    Multi-scale Hybrid Vision Transformer for Learning Gastric Cancer Histology. (arXiv:2202.08510v3 [eess.IV] UPDATED)
    Gastric endoscopic screening is an effective way to decide on appropriate gastric cancer (GC) treatment at an early stage, reducing the GC-associated mortality rate. Although artificial intelligence (AI) has brought great promise for assisting pathologists in screening digitized whole slide images, existing AI systems are limited in fine-grained cancer subclassification and have little usability in planning cancer treatment. We propose a practical AI system that enables five subclassifications of GC pathology, which can be directly matched to general GC treatment guidance. The AI system is designed to efficiently differentiate multiple classes of GC through a multi-scale self-attention mechanism using 2-stage hybrid Vision Transformer (ViT) networks, mimicking the way human pathologists understand histology. The AI system demonstrates reliable diagnostic performance by achieving a class-average sensitivity above 0.85 on a total of 1,212 slides from a multicentric cohort. Furthermore, AI-assisted pathologists show significantly improved diagnostic sensitivity, by 12%, in addition to an 18% reduction in screening time compared to human pathologists alone. Our results demonstrate that AI-assisted gastric endoscopic screening has great potential for providing presumptive pathologic opinions and appropriate treatment of gastric cancer in practical clinical settings.
    FedNST: Federated Noisy Student Training for Automatic Speech Recognition. (arXiv:2206.02797v2 [eess.AS] UPDATED)
    Federated Learning (FL) enables training state-of-the-art Automatic Speech Recognition (ASR) models on user devices (clients) in distributed systems, hence preventing transmission of raw user data to a central server. A key challenge facing practical adoption of FL for ASR is obtaining ground-truth labels on the clients. Existing approaches rely on clients to manually transcribe their speech, which is impractical for obtaining large training corpora. A promising alternative is using semi-/self-supervised learning approaches to leverage unlabelled user data. To this end, we propose FedNST, a novel method for training distributed ASR models using private and unlabelled user data. We explore various facets of FedNST, such as training models with different proportions of labelled and unlabelled data, and evaluate the proposed approach on 1173 simulated clients. Evaluating FedNST on LibriSpeech, where 960 hours of speech data is split equally into server (labelled) and client (unlabelled) data, showed a 22.5% relative word error rate reduction (WERR) over a supervised baseline trained only on server data.
    Efficient Augmentation for Imbalanced Deep Learning. (arXiv:2207.06080v1 [cs.LG])
    Deep learning models memorize training data, which hurts their ability to generalize to under-represented classes. We empirically study a convolutional neural network's internal representation of imbalanced image data and measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes. This insight enables us to design an efficient three-phase CNN training framework for imbalanced data. The framework involves training the network end-to-end on imbalanced data to learn accurate feature embeddings, performing data augmentation in the learned embedded space to balance the train distribution, and fine-tuning the classifier head on the embedded balanced training data. We propose Expansive Over-Sampling (EOS) as a data augmentation technique to utilize in the training framework. EOS forms synthetic training instances as convex combinations between the minority class samples and their nearest enemies in the embedded space to reduce the generalization gap. The proposed framework improves the accuracy over leading cost-sensitive and resampling methods commonly used in imbalanced learning. Moreover, it is more computationally efficient than standard data pre-processing methods, such as SMOTE and GAN-based oversampling, as it requires fewer parameters and less training time.
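    The core augmentation step is easy to sketch in NumPy; a hedged re-creation of the idea, not the authors' code (names and defaults are ours):

        import numpy as np

        def expansive_oversample(emb, labels, minority, k=5, n_new=100, seed=0):
            """Synthesize minority-class points as convex combinations between minority
            embeddings and their k nearest 'enemies' (neighbors from other classes)."""
            rng = np.random.default_rng(seed)
            minority_pts = emb[labels == minority]
            enemy_pts = emb[labels != minority]
            synth = []
            while len(synth) < n_new:
                x = minority_pts[rng.integers(len(minority_pts))]
                d = np.linalg.norm(enemy_pts - x, axis=1)
                enemy = enemy_pts[rng.choice(np.argsort(d)[:k])]  # one of the k nearest enemies
                lam = rng.uniform()                               # interpolation weight in [0, 1]
                synth.append(x + lam * (enemy - x))               # convex combination
            return np.stack(synth)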
    Optimal Network Compression. (arXiv:2008.08733v5 [q-fin.RM] UPDATED)
    This paper introduces a formulation of the optimal network compression problem for financial systems. This general formulation is presented for different levels of network compression or rerouting allowed from the initial interbank network. We prove that this problem is, generically, NP-hard. We focus on objective functions generated by systemic risk measures under shocks to the financial network. We use this framework to study the (sub)optimality of the maximally compressed network. We conclude by studying the optimal compression problem for specific networks; this permits us to study, e.g., the so-called robust fragility of certain network topologies more generally as well as the potential benefits and costs of network compression. In particular, under systematic shocks and heterogeneous financial networks the robust fragility results of Acemoglu et al. (2015) no longer hold generally.
    Neural Network Robustness as a Verification Property: A Principled Case Study. (arXiv:2104.01396v2 [cs.LG] UPDATED)
    Neural networks are very successful at detecting patterns in noisy data, and have become the technology of choice in many fields. However, their usefulness is hampered by their susceptibility to adversarial attacks. Recently, many methods for measuring and improving a network's robustness to adversarial perturbations have been proposed, and this growing body of research has given rise to numerous explicit or implicit notions of robustness. Connections between these notions are often subtle, and a systematic comparison between them is missing in the literature. In this paper we begin addressing this gap, by setting up general principles for the empirical analysis and evaluation of a network's robustness as a mathematical property - during the network's training phase, its verification, and after its deployment. We then apply these principles and conduct a case study that showcases the practical benefits of our general approach.
    Masked Autoencoders that Listen. (arXiv:2207.06405v1 [cs.SD])
    This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
    Automated Detection of Label Errors in Semantic Segmentation Datasets via Deep Learning and Uncertainty Quantification. (arXiv:2207.06104v1 [cs.CV])
    In this work, we present, for the first time, a method for detecting label errors in image datasets with semantic segmentation, i.e., pixel-wise class labels. Annotation acquisition for semantic segmentation datasets is time-consuming and requires plenty of human labor. In particular, review processes are time-consuming, and label errors can easily be overlooked by humans. The consequences are biased benchmarks and, in extreme cases, performance degradation of deep neural networks (DNNs) trained on such datasets. DNNs for semantic segmentation yield pixel-wise predictions, which makes detection of label errors via uncertainty quantification a complex task. Uncertainty is particularly pronounced at the transitions between connected components of the prediction. By lifting the consideration of uncertainty to the level of predicted components, we enable the usage of DNNs together with component-level uncertainty quantification for the detection of label errors. We present a principled approach to benchmarking the task of label error detection by dropping labels from the Cityscapes dataset as well as from a dataset extracted from the CARLA driving simulator, where in the latter case we have the labels under control. Our experiments show that our approach is able to detect the vast majority of label errors while controlling the number of false label error detections. Furthermore, we apply our method to semantic segmentation datasets frequently used by the computer vision community and present a collection of label errors along with sample statistics.
    Simplex NeuPL: Any-Mixture Bayes-Optimality in Symmetric Zero-sum Games. (arXiv:2205.15879v3 [cs.AI] UPDATED)
    Learning to play optimally against any mixture over a diverse set of strategies is of important practical interests in competitive games. In this paper, we propose simplex-NeuPL that satisfies two desiderata simultaneously: i) learning a population of strategically diverse basis policies, represented by a single conditional network; ii) using the same network, learn best-responses to any mixture over the simplex of basis policies. We show that the resulting conditional policies incorporate prior information about their opponents effectively, enabling near optimal returns against arbitrary mixture policies in a game with tractable best-responses. We verify that such policies behave Bayes-optimally under uncertainty and offer insights in using this flexibility at test time. Finally, we offer evidence that learning best-responses to any mixture policies is an effective auxiliary task for strategic exploration, which, by itself, can lead to more performant populations.
    On the Opportunities and Risks of Foundation Models. (arXiv:2108.07258v3 [cs.LG] UPDATED)
    AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
    Surrogate Likelihoods for Variational Annealed Importance Sampling. (arXiv:2112.12194v2 [stat.ML] UPDATED)
    Variational inference is a powerful paradigm for approximate Bayesian inference with a number of appealing properties, including support for model learning and data subsampling. By contrast MCMC methods like Hamiltonian Monte Carlo do not share these properties but remain attractive since, contrary to parametric methods, MCMC is asymptotically unbiased. For these reasons researchers have sought to combine the strengths of both classes of algorithms, with recent approaches coming closer to realizing this vision in practice. However, supporting data subsampling in these hybrid methods can be a challenge, a shortcoming that we address by introducing a surrogate likelihood that can be learned jointly with other variational parameters. We argue theoretically that the resulting algorithm permits the user to make an intuitive trade-off between inference fidelity and computational cost. In an extensive empirical comparison we show that our method performs well in practice and that it is well-suited for black-box inference in probabilistic programming frameworks.
    ARMAS: Active Reconstruction of Missing Audio Segments. (arXiv:2111.10891v3 [eess.AS] UPDATED)
    Digital audio signal reconstruction of a lost or corrupt segment using deep learning algorithms has been explored intensively in recent years. Nevertheless, prior traditional methods with linear interpolation, phase coding, and tone insertion techniques are still in vogue. However, we found no research on reconstructing audio signals with a fusion of dithering, steganography, and machine learning regressors. Therefore, this paper proposes the combination of steganography, halftoning (dithering), and state-of-the-art shallow (RF, Random Forest regression) and deep learning (LSTM, Long Short-Term Memory) methods. The results (including comparisons with SPAIN, autoregressive, deep-learning-based, graph-based, and other methods) are evaluated with three different metrics. The observations from the results show that the proposed solution is effective and can enhance the reconstruction of audio signals using the side information (e.g., latent representation for audio inpainting) that steganography provides. Moreover, this paper proposes a novel framework for reconstruction from heavily compressed embedded audio data using halftoning (i.e., dithering) and machine learning, which we term HCR (halftone-based compression and reconstruction). This work may trigger interest in optimizing this approach and/or transferring it to different domains (i.e., image reconstruction). Compared to existing methods, we show improvement in inpainting performance in terms of the signal-to-noise ratio (SNR), the objective difference grade (ODG), and Hansen's audio quality metric.
    The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation. (arXiv:2109.13419v6 [cs.LG] UPDATED)
    Function approximation is widely used in reinforcement learning to handle the computational difficulties associated with very large state spaces. However, function approximation introduces errors which may lead to instabilities when using approximate dynamic programming techniques to obtain the optimal policy. Therefore, techniques such as lookahead for policy improvement and m-step rollout for policy evaluation are used in practice to improve the performance of approximate dynamic programming with function approximation. We quantitatively characterize, for the first time, the impact of lookahead and m-step rollout on the performance of approximate dynamic programming (DP) with function approximation: (i) without a sufficient combination of lookahead and m-step rollout, approximate DP may not converge, (ii) both lookahead and m-step rollout improve the convergence rate of approximate DP, and (iii) lookahead helps mitigate the effect of function approximation and the discount factor on the asymptotic performance of the algorithm. Our results are presented for two approximate DP methods: one which uses least-squares regression to perform function approximation and another which performs several steps of gradient descent of the least-squares objective in each iteration.
    Training Robust Deep Models for Time-Series Domain: Novel Algorithms and Theoretical Analysis. (arXiv:2207.04305v2 [cs.LG] UPDATED)
    Despite the success of deep neural networks (DNNs) for real-world applications over time-series data such as mobile health, little is known about how to train robust DNNs for the time-series domain, due to its unique characteristics compared to image and text data. In this paper, we propose a novel algorithmic framework referred to as RObust Training for Time-Series (RO-TS) to create robust DNNs for time-series classification tasks. Specifically, we formulate a min-max optimization problem over the model parameters by explicitly reasoning about the robustness criteria in terms of additive perturbations to time-series inputs, measured by the global alignment kernel (GAK) based distance. We also show the generality and advantages of our formulation using the summation structure over time-series alignments, by relating both GAK and dynamic time warping (DTW). This problem is an instance of a family of compositional min-max optimization problems, which are challenging and open, with unclear theoretical guarantees. We propose a principled stochastic compositional alternating gradient descent ascent (SCAGDA) algorithm for this family of optimization problems. Unlike traditional methods for time-series that require approximate computation of distance measures, SCAGDA approximates the GAK based distance on the fly using a moving average approach. We theoretically analyze the convergence rate of SCAGDA and provide strong theoretical support for the estimation of the GAK based distance. Our experiments on real-world benchmarks demonstrate that RO-TS creates more robust DNNs when compared to adversarial training using prior methods that rely on data augmentation or new definitions of loss functions. We also demonstrate the importance of GAK for time-series data over the Euclidean distance. The source code of the RO-TS algorithms is available at https://github.com/tahabelkhouja/Robust-Training-for-Time-Series
    Towards Meta-learned Algorithm Selection using Implicit Fidelity Information. (arXiv:2206.03130v2 [cs.LG] UPDATED)
    Automatically selecting the best performing algorithm for a given dataset or ranking multiple algorithms by their expected performance supports users in developing new machine learning applications. Most approaches for this problem rely on pre-computed dataset meta-features and landmarking performances to capture the salient topology of the datasets and those topologies that the algorithms attend to. Landmarking usually exploits cheap algorithms not necessarily in the pool of candidate algorithms to get inexpensive approximations of the topology. While somewhat indicative, hand-crafted dataset meta-features and landmarks are likely insufficient descriptors, strongly depending on the alignment of the topologies that the landmarks and the candidate algorithms search for. We propose IMFAS, a method to exploit multi-fidelity landmarking information directly from the candidate algorithms in the form of non-parametrically non-myopic meta-learned learning curves via LSTMs in a few-shot setting during testing. Using this mechanism, IMFAS jointly learns the topology of the datasets and the inductive biases of the candidate algorithms, without the need to expensively train them to convergence. Our approach produces informative landmarks, easily enriched by arbitrary meta-features at a low computational cost, capable of producing the desired ranking using cheaper fidelities. We additionally show that IMFAS is able to beat Successive Halving with at most 50% of the fidelity sequence during test time.
    Smooth Anonymity for Sparse Binary Matrices. (arXiv:2207.06358v1 [cs.CR])
    When working with user data, providing well-defined privacy guarantees is paramount. In this work we aim to manipulate and share an entire sparse dataset with a third party privately. Differential privacy has emerged as the gold standard of privacy; however, when it comes to sharing sparse datasets, we prove, as one of our main results, that \emph{any} differentially private mechanism that maintains a reasonable similarity with the initial dataset is doomed to have a very weak privacy guarantee. Hence we opt for other privacy notions, such as $k$-anonymity, that are better at preserving utility in this context. In this work we present a variation of $k$-anonymity, which we call smooth $k$-anonymity, and design simple algorithms that efficiently provide smooth $k$-anonymity. We further perform an empirical evaluation to back our theoretical guarantees, and show that our algorithm improves performance in downstream machine learning tasks on anonymized data.
    Tuning the Geometry of Graph Neural Networks. (arXiv:2207.05887v1 [cs.LG])
    By recursively summing node features over entire neighborhoods, spatial graph convolution operators have been heralded as key to the success of Graph Neural Networks (GNNs). Yet, despite the multiplication of GNN methods across tasks and applications, the impact of this aggregation operation on their performance has yet to be extensively analysed. In fact, while efforts have mostly focused on optimizing the architecture of the neural network, fewer works have attempted to characterize (a) the different classes of spatial convolution operators, (b) how the choice of a particular class relates to properties of the data, and (c) its impact on the geometry of the embedding space. In this paper, we propose to answer all three questions by dividing existing operators into two main classes (symmetrized vs. row-normalized spatial convolutions), and show how these translate into different implicit biases on the nature of the data. Finally, we show that this aggregation operator is in fact tunable, and we make explicit the regimes in which certain choices of operators -- and therefore, embedding geometries -- might be more appropriate.
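    The two operator classes the abstract contrasts are easy to write down; a small NumPy sketch (self-loops added, a common convention):

        import numpy as np

        def convolution_operators(A):
            """Return the symmetrized operator D^{-1/2} A D^{-1/2} and the
            row-normalized (random-walk) operator D^{-1} A for adjacency A."""
            A = A + np.eye(A.shape[0])          # add self-loops
            d = A.sum(axis=1)
            sym = A / np.sqrt(np.outer(d, d))   # element-wise A_ij / sqrt(d_i * d_j)
            rw = A / d[:, None]                 # each row sums to one
            return sym, rw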
    SURIMI: Supervised Radio Map Augmentation with Deep Learning and a Generative Adversarial Network for Fingerprint-based Indoor Positioning. (arXiv:2207.06120v1 [eess.SP])
    Indoor positioning based on machine learning has drawn increasing attention in both academia and industry, as meaningful information can be extracted from the reference data. Many researchers are using supervised, semi-supervised, and unsupervised machine learning models to reduce the positioning error and offer reliable solutions to end-users. In this article, we propose a new architecture combining a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and a Generative Adversarial Network (GAN) in order to increase the training data and thus improve the position accuracy. The proposed combination of supervised and unsupervised models was tested on 17 public datasets, providing an extensive analysis of its performance. As a result, the positioning error was reduced in more than 70% of them.
    Parameterized Convex Universal Approximators for Decision-Making Problems. (arXiv:2201.06298v2 [cs.LG] UPDATED)
    Parameterized max-affine (PMA) and parameterized log-sum-exp (PLSE) networks are proposed for general decision-making problems. The proposed approximators generalize existing convex approximators, namely, max-affine (MA) and log-sum-exp (LSE) networks, by considering function arguments of condition and decision variables and replacing the network parameters of MA and LSE networks with continuous functions with respect to the condition variable. The universal approximation theorem of PMA and PLSE is proven, which implies that PMA and PLSE are shape-preserving universal approximators for parameterized convex continuous functions. Practical guidelines for incorporating deep neural networks within PMA and PLSE networks are provided. A numerical simulation is performed to demonstrate the performance of the proposed approximators. The simulation results support that PLSE outperforms other existing approximators in terms of minimizer and optimal value errors with scalable and efficient computation for high-dimensional cases.
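    For reference, the LSE building block being generalized is a smooth convex approximator; a minimal NumPy sketch (the parameterized PLSE variant would additionally make A and b continuous functions of a condition variable):

        import numpy as np

        def lse(x, A, b, T=1.0):
            """Log-sum-exp network f(x) = T * log(sum_k exp((a_k . x + b_k) / T)),
            a smooth convex upper bound on the max-affine network max_k (a_k . x + b_k)."""
            z = (A @ x + b) / T
            m = z.max()                                   # numerically stable log-sum-exp
            return T * (m + np.log(np.exp(z - m).sum()))

        A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # K = 3 affine pieces (toy values)
        b = np.zeros(3)
        print(lse(np.array([0.5, -0.2]), A, b, T=0.1))        # approaches the max-affine value as T -> 0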
    MRF-UNets: Searching UNet with Markov Random Fields. (arXiv:2207.06168v1 [cs.LG])
    UNet [27] is widely used in semantic segmentation due to its simplicity and effectiveness. However, its manually-designed architecture is applied to a large number of problem settings, either with no architecture optimizations, or with manual tuning, which is time consuming and can be sub-optimal. In this work, firstly, we propose Markov Random Field Neural Architecture Search (MRF-NAS) that extends and improves the recent Adaptive and Optimal Network Width Search (AOWS) method [4] with (i) a more general MRF framework (ii) diverse M-best loopy inference (iii) differentiable parameter learning. This provides the necessary NAS framework to efficiently explore network architectures that induce loopy inference graphs, including loops that arise from skip connections. With UNet as the backbone, we find an architecture, MRF-UNet, that shows several interesting characteristics. Secondly, through the lens of these characteristics, we identify the sub-optimality of the original UNet architecture and further improve our results with MRF-UNetV2. Experiments show that our MRF-UNets significantly outperform several benchmarks on three aerial image datasets and two medical image datasets while maintaining low computational costs. The code is available at: https://github.com/zifuwanggg/MRF-UNets.
    ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech. (arXiv:2207.06389v1 [eess.AS])
    Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the cost of the inherited iterative sampling process hinders their application to text-to-speech deployment. Through a preliminary study on diffusion model parameterization, we find that previous gradient-based TTS models require hundreds or thousands of iterations to guarantee high sample quality, which poses a challenge for accelerating sampling. In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. Unlike previous work estimating the gradient for data density, ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation when accelerating sampling. To tackle the model convergence challenge with decreased diffusion iterations, ProDiff reduces the data variance in the target site via knowledge distillation. Specifically, the denoising model uses the generated mel-spectrogram from an N-step DDIM teacher as the training target and distills the behavior into a new model with N/2 steps. As such, it allows the TTS model to make sharp predictions and further reduces the sampling time by orders of magnitude. Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms, while maintaining sample quality and diversity competitive with state-of-the-art models using hundreds of steps. ProDiff enables a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU, making diffusion models practically applicable to text-to-speech synthesis deployment for the first time. Our extensive ablation studies demonstrate that each design choice in ProDiff is effective, and we further show that ProDiff can be easily extended to the multi-speaker setting. Audio samples are available at https://ProDiff.github.io/
    Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling. (arXiv:2207.04588v2 [stat.ML] UPDATED)
    Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. When training cross-study replicable prediction models, it is critical to decide between merging and treating the studies separately. We study boosting algorithms in the presence of potential heterogeneity in predictor-outcome relationships across studies and compare two multi-study learning strategies: 1) merging all the studies and training a single model, and 2) multi-study ensembling, which involves training a separate model on each study and ensembling the resulting predictions. In the regression setting, we provide theoretical guidelines based on an analytical transition point to determine whether it is more beneficial to merge or to ensemble for boosting with linear learners. In addition, we characterize a bias-variance decomposition of estimation error for boosting with component-wise linear learners. We verify the theoretical transition point result in simulation and illustrate how it can guide the decision on merging vs. ensembling in an application to breast cancer gene expression data.
    Hindsight Learning for MDPs with Exogenous Inputs. (arXiv:2207.06272v1 [cs.LG])
    We develop a reinforcement learning (RL) framework for applications that deal with sequential decisions and exogenous uncertainty, such as resource allocation and inventory management. In these applications, the uncertainty is only due to exogenous variables like future demands. A popular approach is to predict the exogenous variables using historical data and then plan with the predictions. However, this indirect approach requires high-fidelity modeling of the exogenous process to guarantee good downstream decision-making, which can be impractical when the exogenous process is complex. In this work we propose an alternative approach based on hindsight learning which sidesteps modeling the exogenous process. Our key insight is that, unlike Sim2Real RL, we can revisit past decisions in the historical data and derive counterfactual consequences for other actions in these applications. Our framework uses hindsight-optimal actions as the policy training signal and has strong theoretical guarantees on decision-making performance. We develop an algorithm using our framework to allocate compute resources for real-world Microsoft Azure workloads. The results show our approach learns better policies than domain-specific heuristics and Sim2Real RL baselines.
    Hierarchy exploitation to detect missing annotations on hierarchical multi-label classification. (arXiv:2207.06237v1 [cs.LG])
    The availability of genomic data has grown exponentially in the last decade, mainly due to the development of new sequencing technologies. Based on the interactions between genes (and gene products) extracted from the increasing genomic data, numerous studies have focused on the identification of associations between genes and functions. While these studies have shown great promise, the problem of annotating genes with functions remains an open challenge. In this work, we present a method to detect missing annotations in hierarchical multi-label classification datasets. We propose a method that exploits the class hierarchy by computing aggregated probabilities over the paths of classes from the leaves to the root for each instance. The proposed method is presented in the context of predicting missing gene function annotations, where these aggregated probabilities are further used to select a set of annotations to be verified through in vivo experiments. The experiments on Oryza sativa Japonica, a variety of rice, showcase that incorporating the hierarchy of classes into the method often improves the predictive performance, and our proposed method yields superior results when compared to competitor methods from the literature.
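    The core mechanism, aggregating class probabilities along leaf-to-root paths and flagging high-scoring unannotated leaves, fits in a short sketch. The toy hierarchy, the mean as aggregation function, and the 0.7 threshold below are assumptions, not the paper's exact choices.

```python
# Hedged sketch: score each leaf by aggregating predicted probabilities along
# its path to the root, then flag unannotated leaves with high path scores.
import numpy as np

parent = {"leaf_a": "mid", "leaf_b": "mid", "mid": "root", "root": None}  # toy hierarchy

def path_to_root(cls):
    path = []
    while cls is not None:
        path.append(cls)
        cls = parent[cls]
    return path

def aggregated_path_score(probs, leaf):
    # probs: dict class -> predicted probability for one instance.
    # Mean over the leaf-to-root path is one plausible aggregation choice.
    return np.mean([probs[c] for c in path_to_root(leaf)])

probs = {"leaf_a": 0.9, "leaf_b": 0.1, "mid": 0.8, "root": 1.0}   # toy predictions
annotated = {"leaf_b"}                                            # existing annotations
for leaf in ("leaf_a", "leaf_b"):
    score = aggregated_path_score(probs, leaf)
    if leaf not in annotated and score > 0.7:
        print(f"candidate missing annotation: {leaf} (path score {score:.2f})")
```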
    High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms. (arXiv:2207.06028v1 [cs.LG])
    Hyperparameters in machine learning (ML) have received a fair amount of attention, and hyperparameter tuning has come to be regarded as an important step in the ML pipeline. But just how useful is said tuning? While smaller-scale experiments have been previously conducted, herein we carry out a large-scale investigation, specifically, one involving 26 ML algorithms, 250 datasets (regression and both binary and multinomial classification), 6 score metrics, and 28,857,600 algorithm runs. Analyzing the results, we conclude that for many ML algorithms we should not expect considerable gains from hyperparameter tuning on average; however, there may be some datasets for which default hyperparameters perform poorly, the latter being truer for some algorithms than others. By defining a single hp_score value, which combines an algorithm's accumulated statistics, we are able to rank the 26 ML algorithms from those expected to gain the most from hyperparameter tuning to those expected to gain the least. We believe such a study may serve ML practitioners at large.
    Electromagnetic Source Imaging via a Data-Synthesis-Based Convolutional Encoder-Decoder Network. (arXiv:2010.12876v6 [eess.IV] UPDATED)
    Electromagnetic source imaging (ESI) requires solving a highly ill-posed inverse problem. To seek a unique solution, traditional ESI methods impose various forms of priors that may not accurately reflect the actual source properties, which may hinder their broad applications. To overcome this limitation, in this paper a novel data-synthesized spatio-temporally convolutional encoder-decoder network method termed DST-CedNet is proposed for ESI. DST-CedNet recasts ESI as a machine learning problem, where discriminative learning and latent-space representations are integrated in a convolutional encoder-decoder network (CedNet) to learn a robust mapping from the measured electroencephalography/magnetoencephalography (E/MEG) signals to the brain activity. In particular, by incorporating prior knowledge regarding dynamical brain activities, a novel data synthesis strategy is devised to generate large-scale samples for effectively training CedNet. This stands in contrast to traditional ESI methods where the prior information is often enforced via constraints primarily aimed for mathematical convenience. Extensive numerical experiments as well as analysis of a real MEG and Epilepsy EEG dataset demonstrate that DST-CedNet outperforms several state-of-the-art ESI methods in robustly estimating source signals under a variety of source configurations.
    Graph Property Prediction on Open Graph Benchmark: A Winning Solution by Graph Neural Architecture Search. (arXiv:2207.06027v1 [cs.LG])
    Targeting two molecular graph datasets and one protein association subgraph dataset in the OGB graph classification task, we design a graph neural network framework for graph classification by introducing PAS (Pooling Architecture Search). At the same time, we improve it based on the GNN topology design method F2GNN to further design the feature selection and fusion strategies, so as to improve the performance of the model on the graph property prediction task while overcoming the over-smoothing problem of deep GNN training. Finally, a performance breakthrough is achieved on these three datasets, significantly better than other methods with fixed aggregation functions. This proves that the NAS method has high generalization ability across multiple tasks and demonstrates the advantage of our method in processing graph property prediction tasks.
    Deep Transformer Model with Pre-Layer Normalization for COVID-19 Growth Prediction. (arXiv:2207.06356v1 [cs.LG])
    Coronavirus disease, or COVID-19, is an infectious disease caused by the SARS-CoV-2 virus. The first confirmed case caused by this virus was found at the end of December 2019 in Wuhan City, China. The disease then spread throughout the world, including Indonesia, and COVID-19 was designated a global pandemic by the WHO. The growth of COVID-19 cases, especially in Indonesia, can be predicted using several approaches, such as the Deep Neural Network (DNN). One DNN model that can be used is the Deep Transformer, which can predict time series. The model is trained under several test scenarios to find the best model, first by selecting the best hyperparameters. Further evaluation is then carried out using these best hyperparameters while varying the number of prediction days, the optimizer, and the number of features, and comparing against the former models, the Long Short-Term Memory (LSTM) and the Recurrent Neural Network (RNN). All evaluations used the Mean Absolute Percentage Error (MAPE) metric. Based on the results, the Deep Transformer produces the best results when using Pre-Layer Normalization and predicting one day ahead, with a MAPE value of 18.83. Furthermore, the model trained with the Adamax optimizer obtains the best performance among the tested optimizers. The performance of the Deep Transformer also exceeds the other tested models, LSTM and RNN.
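    Pre-Layer Normalization, the design the paper finds best, is available directly in PyTorch via the norm_first flag of nn.TransformerEncoderLayer. Below is a minimal sketch of such an encoder wired up for one-day-ahead forecasting; the dimensions and the single linear head are illustrative assumptions, not the paper's configuration.

```python
# Pre-LN transformer encoder in PyTorch: norm_first=True applies LayerNorm
# before the attention and feed-forward sublayers.
import torch
import torch.nn as nn

pre_ln_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128,
    norm_first=True,       # Pre-Layer Normalization
    batch_first=True,
)
encoder = nn.TransformerEncoder(pre_ln_layer, num_layers=2)
head = nn.Linear(64, 1)                       # one-day-ahead case-count prediction

x = torch.randn(8, 30, 64)                    # batch of 30-day feature windows
y_hat = head(encoder(x)[:, -1])               # forecast from the last position
print(y_hat.shape)                            # torch.Size([8, 1])
```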
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v1 [cs.LG])
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
    TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels. (arXiv:2207.06343v1 [cs.LG])
    State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.
    Continual Learning with Deep Learning Methods in an Application-Oriented Context. (arXiv:2207.06233v1 [cs.LG])
    Abstract knowledge is deeply grounded in many computer-based applications. An important research area of Artificial Intelligence (AI) deals with the automatic derivation of knowledge from data. Machine learning offers the corresponding algorithms. One area of research focuses on the development of biologically inspired learning algorithms. The respective machine learning methods are based on neurological concepts so that they can systematically derive knowledge from data and store it. One type of machine learning algorithm that can be categorized as a "deep learning" model is the Deep Neural Network (DNN). DNNs consist of multiple artificial neurons arranged in layers that are trained using the backpropagation algorithm. These deep learning methods exhibit amazing capabilities for inferring and storing complex knowledge from high-dimensional data. However, DNNs are affected by a problem that prevents new knowledge from being added to an existing base. The ability to continuously accumulate knowledge is an important factor that contributed to evolution and is therefore a prerequisite for the development of strong AIs. The so-called "catastrophic forgetting" (CF) effect causes DNNs to immediately lose already derived knowledge after a few training iterations on a new data distribution. Only an energetically expensive retraining with the joint data distribution of past and new data enables the abstraction of the entire new set of knowledge. In order to counteract the effect, various techniques have been and are still being developed with the goal of mitigating or even solving the CF problem. These published CF avoidance studies usually imply the effectiveness of their approaches for various continual learning tasks. This dissertation is set in the context of continual machine learning with deep learning methods. The first part deals with the development of an ...
    Task Agnostic Representation Consolidation: a Self-supervised based Continual Learning Approach. (arXiv:2207.06267v1 [cs.LG])
    Continual learning (CL) over non-stationary data streams remains one of the long-standing challenges in deep neural networks (DNNs) as they are prone to catastrophic forgetting. CL models can benefit from self-supervised pre-training as it enables learning more generalizable task-agnostic features. However, the effect of self-supervised pre-training diminishes as the length of task sequences increases. Furthermore, the domain shift between pre-training data distribution and the task distribution reduces the generalizability of the learned representations. To address these limitations, we propose Task Agnostic Representation Consolidation (TARC), a two-stage training paradigm for CL that intertwines task-agnostic and task-specific learning whereby self-supervised training is followed by supervised learning for each task. To further restrict the deviation from the learned representations in the self-supervised stage, we employ a task-agnostic auxiliary loss during the supervised stage. We show that our training paradigm can be easily added to memory- or regularization-based approaches and provides consistent performance gain across more challenging CL settings. We further show that it leads to more robust and well-calibrated models.
    Continual Meta-Reinforcement Learning for UAV-Aided Vehicular Wireless Networks. (arXiv:2207.06131v1 [cs.LG])
    Unmanned aerial base stations (UABSs) can be deployed in vehicular wireless networks to support applications such as extended sensing via vehicle-to-everything (V2X) services. A key problem in such systems is designing algorithms that can efficiently optimize the trajectory of the UABS in order to maximize coverage. In existing solutions, such optimization is carried out from scratch for any new traffic configuration, often by means of conventional reinforcement learning (RL). In this paper, we propose the use of continual meta-RL as a means to transfer information from previously experienced traffic configurations to new conditions, with the goal of reducing the time needed to optimize the UABS's policy. Adopting the Continual Meta Policy Search (CoMPS) strategy, we demonstrate significant efficiency gains as compared to conventional RL, as well as to naive transfer learning methods.
    Learning Approximately Optimal Contracts. (arXiv:1811.06736v2 [cs.GT] UPDATED)
    In principal-agent models, a principal offers a contract to an agent to perform a certain task. The agent exerts a level of effort that maximizes her utility. The principal is oblivious to the agent's chosen level of effort, and conditions her wage only on possible outcomes. In this work, we consider a model in which the principal is unaware of the agent's utility and action space: she sequentially offers contracts to identical agents, and observes the resulting outcomes. We present an algorithm for learning the optimal contract under mild assumptions. We bound the number of samples needed for the principal to obtain a contract that is within $\epsilon$ of her optimal net profit for every $\epsilon>0$. Our results are robust even when considering risk-averse agents. Furthermore, we show that when there are only two possible outcomes or the agent is risk-neutral, the algorithm's outcome approximates the optimal contract described in the classical theory.
    Beyond Hard Labels: Investigating data label distributions. (arXiv:2207.06224v1 [cs.CV])
    High-quality data is a key aspect of modern machine learning. However, labels generated by humans suffer from issues like label noise and class ambiguities. We raise the question of whether hard labels are sufficient to represent the underlying ground-truth distribution in the presence of this inherent imprecision. Therefore, we compare the disparity of learning with hard and soft labels quantitatively and qualitatively for a synthetic and a real-world dataset. We show that the application of soft labels leads to improved performance and yields a more regular structure of the internal feature space.
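    The hard-vs-soft comparison reduces to swapping the target passed to the loss; PyTorch's F.cross_entropy accepts class-probability targets directly (PyTorch >= 1.10). The logits and toy annotator distributions below are illustrative assumptions.

```python
# Minimal sketch: training with soft label distributions instead of hard labels.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3, requires_grad=True)

hard = torch.tensor([0, 2, 1, 0])                         # majority-vote labels
soft = torch.tensor([[0.7, 0.2, 0.1],                     # empirical annotator
                     [0.1, 0.3, 0.6],                     # distributions per example
                     [0.2, 0.6, 0.2],
                     [0.9, 0.05, 0.05]])

loss_hard = F.cross_entropy(logits, hard)
loss_soft = F.cross_entropy(logits, soft)                 # soft-label objective
print(loss_hard.item(), loss_soft.item())
```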
    URANUS: Radio Frequency Tracking, Classification and Identification of Unmanned Aircraft Vehicles. (arXiv:2207.06025v1 [cs.LG])
    Safety and security issues for Critical Infrastructures (CI) are growing as attackers increasingly adopt drones as an attack vector flying in sensitive airspace, such as airports, military bases, city centres, and crowded places. The rapid proliferation of drones for merchandise shipping, recreational activities, and other commercial applications poses severe concerns for CI operators due to the violations and invasions of restricted airspaces. A cost-effective framework is needed to detect, classify and identify the presence of drones in such cases. In this paper, we demonstrate that CI operators can timely and efficiently detect, classify and identify drones (multi-copter and fixed-wing) invading no-drone zones, with an inexpensive RF-based detection framework named URANUS. Our experiments show that, using a Random Forest classifier, we achieved a classification accuracy of 93.4% in the classification of one or multiple specific drones. The tracking performance achieves an average MAE = 0.3650, MSE = 0.9254, and R^2 = 0.7502. Our framework has been released as open source, to enable the community to verify our findings and use URANUS as a ready-to-use basis for further analysis.
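    The classification stage described here is a standard Random Forest pipeline; a minimal sklearn sketch follows, with synthetic features standing in for the paper's RF signal statistics (the feature count, class count, and separability shift are all assumptions).

```python
# Sketch of an RF-feature drone classifier (synthetic stand-in data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 1000, 12                                  # 12 hypothetical RF-signal features
X = rng.normal(size=(n, d))
y = rng.integers(0, 3, size=n)                   # 3 drone classes (toy labels)
X[y == 1] += 0.8                                 # make one class separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```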
    A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP. (arXiv:2207.06147v1 [cs.LG])
    As an important framework for safe Reinforcement Learning, the Constrained Markov Decision Process (CMDP) has been extensively studied in the recent literature. However, despite the rich results under various on-policy learning settings, some essential understanding of offline CMDP problems is still lacking, in terms of both algorithm design and the information-theoretic sample complexity lower bound. In this paper, we focus on solving CMDP problems where only offline data are available. By adopting the concept of the single-policy concentrability coefficient $C^*$, we establish an $\Omega\left(\frac{\min\left\{|\mathcal{S}||\mathcal{A}|,|\mathcal{S}|+I\right\} C^*}{(1-\gamma)^3\epsilon^2}\right)$ sample complexity lower bound for the offline CMDP problem, where $I$ stands for the number of constraints. By introducing a simple but novel deviation control mechanism, we propose a near-optimal primal-dual learning algorithm called DPDL. This algorithm provably guarantees zero constraint violation, and its sample complexity matches the above lower bound except for an $\tilde{\mathcal{O}}((1-\gamma)^{-1})$ factor. A comprehensive discussion on how to deal with the unknown constant $C^*$ and the potential asynchronous structure of the offline dataset is also included.
    Machine Learning Application in Health. (arXiv:2207.06228v1 [cs.LG])
    Coronavirus can be transmitted through the air by close proximity to infected persons. Commercial aircraft are a likely way to both transmit the virus among passengers and move the virus between locations. Learning about where and how coronavirus has entered the United States will help further our understanding of the disease. Air travelers can come from countries or areas with a high rate of infection and may very well have been exposed to the virus; therefore, as they reach the United States, the virus could easily spread. In our analysis, we utilized machine learning to determine whether the number of flights into the Washington DC Metro Area had an effect on the number of cases and deaths reported in the city and surrounding area.
    Stochastic Functional Analysis and Multilevel Vector Field Anomaly Detection. (arXiv:2207.06229v1 [stat.ML])
    Massive vector field datasets are common in multi-spectral optical and radar sensors and modern multimodal MRI data, among many other areas of application. In this paper we develop a novel stochastic functional analysis approach for detecting anomalies based on the covariance structure of nominal stochastic behavior across a domain with multi-band vector field data. An optimal vector field Karhunen-Loeve (KL) expansion is applied to such random field data. A series of multilevel orthogonal functional subspaces is constructed from the geometry of the domain, adapted from the KL expansion. Detection is achieved by examining the projection of the random field on the multilevel basis. The anomalies can be quantified in suitable normed spaces based on local and global information. In addition, reliable hypothesis tests are formed with controllable distributions that do not require prior assumptions on probability distributions of the data. Only the covariance function is needed, which makes for significantly simpler estimates. Furthermore this approach allows stochastic vector-based fusion of anomalies without any loss of information. The method is applied to the important problem of deforestation and degradation in the Amazon forest. This is a complex non-monotonic process, as forests can degrade and recover. This particular problem is further compounded by the presence of clouds that are hard to remove with current masking algorithms. Using multi-spectral satellite data from Sentinel 2, the multilevel filter is constructed and anomalies are treated as deviations from the initial state of the forest. Forest anomalies are quantified with robust hypothesis tests and distinguished from false variations such as cloud cover. Our approach shows the advantage of using multiple bands of data in a vectorized complex, leading to better anomaly detection beyond the capabilities of scalar-based methods.
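    The detection principle, projecting a random field onto a basis learned from nominal covariance and scoring what falls outside it, can be sketched with a plain eigendecomposition (numpy below; the dimensions, covariance model, and choice of 10 modes are assumptions, and the paper's multilevel construction is omitted).

```python
# Hedged sketch: KL (eigen) basis from nominal covariance, anomaly score as
# the energy of the residual outside the dominant nominal subspace.
import numpy as np

rng = np.random.default_rng(0)
D = 50                                            # flattened vector-field dimension
nominal = rng.multivariate_normal(np.zeros(D), np.eye(D) * 0.1 + 0.05, size=500)

cov = np.cov(nominal, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
basis = eigvecs[:, -10:]                          # top-10 KL modes of nominal behavior

def anomaly_score(x):
    residual = x - basis @ (basis.T @ x)          # component outside the nominal subspace
    return np.linalg.norm(residual)

normal_sample = nominal[0]
anomalous_sample = normal_sample + rng.normal(scale=2.0, size=D)
print(anomaly_score(normal_sample), anomaly_score(anomalous_sample))
```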
    Distilled Non-Semantic Speech Embeddings with Binary Neural Networks for Low-Resource Devices. (arXiv:2207.05784v1 [cs.SD])
    This work introduces BRILLsson, a novel binary neural network-based representation learning model for a broad range of non-semantic speech tasks. We train the model with knowledge distillation from a large and real-valued TRILLsson model with only a fraction of the dataset used to train TRILLsson. The resulting BRILLsson models are only 2MB in size with a latency of less than 8ms, making them suitable for deployment in low-resource devices such as wearables. We evaluate BRILLsson on eight benchmark tasks (including but not limited to spoken language identification, emotion recognition, health condition diagnosis, and keyword spotting), and demonstrate that our proposed ultra-light and low-latency models perform as well as large-scale models.
    Look-ups are not (yet) all you need for deep learning inference. (arXiv:2207.05808v1 [cs.LG])
    Fast approximations to matrix multiplication have the potential to dramatically reduce the cost of neural network inference. Recent work on approximate matrix multiplication proposed to replace costly multiplications with table-lookups by fitting a fast hash function from training data. In this work, we propose improvements to this previous work, targeted to the deep learning inference setting, where one has access to both training data and fixed (already learned) model weight matrices. We further propose a fine-tuning procedure for accelerating entire neural networks while minimizing loss in accuracy. Finally, we analyze the proposed method on a simple image classification task. While we show improvements to prior work, overall classification accuracy remains substantially diminished compared to exact matrix multiplication. Our work, despite this negative result, points the way towards future efforts to accelerate inner products with fast nonlinear hashing methods.
    A Transfer Learning Based Model for Text Readability Assessment in German. (arXiv:2207.06265v1 [cs.CL])
    Text readability assessment has a wide range of applications for different target people, from language learners to people with disabilities. The fast pace of textual content production on the web makes it impossible to measure text complexity without the benefit of machine learning and natural language processing techniques. Although various research has addressed the readability assessment of English text in recent years, there is still room for improvement of the models for other languages. In this paper, we propose a new model for text complexity assessment for German text based on transfer learning. Our results show that the model outperforms more classical solutions based on linguistic feature extraction from the input text. The best model, based on the pre-trained BERT language model, achieved a Root Mean Square Error (RMSE) of 0.483.
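    A minimal Hugging Face sketch of this transfer-learning setup follows: a pre-trained German BERT with a single regression output (num_labels=1 makes the head use an MSE loss). The model checkpoint, example sentences, and complexity scores are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: fine-tuning a German BERT for readability regression.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-german-cased", num_labels=1, problem_type="regression"
)

texts = ["Der Hund läuft im Park.",
         "Die epistemologischen Implikationen sind vielschichtig."]
scores = torch.tensor([[1.2], [4.7]])            # toy complexity labels

batch = tokenizer(texts, padding=True, return_tensors="pt")
out = model(**batch, labels=scores)              # out.loss is MSE for regression
out.loss.backward()                              # one fine-tuning step would follow
print(out.logits.squeeze(-1))                    # predicted complexity scores
```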
    Contextual Bandits with Large Action Spaces: Made Practical. (arXiv:2207.05836v1 [cs.LG])
    A central problem in sequential decision making is to develop algorithms that are practical and computationally efficient, yet support the use of flexible, general-purpose models. Focusing on the contextual bandit problem, recent progress provides provably efficient algorithms with strong empirical performance when the number of possible alternatives ("actions") is small, but guarantees for decision making in large, continuous action spaces have remained elusive, leading to a significant gap between theory and practice. We present the first efficient, general-purpose algorithm for contextual bandits with continuous, linearly structured action spaces. Our algorithm makes use of computational oracles for (i) supervised learning, and (ii) optimization over the action space, and achieves sample complexity, runtime, and memory independent of the size of the action space. In addition, it is simple and practical. We perform a large-scale empirical evaluation, and show that our approach typically enjoys superior performance and efficiency compared to standard baselines.
    Exploring Adversarial Examples and Adversarial Robustness of Convolutional Neural Networks by Mutual Information. (arXiv:2207.05756v1 [cs.LG])
    A counter-intuitive property of convolutional neural networks (CNNs) is their inherent susceptibility to adversarial examples, which severely hinders the application of CNNs in security-critical fields. Adversarial examples are similar to original examples but contain malicious perturbations. Adversarial training is a simple and effective training method to improve the robustness of CNNs to adversarial examples. The mechanisms behind adversarial examples and adversarial training are worth exploring. Therefore, this work investigates similarities and differences between two types of CNNs (normal and robust ones) in information extraction by observing trends in the mutual information. We show that 1) the amount of mutual information that CNNs extract from original and adversarial examples is almost identical, whether CNNs are trained normally or adversarially; the reason adversarial examples mislead CNNs may be that they contain more texture-based information about other categories; 2) compared with normal training, adversarial training is more difficult and the amount of information extracted by robust CNNs is smaller; 3) CNNs trained with different methods have different preferences for certain types of information; normally trained CNNs tend to extract texture-based information from the inputs, while adversarially trained models prefer shape-based information. Furthermore, we also analyze the mutual information estimators used in this work, kernel density estimation and binning methods, and find that these estimators outline the geometric properties of the middle layer's output to a certain extent.
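    For readers unfamiliar with the adversarial examples the paper analyzes, the standard FGSM construction (a small step in the direction of the loss gradient's sign) is a few lines of PyTorch; the toy CNN and epsilon below are illustrative assumptions, and this is not the paper's estimator code.

```python
# Minimal FGSM sketch: adversarial example = input + epsilon * sign(grad of loss).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 32 * 32, 10))

x = torch.rand(1, 3, 32, 32, requires_grad=True)   # toy image in [0, 1]
y = torch.tensor([3])

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

epsilon = 8 / 255
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
print("prediction flipped:",
      model(x).argmax().item() != model(x_adv).argmax().item())
```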
    Differentially Private Linear Bandits with Partial Distributed Feedback. (arXiv:2207.05827v1 [cs.LG])
    In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) where an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but often leads to privacy concerns. To tackle this problem, we consider differentially private distributed linear bandits, where only a subset of users from the population are selected (called clients) to participate in the learning process and the central server learns the global model from such partial feedback by iteratively aggregating these clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which can be naturally integrated with popular differential privacy (DP) models (including central DP, local DP, and shuffle DP). Furthermore, we prove that DP-DPE achieves both sublinear regret and sublinear communication cost. Interestingly, DP-DPE also achieves privacy protection "for free" in the sense that the additional cost due to privacy guarantees is a lower-order additive term. In addition, as a by-product of our techniques, the same results of "free" privacy can also be achieved for the standard differentially private linear bandits. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE.
    Game of Trojans: A Submodular Byzantine Approach. (arXiv:2207.05937v1 [cs.LG])
    Machine learning models in the wild have been shown to be vulnerable to Trojan attacks during training. Although many detection mechanisms have been proposed, strong adaptive attackers have been shown to be effective against them. In this paper, we aim to answer the questions considering an intelligent and adaptive adversary: (i) What is the minimal amount of instances required to be Trojaned by a strong attacker? and (ii) Is it possible for such an attacker to bypass strong detection mechanisms? We provide an analytical characterization of adversarial capability and strategic interactions between the adversary and detection mechanism that take place in such models. We characterize adversary capability in terms of the fraction of the input dataset that can be embedded with a Trojan trigger. We show that the loss function has a submodular structure, which leads to the design of computationally efficient algorithms to determine this fraction with provable bounds on optimality. We propose a Submodular Trojan algorithm to determine the minimal fraction of samples to inject a Trojan trigger. To evade detection of the Trojaned model, we model strategic interactions between the adversary and Trojan detection mechanism as a two-player game. We show that the adversary wins the game with probability one, thus bypassing detection. We establish this by proving that output probability distributions of a Trojan model and a clean model are identical when following the Min-Max (MM) Trojan algorithm. We perform extensive evaluations of our algorithms on MNIST, CIFAR-10, and EuroSAT datasets. The results show that (i) with Submodular Trojan algorithm, the adversary needs to embed a Trojan trigger into a very small fraction of samples to achieve high accuracy on both Trojan and clean samples, and (ii) the MM Trojan algorithm yields a trained Trojan model that evades detection with probability 1.
    On NeuroSymbolic Solutions for PDEs. (arXiv:2207.06240v1 [cs.LG])
    Physics Informed Neural Networks (PINNs) have gained immense popularity as an alternative method for numerically solving PDEs. Despite their empirical success, we are still building an understanding of the convergence properties of training on such constraints with gradient descent. It is known that, in the absence of an explicit inductive bias, neural networks can struggle to learn or approximate even simple and well-known functions in a sample-efficient manner. Thus the numerical approximation induced from few collocation points may not generalize over the entire domain. Meanwhile, a symbolic form can exhibit good generalization, with interpretability as a useful byproduct. However, symbolic approximations can struggle to be simultaneously concise and accurate. Therefore, in this work we explore a NeuroSymbolic approach to approximating the solution of PDEs. We observe that our approach works for several simple cases. We illustrate its efficacy on the Navier-Stokes Kovasznay flow, where multiple physical quantities of interest are governed by a non-linear coupled PDE system. Domain splitting is now becoming a popular trick to help PINNs approximate complex functions, and we observe that a NeuroSymbolic approach can help with such complex functions as well. We demonstrate a domain-splitting-assisted NeuroSymbolic approach on a temporally varying two-dimensional Burgers' equation. Finally, we consider the scenario where PINNs have to be solved for parameterized PDEs, for changing initial-boundary conditions and changes in the coefficients of the PDEs. Hypernetworks have been shown to hold promise in overcoming these challenges. We show that one can design Hyper-NeuroSymbolic Networks that combine the benefits of speed and increased accuracy, and we observe that the NeuroSymbolic approximations are consistently 1-2 orders of magnitude better than purely neural or symbolic approximations.
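    As background for the PINN baseline the paper builds on, here is a minimal residual-loss sketch for the 1D viscous Burgers' equation (a toy with only the interior residual; the network size, viscosity, and collocation sampling are assumptions, and boundary/initial terms are omitted).

```python
# Minimal PINN sketch for u_t + u * u_x - nu * u_xx = 0, derivatives via autograd.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
nu = 0.01

def pde_residual(x, t):
    x.requires_grad_(True); t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(100):                                 # collocation-point training
    x = torch.rand(256, 1) * 2 - 1                   # x in [-1, 1]
    t = torch.rand(256, 1)
    loss = pde_residual(x, t).pow(2).mean()          # + boundary/initial terms in practice
    opt.zero_grad(); loss.backward(); opt.step()
print("residual loss:", loss.item())
```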
    Is Appearance Free Action Recognition Possible?. (arXiv:2207.06261v1 [cs.CV])
    Intuition might suggest that motion and dynamic information are key to video-based action recognition. In contrast, there is evidence that state-of-the-art deep-learning video understanding architectures are biased toward static information available in single frames. Presently, a methodology and corresponding dataset to isolate the effects of dynamic information in video are missing. Their absence makes it difficult to understand how well contemporary architectures capitalize on dynamic vs. static information. We respond with a novel Appearance Free Dataset (AFD) for action recognition. AFD is devoid of static information relevant to action recognition in a single frame. Modeling of the dynamics is necessary for solving the task, as the action is only apparent through consideration of the temporal dimension. We evaluated 11 contemporary action recognition architectures on AFD as well as its related RGB video. Our results show a notable decrease in performance for all architectures on AFD compared to RGB. We also conducted a complementary study with humans, which shows that their recognition accuracy on AFD and RGB is very similar and much better than that of the evaluated architectures on AFD. Our results motivate a novel architecture that revives explicit recovery of optical flow, within a contemporary design, for best performance on AFD and RGB.
    dpart: Differentially Private Autoregressive Tabular, a General Framework for Synthetic Data Generation. (arXiv:2207.05810v1 [cs.LG])
    We propose a general, flexible, and scalable framework dpart, an open source Python library for differentially private synthetic data generation. Central to the approach is autoregressive modelling -- breaking the joint data distribution to a sequence of lower-dimensional conditional distributions, captured by various methods such as machine learning models (logistic/linear regression, decision trees, etc.), simple histogram counts, or custom techniques. The library has been created with a view to serve as a quick and accessible baseline as well as to accommodate a wide audience of users, from those making their first steps in synthetic data generation, to more experienced ones with domain expertise who can configure different aspects of the modelling and contribute new methods/mechanisms. Specific instances of dpart include Independent, an optimized version of PrivBayes, and a newly proposed model, dp-synthpop. Code: https://github.com/hazy/dpart
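    The autoregressive factorization at the heart of this approach, p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2), can be sketched generically; the code below uses plain sklearn conditionals and adds no DP noise, so it illustrates the decomposition only, not dpart's actual API or privacy mechanisms.

```python
# Schematic sketch of autoregressive synthetic data generation (no DP noise).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
data[:, 1] += 0.8 * data[:, 0]                       # correlated columns
data[:, 2] += 0.5 * data[:, 1]

models, sigmas = [], []
for j in range(1, 3):                                # fit x_j | x_<j
    m = LinearRegression().fit(data[:, :j], data[:, j])
    models.append(m)
    sigmas.append(np.std(data[:, j] - m.predict(data[:, :j])))

n = 500                                              # sample column by column
synth = np.empty((n, 3))
synth[:, 0] = rng.normal(data[:, 0].mean(), data[:, 0].std(), size=n)
for j in range(1, 3):
    mean = models[j - 1].predict(synth[:, :j])
    synth[:, j] = mean + rng.normal(scale=sigmas[j - 1], size=n)

print("real corr:\n", np.corrcoef(data, rowvar=False).round(2))
print("synthetic corr:\n", np.corrcoef(synth, rowvar=False).round(2))
```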
    Exploiting Social Graph Networks for Emotion Prediction. (arXiv:2207.05820v1 [cs.SI])
    Emotion prediction plays an essential role in mental health and emotion-aware computing. The complex nature of emotion, resulting from its dependency on a person's physiological health, mental state, and surroundings, makes its prediction a challenging task. In this work, we utilize mobile sensing data to predict happiness and stress. In addition to a person's physiological features, we also incorporate the environment's impact through weather and the social network. To this end, we leverage phone data to construct social networks and develop a machine learning architecture that aggregates information from multiple users of the graph network and integrates it with the temporal dynamics of the data to predict emotion for all users. The construction of social networks does not incur additional cost in terms of EMAs or data collection from users and does not raise privacy concerns. We propose an architecture that automates the integration of a user's social network into affect prediction and is capable of dealing with the dynamic distribution of real-life social networks, making it scalable to large-scale networks. Our extensive evaluation highlights the improvement provided by the integration of social networks. We further investigate the impact of graph topology on the model's performance.
    Enhanced Security and Privacy via Fragmented Federated Learning. (arXiv:2207.05978v1 [cs.CR])
    In federated learning (FL), a set of participants share updates computed on their local data with an aggregator server that combines updates into a global model. However, reconciling accuracy with privacy and security is a challenge to FL. On the one hand, good updates sent by honest participants may reveal their private local information, whereas poisoned updates sent by malicious participants may compromise the model's availability and/or integrity. On the other hand, enhancing privacy via update distortion damages accuracy, whereas doing so via update aggregation damages security because it does not allow the server to filter out individual poisoned updates. To tackle the accuracy-privacy-security conflict, we propose {\em fragmented federated learning} (FFL), in which participants randomly exchange and mix fragments of their updates before sending them to the server. To achieve privacy, we design a lightweight protocol that allows participants to privately exchange and mix encrypted fragments of their updates so that the server can neither obtain individual updates nor link them to their originators. To achieve security, we design a reputation-based defense tailored for FFL that builds trust in participants and their mixed updates based on the quality of the fragments they exchange and the mixed updates they send. Since the exchanged fragments' parameters keep their original coordinates and attackers can be neutralized, the server can correctly reconstruct a global model from the received mixed updates without accuracy loss. Experiments on four real data sets show that FFL can prevent semi-honest servers from mounting privacy attacks, can effectively counter poisoning attacks and can keep the accuracy of the global model.
    Prediction of the motion of chest internal points using a recurrent neural network trained with real-time recurrent learning for latency compensation in lung cancer radiotherapy. (arXiv:2207.05951v1 [eess.IV])
    During the radiotherapy treatment of patients with lung cancer, the radiation delivered to healthy tissue around the tumor needs to be minimized, which is difficult because of respiratory motion and the latency of linear accelerator systems. In the proposed study, we first use the Lucas-Kanade pyramidal optical flow algorithm to perform deformable image registration of chest computed tomography scan images of four patients with lung cancer. We then track three internal points close to the lung tumor based on the previously computed deformation field and predict their position with a recurrent neural network (RNN) trained using real-time recurrent learning (RTRL) and gradient clipping. The breathing data is quite regular, sampled at approximately 2.5Hz, and includes artificial drift in the spine direction. The amplitude of the motion of the tracked points ranged from 12.0mm to 22.7mm. Finally, we propose a simple method for recovering and predicting 3D tumor images from the tracked points and the initial tumor image based on a linear correspondence model and Nadaraya-Watson non-linear regression. The root-mean-square error, maximum error, and jitter corresponding to the RNN prediction on the test set were smaller than the same performance measures obtained with linear prediction and least mean squares (LMS). In particular, the maximum prediction error associated with the RNN, equal to 1.51mm, is respectively 16.1% and 5.0% lower than the maximum error associated with linear prediction and LMS. The average prediction time per time step with RTRL is equal to 119ms, which is less than the 400ms marker position sampling time. The tumor position in the predicted images appears visually correct, which is confirmed by the high mean cross-correlation between the original and predicted images, equal to 0.955.
    Brick Tic-Tac-Toe: Exploring the Generalizability of AlphaZero to Novel Test Environments. (arXiv:2207.05991v1 [cs.LG])
    Traditional reinforcement learning (RL) environments typically are the same for both the training and testing phases. Hence, current RL methods are largely not generalizable to a test environment which is conceptually similar but different from what the method has been trained on, which we term the novel test environment. As an effort to push RL research towards algorithms which can generalize to novel test environments, we introduce the Brick Tic-Tac-Toe (BTTT) test bed, where the brick position in the test environment is different from that in the training environment. Using a round-robin tournament on the BTTT environment, we show that traditional RL state-search approaches such as Monte Carlo Tree Search (MCTS) and Minimax are more generalizable to novel test environments than AlphaZero is. This is surprising because AlphaZero has been shown to achieve superhuman performance in environments such as Go, Chess and Shogi, which may lead one to think that it performs well in novel test environments. Our results show that BTTT, though simple, is rich enough to explore the generalizability of AlphaZero. We find that merely increasing MCTS lookahead iterations was insufficient for AlphaZero to generalize to some novel test environments. Rather, increasing the variety of training environments helps to progressively improve generalizability across all possible starting brick configurations.
    Competition over data: how does data purchase affect users?. (arXiv:2201.10774v2 [cs.LG] UPDATED)
    As machine learning (ML) is deployed by many competing service providers, the underlying ML predictors also compete against each other, and it is increasingly important to understand the impacts and biases from such competition. In this paper, we study what happens when the competing predictors can acquire additional labeled data to improve their prediction quality. We introduce a new environment that allows ML predictors to use active learning algorithms to purchase labeled data within their budgets while competing against each other to attract users. Our environment models a critical aspect of data acquisition in competing systems which has not been well-studied before. We found that the overall performance of an ML predictor improves when predictors can purchase additional labeled data. Surprisingly, however, the quality that users experience -- i.e. the accuracy of the predictor selected by each user -- can decrease even as the individual predictors get better. We show that this phenomenon naturally arises due to a trade-off whereby competition pushes each predictor to specialize in a subset of the population while data purchase has the effect of making predictors more uniform. We support our findings with both experiments and theories.
    Conditional Energy-Based Models for Implicit Policies: The Gap between Theory and Practice. (arXiv:2207.05824v1 [cs.RO])
    We present our findings in the gap between theory and practice of using conditional energy-based models (EBM) as an implicit representation for behavior-cloned policies. We also clarify several subtle, and potentially confusing, details in previous work in an attempt to help future research in this area. We point out key differences between unconditional and conditional EBMs, and warn that blindly applying training methods for one to the other could lead to undesirable results that do not generalize well. Finally, we emphasize the importance of the Maximum Mutual Information principle as a necessary condition to achieve good generalization in conditional EBMs as implicit models for regression tasks.
    Towards A Holistic View of Bias in Machine Learning: Bridging Algorithmic Fairness and Imbalanced Learning. (arXiv:2207.06084v1 [cs.LG])
    Machine learning (ML) is playing an increasingly important role in rendering decisions that affect a broad range of groups in society. ML models inform decisions in criminal justice, the extension of credit in banking, and the hiring practices of corporations. This posits the requirement of model fairness, which holds that automated decisions should be equitable with respect to protected features (e.g., gender, race, or age) that are often under-represented in the data. We postulate that this problem of under-representation has a corollary to the problem of imbalanced data learning. This class imbalance is often reflected in both classes and protected features. For example, one class (those receiving credit) may be over-represented with respect to another class (those not receiving credit) and a particular group (females) may be under-represented with respect to another group (males). A key element in achieving algorithmic fairness with respect to protected groups is the simultaneous reduction of class and protected group imbalance in the underlying training data, which facilitates increases in both model accuracy and fairness. We discuss the importance of bridging imbalanced learning and group fairness by showing how key concepts in these fields overlap and complement each other; and propose a novel oversampling algorithm, Fair Oversampling, that addresses both skewed class distributions and protected features. Our method: (i) can be used as an efficient pre-processing algorithm for standard ML algorithms to jointly address imbalance and group equity; and (ii) can be combined with fairness-aware learning algorithms to improve their robustness to varying levels of class imbalance. Additionally, we take a step toward bridging the gap between fairness and imbalanced learning with a new metric, Fair Utility, that combines balanced accuracy with fairness.
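    The idea of jointly balancing classes and protected groups can be illustrated with plain random oversampling over (class, group) cells, as sketched below; this is only a hedged stand-in for the paper's Fair Oversampling algorithm, and the data, group labels, and equal-cell target are assumptions.

```python
# Hedged sketch: equalize (class, protected-group) cell counts by oversampling.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 80 + [1] * 20)                     # imbalanced classes
g = rng.integers(0, 2, size=100)                      # protected group labels
X = rng.normal(size=(100, 4))

cells = {(c, grp): np.flatnonzero((y == c) & (g == grp))
         for c in np.unique(y) for grp in np.unique(g)}
target = max(len(idx) for idx in cells.values())      # equalize all cells

resampled = np.concatenate([rng.choice(idx, size=target, replace=True)
                            for idx in cells.values() if len(idx) > 0])
X_bal, y_bal, g_bal = X[resampled], y[resampled], g[resampled]
print(np.unique(np.stack([y_bal, g_bal]), axis=1, return_counts=True)[1])
```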
    Employing Feature Selection Algorithms to Determine the Immune State of Mice with Rheumatoid Arthritis. (arXiv:2207.05882v1 [stat.ML])
    The immune response is a dynamic process by which the body determines whether an antigen is self or nonself. The state of this dynamic process is defined by the relative balance and population of inflammatory and regulatory actors which comprise this decision-making process. The goal of immunotherapy as applied to, e.g., Rheumatoid Arthritis (RA) is then to bias the immune state in favor of the regulatory actors, thereby shutting down autoimmune pathways in the response. While there are several known approaches to immunotherapy, the effectiveness of the therapy will depend on how this intervention alters the evolution of this state. Unfortunately, this process is determined not only by the dynamics of the process, but by the state of the system at the time of intervention, a state which is difficult, if not impossible, to determine prior to application of the therapy.
    Online Active Regression. (arXiv:2207.05945v1 [cs.LG])
    Active regression considers a linear regression problem where the learner receives a large number of data points but can only observe a small number of labels. Since online algorithms can deal with incremental training data and take advantage of low computational cost, we consider an online extension of the active regression problem: the learner receives data points one by one and immediately decides whether it should collect the corresponding labels. The goal is to efficiently maintain the regression of received data points with a small budget of label queries. We propose novel algorithms for this problem under $\ell_p$ loss where $p\in[1,2]$. To achieve a $(1+\epsilon)$-approximate solution, our proposed algorithms only require $\tilde{\mathcal{O}}(\epsilon^{-2} d \log(n\kappa))$ queries of labels, where $n$ is the number of data points and $\kappa$ is a quantity, called the condition number, of the data points. The numerical results verify our theoretical results and show that our methods have comparable performance with offline active regression algorithms.
    Federated Learning for THz Channel Estimation. (arXiv:2207.06017v1 [eess.SP])
    This paper addresses two major challenges in terahertz (THz) channel estimation: the beam-split phenomenon, i.e., beam misalignment because of frequency-independent analog beamformers, and computational complexity because of the usage of an ultra-massive number of antennas to compensate for propagation losses. Data-driven techniques are known to mitigate the complexity of this problem but usually require the transmission of datasets from the users to a central server, entailing huge communication overhead. In this work, we employ federated learning (FL), wherein the users transmit only the model parameters instead of the whole dataset, for THz channel estimation to improve communication efficiency. In order to accurately estimate the channel despite beam-split, we propose a beamspace support alignment technique that does not require additional hardware. Compared to previous works, our method provides higher channel estimation accuracy as well as approximately $68$ times lower communication overhead.
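    The communication-efficiency mechanism is the standard FL pattern: clients send model parameters, not data, and the server averages them. A minimal FedAvg-style sketch follows (toy linear channel estimator, random client data; the paper's beamspace alignment step is omitted).

```python
# Minimal FedAvg-style sketch: local SGD on each client, parameter averaging
# at the server; only state_dicts cross the network.
import copy
import torch
import torch.nn as nn

def client_update(global_model, data, target, lr=0.01, epochs=5):
    model = copy.deepcopy(global_model)               # local copy of the global model
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(model(data), target)
        opt.zero_grad(); loss.backward(); opt.step()
    return model.state_dict()

global_model = nn.Linear(8, 2)                        # toy channel estimator
clients = [(torch.randn(32, 8), torch.randn(32, 2)) for _ in range(4)]

for rnd in range(3):                                  # communication rounds
    states = [client_update(global_model, X, y) for X, y in clients]
    avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}
    global_model.load_state_dict(avg)                 # aggregate parameters only
```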
    Unsupervised Learning for Combinatorial Optimization with Principled Objective Design. (arXiv:2207.05984v1 [cs.LG])
    Using machine learning to solve combinatorial optimization (CO) problems is challenging, especially when the data is unlabeled. This work proposes an unsupervised learning framework for CO problems. Our framework follows a standard relaxation-plus-rounding approach and adopts neural networks to parameterize the relaxed solutions so that simple back-propagation can train the model end-to-end. Our key contribution is the observation that if the relaxed objective satisfies entry-wise concavity, a low optimization loss guarantees the quality of the final integral solutions. This observation significantly broadens the applicability of the previous framework inspired by Erdos' probabilistic method. In particular, this observation can guide the design of objective models in applications where the objectives are not given explicitly but must be modeled beforehand. We evaluate our framework by solving a synthetic graph optimization problem, and two real-world applications including resource allocation in circuit design and approximate computing. Our framework largely outperforms the baselines based on naïve relaxation, reinforcement learning, and Gumbel-softmax tricks.
    Automatic Differentiation: Theory and Practice. (arXiv:2207.06114v1 [cs.LG])
    We present the classical coordinate-free formalism for forward- and backward-mode AD in the real and complex setting. We show how to formally derive the forward and backward formulae for a number of matrix functions starting from basic principles.
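    As a concrete companion to this formalism, forward-mode AD can be implemented in a few lines with dual numbers: each value carries its derivative, and every operation propagates both via the chain rule (a minimal sketch, independent of the paper's coordinate-free treatment).

```python
# Forward-mode AD via dual numbers: propagate (value, derivative) pairs.
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)   # product rule
    __radd__, __rmul__ = __add__, __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)  # chain rule

# d/dx [x * sin(x) + 3x] at x = 2: seed the derivative with 1.
x = Dual(2.0, 1.0)
y = x * sin(x) + 3 * x
print(y.val, y.dot)   # derivative equals sin(2) + 2*cos(2) + 3
```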
    Compactly Restrictable Metric Policy Optimization Problems. (arXiv:2207.05850v1 [math.OC])
    We study policy optimization problems for deterministic Markov decision processes (MDPs) with metric state and action spaces, which we refer to as Metric Policy Optimization Problems (MPOPs). Our goal is to establish theoretical results on the well-posedness of MPOPs that can characterize practically relevant continuous control systems. To do so, we define a special class of MPOPs called Compactly Restrictable MPOPs (CR-MPOPs), which are flexible enough to capture the complex behavior of robotic systems but specific enough to admit solutions using dynamic programming methods such as value iteration. We show how to arrive at CR-MPOPs using forward-invariance. We further show that our theoretical results on CR-MPOPs can be used to characterize feedback linearizable control affine systems.
    Towards understanding how momentum improves generalization in deep learning. (arXiv:2207.05931v1 [cs.LG])
    Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that using momentum can lead to faster convergence rates in various settings, it has also been observed that momentum yields higher generalization. Prior work argues that momentum stabilizes the SGD noise during training and that this leads to higher generalization. In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin. Contrary to GD, which memorizes the small-margin data, GD+M still learns the feature in these data thanks to its historical gradients. Lastly, we empirically validate our theoretical findings.
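    For reference, the two update rules being compared differ only in a velocity term that accumulates historical gradients; the numpy sketch below shows them side by side on a toy quadratic (the objective, learning rate, and momentum coefficient are illustrative assumptions).

```python
# GD vs. heavy-ball GD+M on f(w) = ||w||^2.
import numpy as np

grad = lambda w: 2 * w                     # gradient of f(w) = ||w||^2
w_gd = w_m = np.array([5.0, -3.0])
v = np.zeros(2)
lr, beta = 0.1, 0.9

for t in range(50):
    w_gd = w_gd - lr * grad(w_gd)          # GD
    v = beta * v + grad(w_m)               # GD+M: accumulate historical gradients
    w_m = w_m - lr * v

print("GD:  ", np.linalg.norm(w_gd))
print("GD+M:", np.linalg.norm(w_m))
```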
    AdamNODEs: When Neural ODE Meets Adaptive Moment Estimation. (arXiv:2207.06066v1 [cs.LG])
    Recent work by Xia et al. leveraged the continuous limit of classical momentum-accelerated gradient descent and proposed heavy-ball neural ODEs. While this model offers computational efficiency and high utility over vanilla neural ODEs, it often causes overshooting of the internal dynamics, leading to unstable training. Prior work addresses this issue with ad-hoc approaches, e.g., bounding the internal dynamics using specific activation functions, but the resulting models do not satisfy the exact heavy-ball ODE. In this work, we propose adaptive momentum estimation neural ODEs (AdamNODEs) that adaptively control the acceleration of the classical momentum-based approach. We find that their adjoint states also satisfy AdamODEs and do not require the ad-hoc solutions that prior work employs. In our evaluation, we show that AdamNODEs achieve the lowest training loss and the best efficacy among existing neural ODEs. We also show that AdamNODEs have better training stability than classical momentum-based neural ODEs. This result sheds some light on adapting techniques proposed in the optimization community to further improve the training and inference of neural ODEs. Our code is available at https://github.com/pmcsh04/AdamNODE.
    Reachable Distance Function for KNN Classification. (arXiv:2103.09704v2 [cs.LG] CROSS LISTED)
    Distance functions are a main metric for measuring the affinity between two data points in machine learning. Extant distance functions often provide unreachable distance values in real applications, which can lead to an incorrect measure of the affinity between data points. This paper proposes a reachable distance function for KNN classification. The reachable distance function is not a geometric direct-line distance between two data points. It takes the class attribute of a training dataset into consideration when measuring the affinity between data points. Concretely speaking, the reachable distance between data points combines their class-center distance and real distance. Its shape looks like "Z", and we therefore also call it a Z distance function. In this way, the affinity between data points in the same class is always stronger than that between data points in different classes; that is, intraclass data points are always closer than interclass data points. We evaluated the reachable distance in experiments and demonstrated that the proposed distance function achieves better performance in KNN classification.
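    One hedged reading of this Z-shaped distance is to route the comparison through the candidate neighbor's class center, so that within-class affinity is systematically strengthened; the sketch below implements that interpretation on toy data and should not be taken as the authors' exact formula.

```python
# Illustrative reachable ("Z") distance: class-center distance plus the
# within-class distance from the center to the training point.
import numpy as np

def reachable_distance(q, x, center):
    return np.linalg.norm(q - center) + np.linalg.norm(center - x)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
centers = {c: X[y == c].mean(axis=0) for c in np.unique(y)}

q = np.array([2.0, 1.0])                               # query point
dists = [reachable_distance(q, x, centers[c]) for x, c in zip(X, y)]
k = 3
neighbors = np.argsort(dists)[:k]
pred = np.bincount(y[neighbors]).argmax()              # majority vote among k nearest
print("predicted class:", pred)
```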
    Physics-Informed Neural Operators. (arXiv:2207.05748v1 [cs.LG])
    Standard neural networks can approximate general nonlinear operators, represented either explicitly by a combination of mathematical operators, e.g., in an advection-diffusion-reaction partial differential equation, or simply as a black box, e.g., a system-of-systems. The first neural operator was the Deep Operator Network (DeepONet), proposed in 2019 based on rigorous approximation theory. Since then, a few other less general operators have been published, e.g., based on graph neural networks or Fourier transforms. For black-box systems, training of neural operators is data-driven only, but if the governing equations are known, they can be incorporated into the loss function during training to develop physics-informed neural operators. Neural operators can be used as surrogates in design problems, uncertainty quantification, autonomous systems, and almost any application requiring real-time inference. Moreover, independently pre-trained DeepONets can be used as components of a complex multi-physics system by coupling them together with relatively light training. Here, we present a review of DeepONet, the Fourier neural operator, and the graph neural operator, as well as appropriate extensions with feature expansions, and highlight their usefulness in diverse applications in computational mechanics, including porous media, fluid mechanics, and solid mechanics.
    Learning Bellman Complete Representations for Offline Policy Evaluation. (arXiv:2207.05837v1 [cs.LG])
    We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, with results being mostly theoretical in nature. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Square Policy Evaluation (LSPE) with linear functions in our learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the DeepMind Control Suite. We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieves competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both the linear Bellman completeness and coverage components of our method are crucial.
    Sequential Recommendation Model for Next Purchase Prediction. (arXiv:2207.06225v1 [cs.IR])
    Timeliness and contextual accuracy of recommendations are increasingly important when delivering contemporary digital marketing experiences. Conventional recommender systems (RS) suggest relevant but time-invariant items to users by accounting for their past purchases. These recommendations only map to customers' general preferences rather than a customer's specific needs immediately preceding a purchase. In contrast, RSs that consider the order of transactions, purchases, or experiences to measure evolving preferences can offer more salient and effective recommendations to customers: Sequential RSs not only benefit from a better behavioral understanding of a user's current needs but also better predictive power. In this paper, we demonstrate and rank the effectiveness of a sequential recommendation system by utilizing a production dataset of over 2.7 million credit card transactions for 46K cardholders. The method first employs an autoencoder on raw transaction data and submits observed transaction encodings to a GRU-based sequential model. The sequential model produces a MAP@1 metric of 47% on the out-of-sample test set, in line with existing research. We also discuss implications for embedding real-time predictions using the sequential RS into Nexus, a scalable, low-latency, event-based digital experience architecture.
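    The encode-then-sequence pipeline described above can be sketched in a few lines of PyTorch; the layer sizes are our assumptions, and the autoencoder producing the transaction encodings is taken as given:

```python
import torch.nn as nn

class NextPurchaseGRU(nn.Module):
    """Sketch: pre-computed transaction encodings (e.g., from an
    autoencoder) are consumed by a GRU whose last hidden state scores
    candidate items for the next purchase (evaluated with MAP@1)."""
    def __init__(self, enc_dim=64, hidden=128, n_items=1000):
        super().__init__()
        self.gru = nn.GRU(enc_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_items)

    def forward(self, encodings):   # (batch, seq_len, enc_dim)
        _, h = self.gru(encodings)  # h: (num_layers, batch, hidden)
        return self.head(h[-1])     # (batch, n_items) next-item logits
```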
    FedShuffle: Recipes for Better Use of Local Work in Federated Learning. (arXiv:2204.13169v2 [cs.LG] UPDATED)
    The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). In this work, we propose a general recipe, FedShuffle, that better utilizes the local updates in FL, especially in the heterogeneous regime. Unlike many prior works, FedShuffle does not assume any uniformity in the number of updates per device. Our FedShuffle recipe comprises four simple-yet-powerful ingredients: 1) local shuffling of the data, 2) adjustment of the local learning rates, 3) update weighting, and 4) momentum variance reduction (Cutkosky and Orabona, 2019). We present a comprehensive theoretical analysis of FedShuffle and show that both theoretically and empirically, our approach does not suffer from the objective function mismatch that is present in FL methods which assume homogeneous updates in heterogeneous FL setups, e.g., FedAvg (McMahan et al., 2017). In addition, by combining the ingredients above, FedShuffle improves upon FedNova (Wang et al., 2020), which was previously proposed to solve this mismatch. We also show that FedShuffle with momentum variance reduction can improve upon non-local methods under a Hessian similarity assumption. Finally, through experiments on synthetic and real-world datasets, we illustrate how each of the four ingredients used in FedShuffle helps improve the use of local updates in FL.
    Real-Time Intermediate Flow Estimation for Video Frame Interpolation. (arXiv:2011.06294v12 [cs.CV] UPDATED)
    Real-time video frame interpolation (VFI) is very useful in video processing, media players, and display devices. We propose RIFE, a Real-time Intermediate Flow Estimation algorithm for VFI. To realize a high-quality flow-based VFI method, RIFE uses a neural network named IFNet that can estimate the intermediate flows end-to-end at much faster speed. A privileged distillation scheme is designed to stabilize IFNet training and improve the overall performance. RIFE does not rely on pre-trained optical flow models and can support arbitrary-timestep frame interpolation with the temporal encoding input. Experiments demonstrate that RIFE achieves state-of-the-art performance on several public benchmarks. Compared with the popular SuperSlomo and DAIN methods, RIFE is 4--27 times faster and produces better results. Furthermore, RIFE can be extended to wider applications thanks to temporal encoding. The code is available at https://github.com/megvii-research/ECCV2022-RIFE.
    Contextual Decision Trees. (arXiv:2207.06355v1 [stat.ML])
    Focusing on Random Forests, we propose a multi-armed contextual bandit recommendation framework for feature-based selection of a single shallow tree of the learned ensemble. The trained system, which works on top of the Random Forest, dynamically identifies a base predictor that is responsible for providing the final output. In this way, we obtain local interpretations by observing the rules of the recommended tree. The experiments carried out reveal that our dynamic method is superior to an independently fitted CART decision tree and comparable to the whole black-box Random Forest in terms of predictive performance.
    Object Detection as Probabilistic Set Prediction. (arXiv:2203.07980v3 [cs.CV] UPDATED)
    Accurate uncertainty estimates are essential for deploying deep object detectors in safety-critical systems. The development and evaluation of probabilistic object detectors have been hindered by shortcomings in existing performance measures, which tend to involve arbitrary thresholds or limit the detector's choice of distributions. In this work, we propose to view object detection as a set prediction task where detectors predict the distribution over the set of objects. Using the negative log-likelihood for random finite sets, we present a proper scoring rule for evaluating and training probabilistic object detectors. The proposed method can be applied to existing probabilistic detectors, is free from thresholds, and enables fair comparison between architectures. Three different types of detectors are evaluated on the COCO dataset. Our results indicate that the training of existing detectors is optimized toward non-probabilistic metrics. We hope to encourage the development of new object detectors that can accurately estimate their own uncertainty. Code available at https://github.com/georghess/pmb-nll.
    A Word is Worth A Thousand Dollars: Adversarial Attack on Tweets Fools Stock Predictions. (arXiv:2205.01094v3 [cs.CR] UPDATED)
    More and more investors and machine learning models rely on social media (e.g., Twitter and Reddit) to gather real-time information and sentiment to predict stock price movements. Although text-based models are known to be vulnerable to adversarial attacks, whether stock prediction models have similar vulnerability is underexplored. In this paper, we experiment with a variety of adversarial attack configurations to fool three stock prediction victim models. We address the task of adversarial generation by solving combinatorial optimization problems with semantics and budget constraints. Our results show that the proposed attack method can achieve consistent success rates and cause significant monetary loss in trading simulation by simply concatenating a perturbed but semantically similar tweet.
    (Nearly) Optimal Private Linear Regression via Adaptive Clipping. (arXiv:2207.04686v2 [cs.LG] UPDATED)
    We study the problem of differentially private linear regression where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD) where points in each iteration are sampled without replacement. Noise is added for DP but the noise standard deviation is estimated online. Compared to existing $(\epsilon, \delta)$-DP techniques which have sub-optimal error bounds, DP-AMBSSGD is able to provide nearly optimal error bounds in terms of key parameters like dimensionality $d$, number of points $N$, and the standard deviation $\sigma$ of the noise in observations. For example, when the $d$-dimensional covariates are sampled i.i.d. from the normal distribution, then the excess error of DP-AMBSSGD due to privacy is $\frac{\sigma^2 d}{N}(1+\frac{d}{\epsilon^2 N})$, i.e., the error is meaningful when number of samples $N= \Omega(d \log d)$ which is the standard operative regime for linear regression. In contrast, error bounds for existing efficient methods in this setting are: $\mathcal{O}\big(\frac{d^3}{\epsilon^2 N^2}\big)$, even for $\sigma=0$. That is, for constant $\epsilon$, the existing techniques require $N=\Omega(d\sqrt{d})$ to provide a non-trivial result.
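    For orientation, here is the generic differentially private SGD step this line of work builds on: per-example clipping plus Gaussian noise. This is textbook DP-SGD, not the paper's exact DP-AMBSSGD, which additionally estimates the noise scale online and samples each mini-batch without replacement:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip, sigma, rng):
    """Clip each per-example gradient to norm <= clip, average, then add
    Gaussian noise; the noise std is sigma * clip / batch_size because
    the noise is added to the averaged (not summed) gradient."""
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    g = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(clipped), size=g.shape)
    return w - lr * (g + noise)

# rng = np.random.default_rng(0)
```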
    RcTorch: a PyTorch Reservoir Computing Package with Automated Hyper-Parameter Optimization. (arXiv:2207.05870v1 [cs.LG])
    Reservoir computers (RCs) are among the fastest neural networks to train, especially when compared to other recurrent neural networks, while still handling sequential data exceptionally well. However, RC adoption has lagged behind other neural network models because of the model's sensitivity to its hyper-parameters (HPs). A modern unified software package that automatically tunes these parameters is missing from the literature. Manually tuning these numbers is very difficult, and the cost of traditional grid search methods grows exponentially with the number of HPs considered, discouraging the use of RCs and limiting the complexity of the RC models that can be devised. We address these problems by introducing RcTorch, a PyTorch-based RC neural network package with automated HP tuning. Herein, we demonstrate the utility of RcTorch by using it to predict the complex dynamics of a driven pendulum being acted upon by varying forces. This work includes coding examples. Example Python Jupyter notebooks can be found on our GitHub repository https://github.com/blindedjoy/RcTorch and documentation can be found at https://rctorch.readthedocs.io/.
    Simulation-guided Beam Search for Neural Combinatorial Optimization. (arXiv:2207.06190v1 [cs.LG])
    Neural approaches for combinatorial optimization (CO) equip a learning mechanism to discover powerful heuristics for solving complex real-world problems. While neural approaches capable of high-quality solutions in a single shot are emerging, state-of-the-art approaches are often unable to take full advantage of the solving time available to them. In contrast, hand-crafted heuristics perform highly effective search and exploit the computation time given to them, but rely on rules that are difficult to adapt to the dataset being solved. With the goal of providing a powerful search procedure to neural CO approaches, we propose simulation-guided beam search (SGBS), which examines candidate solutions within a fixed-width tree search that both a neural net-learned policy and a simulation (rollout) identify as promising. We further hybridize SGBS with efficient active search (EAS), where SGBS enhances the quality of solutions backpropagated in EAS, and EAS improves the quality of the policy used in SGBS. We evaluate our methods on well-known CO benchmarks and show that SGBS significantly improves the quality of the solutions found under reasonable runtime assumptions.
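    A schematic of the SGBS loop described above; the state/policy/rollout interface is an assumption made for illustration, and the EAS hybridization is not shown:

```python
def sgbs(root, policy, rollout, width=4, expand=8):
    """Simulation-guided beam search sketch: at each depth, expand the
    children the learned policy scores highest, estimate each child with
    a greedy rollout (the simulation), and keep the best `width`."""
    beam = [root]
    while not all(s.is_terminal() for s in beam):
        scored = []
        for s in beam:
            if s.is_terminal():
                scored.append((s.objective(), s))
                continue
            for a in policy.top_actions(s, k=expand):   # policy guidance
                child = s.apply(a)
                scored.append((rollout(child), child))  # simulation guidance
        scored.sort(key=lambda t: t[0], reverse=True)
        beam = [s for _, s in scored[:width]]           # fixed-width prune
    return max(beam, key=lambda s: s.objective())
```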
    Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation. (arXiv:2204.11018v2 [cs.CV] UPDATED)
    Unpaired image-to-image translation aims to find a mapping between the source domain and the target domain. To alleviate the lack of supervised labels for the source images, cycle-consistency based methods have been proposed for image structure preservation by assuming a reversible relationship between unpaired images. However, this assumption only uses limited correspondence between image pairs. Recently, contrastive learning (CL) has been used to further investigate the image correspondence in unpaired image translation by using patch-based positive/negative learning. Patch-based contrastive routines obtain the positives by self-similarity computation and treat the remaining patches as negatives. This flexible learning paradigm obtains auxiliary contextualized information at a low cost. Since the number of negatives is typically very large, we investigate a natural question: are all negatives necessary for feature contrastive learning? Unlike previous CL approaches that use as many negatives as possible, in this paper we study the negatives from an information-theoretic perspective and introduce a new negative Pruning technology for Unpaired image-to-image Translation (PUT) that sparsifies and ranks the patches. The proposed algorithm is efficient and flexible, and enables the model to stably learn essential information between corresponding patches. By putting quality over quantity, only a few negative patches are required to achieve better results. Lastly, we validate the superiority, stability, and versatility of our model through comparative experiments.
    Job Offers Classifier using Neural Networks and Oversampling Methods. (arXiv:2207.06223v1 [cs.IR])
    Both policy and research benefit from a better understanding of individuals' jobs. However, as large-scale administrative records are increasingly employed to represent labor market activity, new automatic methods to classify jobs will become necessary. We developed an automatic job offer classifier using a dataset collected from the largest job bank of Mexico, Bumeran (https://www.bumeran.com.mx/, last visited 19-01-2022). We applied machine learning algorithms such as Support Vector Machines, Naive Bayes, Logistic Regression, Random Forest, and deep learning Long Short-Term Memory (LSTM) networks. Using these algorithms, we trained multi-class models to classify job offers into one of 23 classes (not uniformly distributed): Sales, Administration, Call Center, Technology, Trades, Human Resources, Logistics, Marketing, Health, Gastronomy, Financing, Secretary, Production, Engineering, Education, Design, Legal, Construction, Insurance, Communication, Management, Foreign Trade, and Mining. We used the SMOTE, Geometric-SMOTE, and ADASYN synthetic oversampling algorithms to handle imbalanced classes. The proposed convolutional neural network architecture achieved the best results when the Geometric-SMOTE algorithm was applied.
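    With imbalanced-learn, the oversampling step above is a one-liner; X_train and y_train below are placeholders for the encoded job offers and their 23 class labels, and Geometric-SMOTE is distributed separately from imbalanced-learn:

```python
from imblearn.over_sampling import SMOTE, ADASYN  # ADASYN is a drop-in alternative
from sklearn.linear_model import LogisticRegression

# Resample only the training split so synthetic offers never leak into
# the evaluation data; X_train / y_train are placeholders.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```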
    3D Concept Grounding on Neural Fields. (arXiv:2207.06403v1 [cs.CV])
    In this paper, we address the challenging problem of 3D concept grounding (i.e. segmenting and learning visual concepts) by looking at RGBD images and reasoning about paired questions and answers. Existing visual reasoning approaches typically utilize supervised methods to extract 2D segmentation masks on which concepts are grounded. In contrast, humans are capable of grounding concepts on the underlying 3D representation of images. However, traditionally inferred 3D representations (e.g., point clouds, voxelgrids, and meshes) cannot capture continuous 3D features flexibly, thus making it challenging to ground concepts to 3D regions based on the language description of the object being referred to. To address both issues, we propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts. Specifically, each 3D coordinate in a scene is represented as a high-dimensional descriptor. Concept grounding can then be performed by computing the similarity between the descriptor vector of a 3D coordinate and the vector embedding of a language concept, which enables segmentations and concept learning to be jointly learned on neural fields in a differentiable fashion. As a result, both 3D semantic and instance segmentations can emerge directly from question answering supervision using a set of defined neural operators on top of neural fields (e.g., filtering and counting). Experimental results show that our proposed framework outperforms unsupervised/language-mediated segmentation models on semantic and instance segmentation tasks, as well as outperforms existing models on the challenging 3D aware visual reasoning tasks. Furthermore, our framework can generalize well to unseen shape categories and real scans.
    Neural Topological Ordering for Computation Graphs. (arXiv:2207.05899v1 [cs.LG])
    Recent works on machine learning for combinatorial optimization have shown that learning-based approaches can outperform heuristic methods in terms of speed and performance. In this paper, we consider the problem of finding an optimal topological order on a directed acyclic graph, with a focus on the memory minimization problem which arises in compilers. We propose an end-to-end machine learning based approach for topological ordering using an encoder-decoder framework. Our encoder is a novel attention-based graph neural network architecture called Topoformer which uses different topological transforms of a DAG for message passing. The node embeddings produced by the encoder are converted into node priorities, which are used by the decoder to generate a probability distribution over topological orders. We train our model on a dataset of synthetically generated graphs called layered graphs. We show that our model outperforms, or is on par with, several topological ordering baselines while being significantly faster on synthetic graphs with up to 2k nodes. We also train and test our model on a set of real-world computation graphs, showing performance improvements.
    HiClass: a Python library for local hierarchical classification compatible with scikit-learn. (arXiv:2112.06560v5 [cs.LG] UPDATED)
    HiClass is an open-source Python library for local hierarchical classification entirely compatible with scikit-learn. It contains implementations of the most common design patterns for hierarchical machine learning models found in the literature, i.e., the local classifiers per node, per parent node and per level. Additionally, the package contains implementations of hierarchical metrics, which are more appropriate for evaluating classification performance on hierarchical data. The documentation includes installation and usage instructions, examples within tutorials and interactive notebooks, and a complete description of the API. HiClass is released under the simplified BSD license, encouraging its use in both academic and commercial environments. Source code and documentation are available at https://github.com/mirand863/hiclass.
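    A minimal usage sketch following the pattern in the library's documentation (the toy features and label hierarchy are ours):

```python
from sklearn.ensemble import RandomForestClassifier
from hiclass import LocalClassifierPerNode

# One row of hierarchical labels (root -> leaf) per training sample.
X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_train = [["Animal", "Mammal"], ["Animal", "Bird"], ["Plant", "Tree"]]

clf = LocalClassifierPerNode(local_classifier=RandomForestClassifier())
clf.fit(X_train, y_train)
print(clf.predict([[2.0, 3.0]]))  # predicts a full root-to-leaf path
```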
    Information-theoretic Inducing Point Placement for High-throughput Bayesian Optimisation. (arXiv:2206.02437v2 [cs.LG] UPDATED)
    Sparse Gaussian Processes are a key component of high-throughput Bayesian optimisation (BO) loops -- an increasingly common setting where evaluation budgets are large and highly parallelised. By using representative subsets of the available data to build approximate posteriors, sparse models dramatically reduce the computational costs of surrogate modelling by relying on a small set of pseudo-observations, the so-called inducing points, in lieu of the full data set. However, current approaches to design inducing points are not appropriate within BO loops as they seek to reduce global uncertainty in the objective function. Thus, the high-fidelity modelling of promising and data-dense regions required for precise optimisation is sacrificed and computational resources are instead wasted on modelling areas of the space already known to be sub-optimal. Inspired by entropy-based BO methods, we propose a novel inducing point design that uses a principled information-theoretic criterion to select inducing points. By choosing inducing points to maximally reduce both global uncertainty and uncertainty in the maximum value of the objective function, we build surrogate models able to support high-precision high-throughput BO.
    D-CBRS: Accounting For Intra-Class Diversity in Continual Learning. (arXiv:2207.05897v1 [cs.LG])
    Continual learning -- accumulating knowledge from a sequence of learning experiences -- is an important yet challenging problem. In this paradigm, the model's performance for previously encountered instances may substantially drop as additional data are seen. When dealing with class-imbalanced data, forgetting is further exacerbated. Prior work has proposed replay-based approaches which aim at reducing forgetting by intelligently storing instances for future replay. Although Class-Balancing Reservoir Sampling (CBRS) has been successful in dealing with imbalanced data, the intra-class diversity has not been accounted for, implicitly assuming that each instance of a class is equally informative. We present Diverse-CBRS (D-CBRS), an algorithm that allows us to consider within class diversity when storing instances in the memory. Our results show that D-CBRS outperforms state-of-the-art memory management continual learning algorithms on data sets with considerable intra-class diversity.
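    A loose sketch of the CBRS-style memory that D-CBRS extends; the eviction logic is simplified, and the diversity-aware eviction that distinguishes D-CBRS is only indicated in a comment:

```python
import random
from collections import defaultdict

class ClassBalancingReservoir:
    """Fill the memory until full; afterwards, an incoming instance of a
    non-largest class evicts a random instance of the largest class,
    keeping classes balanced. D-CBRS would instead evict the most
    redundant instance of that class, preserving intra-class diversity."""
    def __init__(self, capacity):
        self.capacity, self.size = capacity, 0
        self.buffer = defaultdict(list)

    def add(self, x, y):
        if self.size < self.capacity:
            self.buffer[y].append(x)
            self.size += 1
            return
        largest = max(self.buffer, key=lambda c: len(self.buffer[c]))
        if y != largest:
            self.buffer[largest].pop(random.randrange(len(self.buffer[largest])))
            self.buffer[y].append(x)
        # else: reservoir-sample within the largest class (omitted)
```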
    DeepTIMe: Deep Time-Index Meta-Learning for Non-Stationary Time-Series Forecasting. (arXiv:2207.06046v1 [cs.LG])
    Deep learning has been actively applied to time-series forecasting, leading to a deluge of new autoregressive model architectures. Yet, despite the attractive properties of time-index based models, such as representing the series as a continuous function over time and thus yielding smooth representations, little attention has been given to them. Indeed, while naive deep time-index based models are far more expressive than the manually predefined function representations of classical time-index based models, they are inadequate for forecasting due to their lack of inductive biases and the non-stationarity of time series. In this paper, we propose DeepTIMe, a deep time-index based model trained via a meta-learning formulation which overcomes these limitations, yielding an efficient and accurate forecasting model. Extensive experiments on real-world datasets demonstrate that our approach achieves competitive results with state-of-the-art methods and is highly efficient. Code is available at https://github.com/salesforce/DeepTIMe.
    Exploration in Deep Reinforcement Learning: A Comprehensive Survey. (arXiv:2109.06668v4 [cs.AI] UPDATED)
    Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant successes across a wide range of domains, including game AI, autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample inefficient: millions of interactions are usually needed even for relatively simple problem settings, preventing wide application and deployment in real-world industrial scenarios. One bottleneck challenge is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that could benefit policy learning towards optimal policies. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey on existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. According to our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
    GraphMAE: Self-Supervised Masked Graph Autoencoders. (arXiv:2205.10803v3 [cs.LG] UPDATED)
    Self-supervised learning (SSL) has been extensively explored in recent years. In particular, generative SSL has seen emerging success in natural language processing and other AI fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning, which heavily relies on structural data augmentation and complicated training strategies, has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present GraphMAE, a masked graph autoencoder that mitigates these issues for generative self-supervised graph pretraining. Instead of reconstructing graph structures, we propose to focus on feature reconstruction with both a masking strategy and a scaled cosine error that benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results show that GraphMAE, a simple graph autoencoder with careful designs, consistently outperforms both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised pre-training on graphs.
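    The scaled cosine error mentioned above has a compact form; a sketch (the exponent name gamma and the mean reduction over masked nodes are our choices):

```python
import torch.nn.functional as F

def scaled_cosine_error(x_rec, x, gamma=2.0):
    """One minus the cosine similarity between reconstructed and original
    node features, raised to gamma >= 1 so that easy, already
    well-reconstructed features contribute little to the loss."""
    cos = F.cosine_similarity(x_rec, x, dim=-1)
    return ((1.0 - cos) ** gamma).mean()
```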
    FD-GATDR: A Federated-Decentralized-Learning Graph Attention Network for Doctor Recommendation Using EHR. (arXiv:2207.05750v1 [cs.IR])
    In the past decade, with the development of big data technology, an increasing amount of patient information has been stored as electronic health records (EHRs). Leveraging these data, various doctor recommendation systems have been proposed. Typically, such studies process the EHR data in a flat-structured manner, where each encounter is treated as an unordered set of features. Nevertheless, heterogeneous structured information, such as the service sequences stored in claims, should not be ignored. This paper presents a doctor recommendation system with time embedding that reconstructs the potential connections between patients and doctors using a heterogeneous graph attention network. Besides, to address the privacy issue of patient data sharing across hospitals, a federated decentralized learning method based on a minimization optimization model is also proposed. The graph-based recommendation system has been validated on an EHR dataset. Compared to baseline models, the proposed method improves the AUC by up to 6.2%. Moreover, our proposed federated algorithm not only matches the performance of a fictitious centralized fusion center but also enjoys a convergence rate of O(1/T).
    Reinforcement Learning Assisted Recursive QAOA. (arXiv:2207.06294v1 [quant-ph])
    Variational quantum algorithms such as the Quantum Approximate Optimization Algorithm (QAOA) have gained popularity in recent years as they offer the hope of using NISQ devices to tackle hard combinatorial optimization problems. It is, however, known that at low depth, certain locality constraints of QAOA limit its performance. To go beyond these limitations, a non-local variant of QAOA, namely recursive QAOA (RQAOA), was proposed to improve the quality of approximate solutions. RQAOA has been studied comparatively less than QAOA, and it is less understood, for instance, for which families of instances it may fail to provide high-quality solutions. However, as we are tackling $\mathsf{NP}$-hard problems (specifically, the Ising spin model), it is expected that RQAOA does fail, raising the question of designing even better quantum algorithms for combinatorial optimization. In this spirit, we identify and analyze cases where RQAOA fails and, based on this, propose a reinforcement learning enhanced RQAOA variant (RL-RQAOA) that improves upon RQAOA. We show that the performance of RL-RQAOA improves over RQAOA: RL-RQAOA is strictly better on the identified instances where RQAOA underperforms, and performs similarly on instances where RQAOA is near-optimal. Our work exemplifies the potentially beneficial synergy between reinforcement learning and quantum (inspired) optimization in the design of new, even better heuristics for hard problems.
    Goal-Oriented Sensitivity Analysis of Hyperparameters in Deep Learning. (arXiv:2207.06216v1 [stat.ML])
    Tackling new machine learning problems with neural networks always means optimizing numerous hyperparameters that define their structure and strongly impact their performance. In this work, we study the use of goal-oriented sensitivity analysis, based on the Hilbert-Schmidt Independence Criterion (HSIC), for hyperparameter analysis and optimization. Hyperparameters live in spaces that are often complex and awkward: they can be of different natures (categorical, discrete, boolean, continuous), interact, and have inter-dependencies, all of which makes it non-trivial to perform classical sensitivity analysis. We alleviate these difficulties to obtain a robust analysis index that is able to quantify hyperparameters' relative impact on a neural network's final error. This valuable tool allows us to better understand hyperparameters and to make hyperparameter optimization more interpretable. We illustrate the benefits of this knowledge in the context of hyperparameter optimization and derive an HSIC-based optimization algorithm that we apply to MNIST and CIFAR, classical machine learning datasets, but also to the approximation of the Runge function and of the solution of the Bateman equations, of interest for scientific machine learning. This method yields neural networks that are both competitive and cost-effective.
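    The HSIC index underlying this analysis has a standard empirical estimator; a sketch with RBF kernels (the biased estimator form and the fixed bandwidth are our simplifications):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC between hyperparameter samples X (n, d) and
    the resulting errors Y (n, 1); larger values suggest the
    hyperparameters have more impact on the final error."""
    def rbf(A):
        sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
        return np.exp(-sq / (2 * sigma**2))
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    return np.trace(rbf(X) @ H @ rbf(Y) @ H) / (n - 1) ** 2
```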
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v2 [stat.ML] UPDATED)
    Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply only to pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.
    DiverGet: A Search-Based Software Testing Approach for Deep Neural Network Quantization Assessment. (arXiv:2207.06282v1 [cs.LG])
    Quantization is one of the most applied Deep Neural Network (DNN) compression strategies when deploying a trained DNN model on an embedded system or a cell phone. This is owing to its simplicity and adaptability to a wide range of applications and circumstances, as opposed to specific Artificial Intelligence (AI) accelerators and compilers that are often designed only for certain specific hardware (e.g., Google Coral Edge TPU). With the growing demand for quantization, ensuring the reliability of this strategy is becoming a critical challenge. Traditional testing methods, which gather more and more genuine data for better assessment, are often not practical because of the large size of the input space and the high similarity between the original DNN and its quantized counterpart. As a result, advanced assessment strategies have become of paramount importance. In this paper, we present DiverGet, a search-based testing framework for quantization assessment. DiverGet defines a space of metamorphic relations that simulate naturally-occurring distortions on the inputs. Then, it optimally explores these relations to reveal the disagreements among DNNs of different arithmetic precision. We evaluate the performance of DiverGet on state-of-the-art DNNs applied to hyperspectral remote sensing images. We chose remote sensing DNNs as they are increasingly deployed at the edge (e.g., on high-lift drones) in critical domains like climate change research and astronomy. Our results show that DiverGet successfully challenges the robustness of established quantization techniques against naturally-occurring shifted data, and outperforms its most recent competitor, DiffChaser, with a success rate that is (on average) four times higher.
    Does GNN Pretraining Help Molecular Representation?. (arXiv:2207.06010v1 [cs.LG])
    Extracting informative representations of molecules using Graph Neural Networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find that the benefit brought by self-supervised pretraining on molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to determine how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not have statistically significant advantages over non-pretraining methods in many settings. Second, although improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Third, experimental hyper-parameters have a larger impact on downstream-task accuracy than the choice of pretraining tasks. We hypothesize that the complexity of pretraining on molecules is insufficient, leading to less transferable knowledge for downstream tasks.
    OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses. (arXiv:2204.02426v4 [cs.LG] UPDATED)
    Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we propose OccamNets, which are biased to favor simpler solutions by design. OccamNets have two inductive biases. First, they are biased to use as little network depth as needed for an individual example. Second, they are biased toward using fewer image locations for prediction. While OccamNets are biased toward simpler hypotheses, they can learn more complex hypotheses if necessary. In experiments, OccamNets outperform or rival state-of-the-art methods run on architectures that do not incorporate these inductive biases. Furthermore, we demonstrate that when state-of-the-art debiasing methods are combined with OccamNets, results further improve.
    Implicit Neural Representations for Generative Modeling of Living Cell Shapes. (arXiv:2207.06283v1 [cs.CV])
    Methods allowing the synthesis of realistic cell shapes could help generate training data sets to improve cell tracking and segmentation in biomedical images. Deep generative models for cell shape synthesis require a light-weight and flexible representation of the cell shape. However, commonly used voxel-based representations are unsuitable for high-resolution shape synthesis, and polygon meshes have limitations when modeling topology changes such as cell growth or mitosis. In this work, we propose to use level sets of signed distance functions (SDFs) to represent cell shapes. We optimize a neural network as an implicit neural representation of the SDF value at any point in a 3D+time domain. The model is conditioned on a latent code, thus allowing the synthesis of new and unseen shape sequences. We validate our approach quantitatively and qualitatively on C. elegans cells that grow and divide, and lung cancer cells with growing complex filopodial protrusions. Our results show that shape descriptors of synthetic cells resemble those of real cells, and that our model is able to generate topologically plausible sequences of complex cell shapes in 3D+time.
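    The conditional implicit representation described above amounts to a small coordinate MLP; a sketch in which layer sizes, latent dimension, and concatenation-based conditioning are our assumptions:

```python
import torch
import torch.nn as nn

class CellSDF(nn.Module):
    """Maps a 3D+time coordinate plus a per-sequence latent code to a
    signed distance value; the zero level set is the cell surface, and
    sampling new latent codes yields new, unseen shape sequences."""
    def __init__(self, latent_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyzt, z):          # xyzt: (N, 4), z: (latent_dim,)
        z = z.expand(xyzt.shape[0], -1)  # share the code across all points
        return self.net(torch.cat([xyzt, z], dim=-1))
```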
    Forecasting COVID-19 spreading through an ensemble of classical and machine learning models: Spain's case study. (arXiv:2207.05753v1 [cs.LG])
    In this work we evaluate the applicability of an ensemble of population models and machine learning models to predict the near-future evolution of the COVID-19 pandemic, with a particular use case in Spain. We rely solely on open, public datasets, fusing incidence, vaccination, human mobility, and weather data to feed our machine learning models (Random Forest, Gradient Boosting, k-Nearest Neighbours, and Kernel Ridge Regression). We use the incidence data to adjust classical population models (Gompertz, Logistic, Richards, Bertalanffy) in order to better capture the trend of the data. We then ensemble these two families of models to obtain a more robust and accurate prediction. Furthermore, we observe an improvement in the predictions obtained with machine learning models as we add new features (vaccines, mobility, climatic conditions), analyzing the importance of each of them using Shapley Additive Explanation values. As in any other modelling work, data and prediction quality have several limitations and must therefore be viewed from a critical standpoint, as we discuss in the text. Our work concludes that the ensemble use of these models improves the individual predictions (using only machine learning models or only population models) and can be applied, with caution, in cases where compartmental models cannot be utilized due to the lack of relevant data.
    Universal expressiveness of variational quantum classifiers and quantum kernels for support vector machines. (arXiv:2207.05865v1 [quant-ph])
    Machine learning is considered to be one of the most promising applications of quantum computing. Therefore, the search for quantum advantage in the quantum analogues of machine learning models is a key research goal. Here, we show that variational quantum classifiers (VQC) and support vector machines with quantum kernels (QSVM) can solve a classification problem based on the k-Forrelation problem, which is known to be PromiseBQP-complete. Because the PromiseBQP complexity class includes all Bounded-Error Quantum Polynomial-Time (BQP) decision problems, our results imply that there exists a feature map and a quantum kernel that make VQC and QSVM efficient solvers for any BQP problem. This means that the feature map of VQC or the quantum kernel of QSVM can be designed to have quantum advantage for any classification problem that cannot be solved classically in polynomial time but can be solved in polynomial time by a quantum computer.
    On Merging Feature Engineering and Deep Learning for Diagnosis, Risk-Prediction and Age Estimation Based on the 12-Lead ECG. (arXiv:2207.06096v1 [cs.LG])
    Objective: Machine learning techniques have been used extensively for 12-lead electrocardiogram (ECG) analysis. For physiological time series, the superiority of deep learning (DL) over feature engineering (FE) approaches based on domain knowledge is still an open question. Moreover, it remains unclear whether combining DL with FE may improve performance. Methods: We considered three tasks intended to address these research gaps: cardiac arrhythmia diagnosis (multiclass-multilabel classification), atrial fibrillation risk prediction (binary classification), and age estimation (regression). We used an overall dataset of 2.3M 12-lead ECG recordings to train the following models for each task: i) a random forest taking the FE as input, trained as a classical machine learning approach; ii) an end-to-end DL model; and iii) a merged model of FE+DL. Results: FE yielded results comparable to DL while requiring significantly less data for the two classification tasks, and it was outperformed by DL for the regression task. For all tasks, merging FE with DL did not improve performance over DL alone. Conclusion: We found that for traditional 12-lead ECG based diagnosis tasks, DL did not yield a meaningful improvement over FE, while it significantly improved the nontraditional regression task. We also found that combining FE with DL did not improve over DL alone, which suggests that the engineered features were redundant with the features learned by DL. Significance: Our findings provide important recommendations on which machine learning strategy and data regime to choose for the task at hand when developing new machine learning models based on the 12-lead ECG.
    Normalized gradient flow optimization in the training of ReLU artificial neural networks. (arXiv:2207.06246v1 [math.OC])
    The training of artificial neural networks (ANNs) is nowadays a highly relevant algorithmic procedure with many applications in science and industry. Roughly speaking, ANNs can be regarded as iterated compositions between affine linear functions and certain fixed nonlinear functions, which are usually multidimensional versions of a one-dimensional so-called activation function. The most popular choice of such a one-dimensional activation function is the rectified linear unit (ReLU) activation function which maps a real number to its positive part $ \mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R} $. In this article we propose and analyze a modified variant of the standard training procedure of such ReLU ANNs in the sense that we propose to restrict the negative gradient flow dynamics to a large submanifold of the ANN parameter space, which is a strict $ C^{ \infty } $-submanifold of the entire ANN parameter space that seems to enjoy better regularity properties than the entire ANN parameter space but which is also sufficiently large and sufficiently high dimensional so that it can represent all ANN realization functions that can be represented through the entire ANN parameter space. In the special situation of shallow ANNs with just one-dimensional ANN layers we also prove for every Lipschitz continuous target function that every gradient flow trajectory on this large submanifold of the ANN parameter space is globally bounded. For the standard gradient flow on the entire ANN parameter space with Lipschitz continuous target functions it remains an open problem of research to prove or disprove the global boundedness of gradient flow trajectories even in the situation of shallow ANNs with just one-dimensional ANN layers.
    Learning robust marking policies for adaptive mesh refinement. (arXiv:2207.06339v1 [math.NA])
    In this work, we revisit the marking decisions made in the standard adaptive finite element method (AFEM). Experience shows that a naïve marking policy leads to inefficient use of computational resources for adaptive mesh refinement (AMR). Consequently, using AFEM in practice often involves ad-hoc or time-consuming offline parameter tuning to set appropriate parameters for the marking subroutine. To address these practical concerns, we recast AMR as a Markov decision process in which refinement parameters can be selected on-the-fly at run time, without the need for pre-tuning by expert users. In this new paradigm, the refinement parameters are also chosen adaptively via a marking policy that can be optimized using methods from reinforcement learning. We use the Poisson equation to demonstrate our techniques on $h$- and $hp$-refinement benchmark problems, and our experiments suggest that superior marking policies remain undiscovered for many classical AFEM applications. Furthermore, an unexpected observation from this work is that marking policies trained on one family of PDEs are sometimes robust enough to perform well on problems far outside the training family. For illustration, we show that a simple $hp$-refinement policy trained on 2D domains with only a single re-entrant corner can be deployed on far more complicated 2D domains, and even 3D domains, without significant performance loss. For reproduction and broader adoption, we accompany this work with an open-source implementation of our methods.
    Earthformer: Exploring Space-Time Transformers for Earth System Forecasting. (arXiv:2207.05833v1 [cs.LG])
    Conventionally, Earth system (e.g., weather and climate) forecasting relies on numerical simulation with complex physical models and is hence both computationally expensive and demanding in domain expertise. With the explosive growth of spatiotemporal Earth observation data in the past decade, data-driven models that apply Deep Learning (DL) are demonstrating impressive potential for various Earth system forecasting tasks. The Transformer, as an emerging DL architecture, despite its broad success in other domains, has had limited adoption in this area. In this paper, we propose Earthformer, a space-time Transformer for Earth system forecasting. Earthformer is based on a generic, flexible, and efficient space-time attention block named Cuboid Attention. The idea is to decompose the data into cuboids and apply cuboid-level self-attention in parallel. These cuboids are further connected with a collection of global vectors. We conduct experiments on the MovingMNIST dataset and a newly proposed chaotic N-body MNIST dataset to verify the effectiveness of cuboid attention and figure out the best design of Earthformer. Experiments on two real-world benchmarks on precipitation nowcasting and El Nino/Southern Oscillation (ENSO) forecasting show that Earthformer achieves state-of-the-art performance.
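    The cuboid decomposition at the heart of Earthformer can be sketched as a reshape-attend-reshape pattern; the cuboid size and the generic attn module are assumptions, and the global vectors that connect cuboids are omitted:

```python
def cuboid_self_attention(x, attn, cuboid=(2, 4, 4)):
    """Split a (T, H, W, C) tensor into non-overlapping cuboids, run
    self-attention within each cuboid in parallel, then stitch the
    results back. T, H, W must be divisible by the cuboid size; attn is
    any module mapping (batch, tokens, C) -> (batch, tokens, C)."""
    T, H, W, C = x.shape
    t, h, w = cuboid
    x = x.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w, C)
    x = attn(x)                          # attention inside each cuboid
    x = x.reshape(T // t, H // h, W // w, t, h, w, C)
    x = x.permute(0, 3, 1, 4, 2, 5, 6).reshape(T, H, W, C)
    return x
```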
    Implicit regularization of dropout. (arXiv:2207.05952v1 [cs.LG])
    It is important to understand how the popular regularization method dropout helps the neural network training find a good generalization solution. In this work, we theoretically derive the implicit regularization of dropout and study the relation between the Hessian matrix of the loss function and the covariance matrix of the dropout noise, supported by a series of experiments. We then numerically study two implications of the implicit regularization of dropout, which intuitively rationalize why dropout helps generalization. First, we find in experiments that training with dropout finds a neural network with a flatter minimum compared with standard gradient descent training, and that the implicit regularization is the key to finding flat solutions. Second, when trained with dropout, the input weights of hidden neurons (the input weight of a hidden neuron consists of the weight from the input layer to the hidden neuron and its bias term) tend to condense on isolated orientations. Condensation is a feature of the non-linear learning process that keeps the neural network at low complexity. Although our theory mainly focuses on dropout used in the last hidden layer, our experiments apply to general dropout in training neural networks. This work points out distinct characteristics of dropout compared with stochastic gradient descent and serves as an important basis for fully understanding dropout.
    Learning to Control Local Search for Combinatorial Optimization. (arXiv:2206.13181v2 [cs.LG] UPDATED)
    Combinatorial optimization problems are encountered in many practical contexts such as logistics and production, but exact solutions are particularly difficult to find and usually NP-hard for considerable problem sizes. To compute approximate solutions, a zoo of generic as well as problem-specific variants of local search is commonly used. However, which variant to apply to which particular problem is difficult to decide even for experts. In this paper we identify three independent algorithmic aspects of such local search algorithms and formalize their sequential selection over an optimization process as a Markov Decision Process (MDP). We design a deep graph neural network as the policy model for this MDP, yielding a learned controller for local search called NeuroLS. Ample experimental evidence shows that NeuroLS is able to outperform both well-known general-purpose local search controllers from Operations Research and the latest machine learning-based approaches.
    Human-AI Collaboration in Decision-Making: Beyond Learning to Defer. (arXiv:2206.13202v2 [cs.LG] UPDATED)
    Human-AI collaboration (HAIC) in decision-making aims to create synergistic teaming between human decision-makers and AI systems. Learning to defer (L2D) has been presented as a promising framework to determine who among humans and AI should make which decisions in order to optimize the performance and fairness of the combined system. Nevertheless, L2D entails several often unfeasible requirements, such as the availability of predictions from humans for every instance or ground-truth labels that are independent from said humans. Furthermore, neither L2D nor alternative approaches tackle fundamental issues of deploying HAIC systems in real-world settings, such as capacity management or dealing with dynamic environments. In this paper, we aim to identify and review these and other limitations, pointing to where opportunities for future research in HAIC may lie.
    Robust Data-Driven Predictive Control using Reachability Analysis. (arXiv:2103.14110v3 [eess.SY] UPDATED)
    We present a robust data-driven control scheme for an unknown linear system model with bounded process and measurement noise. Instead of depending on a system model in traditional predictive control, a controller utilizing data-driven reachable regions is proposed. The data-driven reachable regions are based on a matrix zonotope recursion and are computed based on only noisy input-output data of a trajectory of the system. We assume that measurement and process noise are contained in bounded sets. While we assume knowledge of these bounds, no knowledge about the statistical properties of the noise is assumed. In the noise-free case, we prove that the presented purely data-driven control scheme results in an equivalent closed-loop behavior to a nominal model predictive control scheme. In the case of measurement and process noise, our proposed scheme guarantees robust constraint satisfaction, which is essential in safety-critical applications. Numerical experiments show the effectiveness of the proposed data-driven controller in comparison to model-based control schemes.
    Towards Highly Expressive Machine Learning Models of Non-Melanoma Skin Cancer. (arXiv:2207.05749v1 [cs.LG])
    Pathologists have a rich vocabulary with which they can describe all the nuances of cellular morphology. In their world, there is a natural pairing of images and words. Recent advances demonstrate that machine learning models can now be trained to learn high-quality image features and represent them as discrete units of information. This enables natural language, which is also discrete, to be jointly modelled alongside the imaging, resulting in a description of the contents of the imaging. Here we present experiments in applying discrete modelling techniques to the problem domain of non-melanoma skin cancer, specifically, histological images of Intraepidermal Carcinoma (IEC). Implementing a VQ-GAN model to reconstruct high-resolution (256x256) IEC images, we trained a sequence-to-sequence transformer to generate natural language descriptions using pathologist terminology. Combined with the idea of interactive concept vectors made available by continuous generative methods, we demonstrate an additional angle of interpretability. The result is a promising means of working towards highly expressive machine learning systems which are useful not only as predictive and classification tools, but also as a means to further our scientific understanding of disease.
    Optimistic PAC Reinforcement Learning: the Instance-Dependent View. (arXiv:2207.05852v1 [cs.LG])
    Optimistic algorithms have been extensively studied for regret minimization in episodic tabular MDPs, both from a minimax and an instance-dependent view. However, for the PAC RL problem, where the goal is to identify a near-optimal policy with high probability, little is known about their instance-dependent sample complexity. A negative result of Wagenmaker et al. (2021) suggests that optimistic sampling rules cannot be used to attain the (still elusive) optimal instance-dependent sample complexity. On the positive side, we provide the first instance-dependent bound for an optimistic algorithm for PAC RL, BPI-UCRL, for which only minimax guarantees were available (Kaufmann et al., 2021). While our bound features some minimal visitation probabilities, it also features a refined notion of sub-optimality gap compared to the value gaps that appear in prior work. Moreover, in MDPs with deterministic transitions, we show that BPI-UCRL is actually near-optimal. On the technical side, our analysis is very simple thanks to a new "target trick" of independent interest. We complement these findings with a novel hardness result explaining why the instance-dependent complexity of PAC RL cannot be easily related to that of regret minimization, unlike in the minimax regime.
    A Conceptual Framework for Using Machine Learning to Support Child Welfare Decisions. (arXiv:2207.05855v1 [cs.CY])
    Human services systems make key decisions that impact individuals in the society. The U.S. child welfare system makes such decisions, from screening-in hotline reports of suspected abuse or neglect for child protective investigations, placing children in foster care, to returning children to permanent home settings. These complex and impactful decisions on children's lives rely on the judgment of child welfare decisionmakers. Child welfare agencies have been exploring ways to support these decisions with empirical, data-informed methods that include machine learning (ML). This paper describes a conceptual framework for ML to support child welfare decisions. The ML framework guides how child welfare agencies might conceptualize a target problem that ML can solve; vet available administrative data for building ML; formulate and develop ML specifications that mirror relevant populations and interventions the agencies are undertaking; deploy, evaluate, and monitor ML as child welfare context, policy, and practice change over time. Ethical considerations, stakeholder engagement, and avoidance of common pitfalls underpin the framework's impact and success. From abstract to concrete, we describe one application of this framework to support a child welfare decision. This ML framework, though child welfare-focused, is generalizable to solving other public policy problems.
    Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. (arXiv:2202.06934v4 [cs.CV] UPDATED)
    Detection of small objects and objects far away in the scene is a major challenge in surveillance applications. Such objects are represented by a small number of pixels in the image and lack sufficient detail, making them difficult to detect using conventional detectors. In this work, an open-source framework called Slicing Aided Hyper Inference (SAHI) is proposed that provides a generic slicing-aided inference and fine-tuning pipeline for small object detection. The proposed technique is generic in the sense that it can be applied on top of any available object detector without any fine-tuning. Experimental evaluations, using object detection baselines on the Visdrone and xView aerial object detection datasets, show that the proposed inference method can increase object detection AP by 6.8%, 5.1% and 5.3% for FCOS, VFNet and TOOD detectors, respectively. Moreover, the detection accuracy can be further increased with slicing-aided fine-tuning, resulting in a cumulative increase of 12.7%, 13.4% and 14.5% AP in the same order. The proposed technique has been integrated with Detectron2, MMDetection and YOLOv5 models and is publicly available at https://github.com/obss/sahi.git .
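    The slicing idea itself is easy to sketch. Below is a minimal, hypothetical Python illustration (not the SAHI library's actual API): tile the image with overlap, run any detector per tile, and shift the boxes back to full-image coordinates. `detect` stands in for an arbitrary off-the-shelf detector.

```python
from typing import Callable, List, Tuple
import numpy as np

Box = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score)

def sliced_inference(image: np.ndarray,
                     detect: Callable[[np.ndarray], List[Box]],
                     slice_size: int = 512,
                     overlap: float = 0.2) -> List[Box]:
    """Run `detect` on overlapping tiles; small objects cover proportionally
    more pixels in each tile, which is what boosts AP."""
    h, w = image.shape[:2]
    stride = max(1, int(slice_size * (1.0 - overlap)))
    boxes: List[Box] = []
    for y0 in range(0, h, stride):
        for x0 in range(0, w, stride):
            tile = image[y0:y0 + slice_size, x0:x0 + slice_size]
            for x1, y1, x2, y2, score in detect(tile):
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
    return boxes  # in practice, merge duplicates across tiles with NMS
```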
    RelaxLoss: Defending Membership Inference Attacks without Losing Utility. (arXiv:2207.05801v1 [cs.LG])
    As a long-term threat to the privacy of training data, membership inference attacks (MIAs) emerge ubiquitously in machine learning models. Existing work evidences a strong connection between the distinguishability of the training and testing loss distributions and the model's vulnerability to MIAs. Motivated by these results, we propose a novel training framework based on a relaxed loss with a more achievable learning target, which leads to a narrowed generalization gap and reduced privacy leakage. RelaxLoss is applicable to any classification model, with the added benefits of easy implementation and negligible overhead. Through extensive evaluations on five datasets with diverse modalities (images, medical data, transaction records), our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs as well as model utility. Our defense is the first that can withstand a wide range of attacks while preserving (or even improving) the target model's utility. Source code is available at https://github.com/DingfanChen/RelaxLoss
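    As a rough sketch of the "more achievable learning target" idea (this is not the paper's exact training schedule, and `alpha` is an illustrative hyperparameter), one can stop driving the cross-entropy below a floor, so the training loss distribution stays closer to the test one:

```python
import torch
import torch.nn.functional as F

def relaxed_loss(logits: torch.Tensor, targets: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """Descend toward a loss floor `alpha` instead of toward zero: once the
    cross-entropy falls below `alpha`, the gradient pushes it back up,
    narrowing the train/test loss gap that MIAs exploit."""
    ce = F.cross_entropy(logits, targets)
    return torch.abs(ce - alpha)
```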
    Unsupervised Recognition of Informative Features via Tensor Network Machine Learning and Quantum Entanglement Variations. (arXiv:2207.06031v1 [quant-ph])
    Given an image of a white shoe drawn on a blackboard, how are the white pixels deemed (say, by human minds) to be informative for recognizing the shoe without any labeling information on the pixels? Here we investigate such a "white shoe" recognition problem from the perspective of tensor network (TN) machine learning and quantum entanglement. Utilizing a generative TN that captures the probability distribution of the features as quantum amplitudes, we propose an unsupervised scheme for recognizing informative features via the variations of entanglement entropy (EE) caused by designed measurements. In this way, a given sample, where the values of its features are statistically meaningless, is mapped to variations of EE that are statistically meaningful. We show that the EE variations identify the features that are critical to recognizing this specific sample, and the EE itself reveals the information distribution from the TN model. The signs of the variations further reveal the entanglement structures among the features. We test the validity of our scheme on a toy dataset of strip images, the MNIST dataset of hand-drawn digits, and the fashion-MNIST dataset of pictures of fashion articles. Our scheme opens an avenue toward quantum-inspired, interpretable unsupervised learning and could be applied to, e.g., image segmentation and object detection.
    Online Decision Transformer. (arXiv:2202.05607v2 [cs.LG] UPDATED)
    Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.
    ConvGeN: Convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets. (arXiv:2206.09812v2 [cs.LG] UPDATED)
    Data is commonly stored in tabular format. Several fields of research are prone to small imbalanced tabular data. Supervised Machine Learning on such data is often difficult due to class imbalance. Synthetic data generation, i.e., oversampling, is a common remedy used to improve classifier performance. State-of-the-art linear interpolation approaches, such as LoRAS and ProWRAS can be used to generate synthetic samples from the convex space of the minority class to improve classifier performance in such cases. Deep generative networks are common deep learning approaches for synthetic sample generation, widely used for synthetic image generation. However, their scope on synthetic tabular data generation in the context of imbalanced classification is not adequately explored. In this article, we show that existing deep generative models perform poorly compared to linear interpolation based approaches for imbalanced classification problems on smaller tabular datasets. To overcome this, we propose a deep generative model, ConvGeN that combines the idea of convex space learning with deep generative models. ConvGeN learns the coefficients for the convex combinations of the minority class samples, such that the synthetic data is distinct enough from the majority class. Our benchmarking experiments demonstrate that our proposed model ConvGeN improves imbalanced classification on such small datasets, as compared to existing deep generative models, while being at-par with the existing linear interpolation approaches. Moreover, we discuss how our model can be used for synthetic tabular data generation in general, even outside the scope of data imbalance and thus, improves the overall applicability of convex space learning.
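    A non-learned baseline makes the convex-space idea concrete. The sketch below draws synthetic minority points from the convex hull of random k-subsets using Dirichlet weights; ConvGeN's contribution is to learn these coefficients so the samples stay separable from the majority class, which this sketch omits.

```python
import numpy as np

def convex_oversample(X_min: np.ndarray, n_new: int, k: int = 5,
                      seed: int = 0) -> np.ndarray:
    """Generate `n_new` synthetic samples as convex combinations of k
    randomly chosen minority-class points (weights >= 0, summing to 1)."""
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    out = np.empty((n_new, d))
    for i in range(n_new):
        idx = rng.choice(n, size=min(k, n), replace=False)
        w = rng.dirichlet(np.ones(len(idx)))  # convex coefficients
        out[i] = w @ X_min[idx]
    return out
```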
    Collaboration-Aware Graph Convolutional Networks for Recommendation Systems. (arXiv:2207.06221v1 [cs.IR])
    By virtue of the message-passing that implicitly injects collaborative effect into the embedding process, Graph Neural Networks (GNNs) have been successfully adopted in recommendation systems. Nevertheless, most existing message-passing mechanisms in recommendation are directly inherited from GNNs without any recommendation-tailored modification. Although some efforts have been made towards simplifying GNNs to improve the performance/efficiency of recommendation, no study has comprehensively scrutinized how message-passing captures collaborative effect and whether the captured effect would benefit the prediction of user preferences over items. Therefore, in this work we aim to demystify the collaborative effect captured by message-passing in GNNs and develop new insights towards customizing message-passing for recommendation. First, we theoretically analyze how message-passing captures and leverages the collaborative effect in predicting user preferences. Then, to determine whether the captured collaborative effect would benefit the prediction of user preferences, we propose a recommendation-oriented topological metric, Common Interacted Ratio (CIR), which measures the level of interaction between a specific neighbor of a node and the rest of its neighborhood set. Inspired by our theoretical and empirical analysis, we propose a recommendation-tailored GNN, Augmented Collaboration-Aware Graph Convolutional Network (CAGCN*), that extends the LightGCN framework and is able to selectively pass information from neighbors based on their CIR via the Collaboration-Aware Graph Convolution. Experimental results on six benchmark datasets show that CAGCN* outperforms the most representative GNN-based recommendation model, LightGCN, by 9% in Recall@20 while also achieving more than a 79% speedup. Our code is publicly available at https://github.com/YuWVandy/CAGCN.
    A new hope for network model generalization. (arXiv:2207.05843v1 [cs.NI])
    Generalizing machine learning (ML) models for network traffic dynamics tends to be considered a lost cause. Hence, for every new task, we often resort to designing new models and training them on model-specific datasets collected, whenever possible, in an environment mimicking the model's deployment. This approach essentially gives up on generalization. Yet, an ML architecture called the Transformer has enabled previously unimaginable generalization in other domains. Nowadays, one can download a model pre-trained on massive datasets and only fine-tune it for a specific task and context with comparatively little time and data. These fine-tuned models are now state-of-the-art for many benchmarks. We believe this progress could translate to networking and propose a Network Traffic Transformer (NTT), a transformer adapted to learn network dynamics from packet traces. Our initial results are promising: NTT seems able to generalize to new prediction tasks and contexts. This study suggests there is still hope for generalization, though it calls for a lot of future research.
    Experiments on Anomaly Detection in Autonomous Driving by Forward-Backward Style Transfers. (arXiv:2207.06055v1 [cs.CV])
    Great progress has been achieved in the autonomous driving community over the past few years. As a safety-critical problem, however, anomaly detection remains a huge hurdle towards the large-scale deployment of autonomous vehicles in the real world. While many approaches, such as uncertainty estimation or segmentation-based image resynthesis, are extremely promising, there is more to be explored. Inspired in particular by works on anomaly detection based on image resynthesis, we propose a novel approach for anomaly detection through style transfer. We leverage generative models to map an image from its original style domain of road traffic to an arbitrary one and back to generate pixelwise anomaly scores. However, our experiments have proven our hypothesis wrong, and we were unable to produce significant results. Nevertheless, we want to share our findings so that others can learn from our experiments.
    OSLAT: Open Set Label Attention Transformer for Medical Entity Span Extraction. (arXiv:2207.05817v1 [cs.CL])
    Identifying spans in medical texts that correspond to medical entities is one of the core steps for many healthcare NLP tasks, such as ICD coding, medical finding extraction, and medical note contextualization. Existing entity extraction methods rely on a fixed and limited vocabulary of medical entities and have difficulty extracting entities represented by disjoint spans. In this paper, we present a new transformer-based architecture called OSLAT, the Open Set Label Attention Transformer, that addresses many of the limitations of previous methods. Our approach uses the label-attention mechanism to implicitly learn spans associated with entities of interest. These entities can be provided as free text, including entities not seen during OSLAT's training, and the model can extract spans even when they are disjoint. To test the generalizability of our method, we train two separate models on two different datasets which have very low entity overlap: (1) a public discharge notes dataset from hNLP, and (2) a much more challenging proprietary patient text dataset, "Reasons for Encounter" (RFE). We find that OSLAT models trained on either dataset outperform rule-based and fuzzy string matching baselines when applied to the RFE dataset as well as to the portion of the hNLP dataset where entities are represented by disjoint spans. Our code can be found at https://github.com/curai/curai-research/tree/main/OSLAT.
    Towards Knowledge-based Mining of Mental Disorder Patterns from Textual Data. (arXiv:2207.06254v1 [cs.IR])
    Mental health disorders can have severe consequences for countries' economies and public health. For example, the impacts of the COVID-19 pandemic, such as isolation and travel bans, can make us feel depressed. Identifying early signs of mental health disorders is vital; depression, for example, may increase an individual's risk of suicide. The state-of-the-art research in identifying mental disorder patterns from textual data uses hand-labelled training sets, especially when a domain expert's knowledge is required to analyse various symptoms. This task can be time-consuming and expensive. To address this challenge, in this paper we study and analyse various clinical and non-clinical approaches to identifying mental health disorders. We leverage domain knowledge and expertise in cognitive science to build a domain-specific Knowledge Base (KB) for mental health disorder concepts and patterns. We present a weaker form of supervision by facilitating the generation of training data from the domain-specific KB. We adopt a typical scenario for analysing social media to identify major depressive disorder symptoms from textual content generated by social users. We use this scenario to evaluate how our knowledge-based approach significantly improves the quality of results.
    Non-Myopic Multifidelity Bayesian Optimization. (arXiv:2207.06325v1 [cs.LG])
    Bayesian optimization is a popular framework for the optimization of black box functions. Multifidelity methods accelerate Bayesian optimization by exploiting low-fidelity representations of expensive objective functions. Popular multifidelity Bayesian strategies rely on sampling policies that account only for the immediate reward obtained by evaluating the objective function at a specific input, precluding the greater informative gains that might be obtained by looking several steps ahead. This paper proposes a non-myopic multifidelity Bayesian framework to capture the long-term reward from future steps of the optimization. Our computational strategy comes with a two-step lookahead multifidelity acquisition function that maximizes the cumulative reward obtained by measuring the improvement in the solution over two steps ahead. We demonstrate that the proposed algorithm outperforms a standard multifidelity Bayesian framework on popular benchmark optimization problems.
    Efficient Adaptive Regret Minimization. (arXiv:2207.00646v2 [cs.LG] UPDATED)
    In online convex optimization the player aims to minimize her regret against a fixed comparator over the entire repeated game. Algorithms that minimize standard regret may converge to a fixed decision, which is undesirable in changing or dynamic environments. This motivates the stronger metric of adaptive regret, or the maximum regret over any continuous sub-interval in time. Existing adaptive regret algorithms suffer from a computational penalty - typically on the order of a multiplicative factor that grows logarithmically in the number of game iterations. In this paper we show how to reduce this computational penalty to be doubly logarithmic in the number of game iterations, with minimal degradation to the optimal attainable adaptive regret bounds.
    Efficient and Scalable Recommendation via Item-Item Graph Partitioning. (arXiv:2207.05959v1 [cs.IR])
    Collaborative filtering (CF) is a widely studied problem in recommender systems. Linear autoencoders are a well-established method for CF, estimating item-item relations by encoding user-item interactions. Despite the excellent performance of linear autoencoders, the rapidly increasing computational and storage costs caused by the growing number of items limit their scalability in large-scale real-world scenarios. Recently, graph-based approaches have achieved success on CF with high scalability, and have been shown to share commonalities with linear autoencoders in user-item interaction modeling. Motivated by this, we propose efficient and scalable recommendation via item-item graph partitioning (ERGP), aiming to address the limitations of linear autoencoders. In particular, a recursive graph partitioning strategy is proposed to ensure that the item set is divided into several partitions of finite size. Linear autoencoders encode user-item interactions within partitions while preserving global information across the entire item set. This allows ERGP to have guaranteed efficiency and high scalability when the number of items increases. Experiments conducted on 3 public datasets and 3 open benchmarking datasets demonstrate the effectiveness of ERGP, which outperforms state-of-the-art models with lower training time and storage costs.
    N-Grammer: Augmenting Transformers with latent n-grams. (arXiv:2207.06366v1 [cs.CL])
    Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there is significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, thus necessitating more research in identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams that are constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our model in Jax for reproducibility purposes.
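    A minimal sketch of the augmentation step follows. The hash function, padding id, and vocabulary sizes are illustrative assumptions, and the n-gram embedding is combined with the token embedding by simple addition here as a simplification of the paper's scheme.

```python
import numpy as np

def add_latent_bigrams(token_ids: np.ndarray, tok_emb: np.ndarray,
                       ngram_emb: np.ndarray, prime: int = 1_000_003) -> np.ndarray:
    """Hash each adjacent pair of discrete latent IDs into an n-gram
    vocabulary and inject the looked-up bigram embedding into the token
    representation."""
    v_ngram = ngram_emb.shape[0]
    x = tok_emb[token_ids]                             # (T, d) token embeddings
    prev = np.concatenate(([0], token_ids[:-1]))       # previous token, pad id 0
    bigram_ids = (prev * prime + token_ids) % v_ngram  # hashed bigram IDs
    return x + ngram_emb[bigram_ids]
```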
    Domain adaptation strategies for cancer-independent detection of lymph node metastases. (arXiv:2207.06193v1 [eess.IV])
    Recently, large, high-quality public datasets have led to the development of convolutional neural networks that can detect lymph node metastases of breast cancer at the level of expert pathologists. Many cancers, regardless of the site of origin, can metastasize to lymph nodes. However, collecting and annotating high-volume, high-quality datasets for every cancer type is challenging. In this paper we investigate how to leverage existing high-quality datasets most efficiently in multi-task settings for closely related tasks. Specifically, we will explore different training and domain adaptation strategies, including prevention of catastrophic forgetting, for colon and head-and-neck cancer metastasis detection in lymph nodes. Our results show state-of-the-art performance on both cancer metastasis detection tasks. Furthermore, we show the effectiveness of repeated adaptation of networks from one cancer type to another to obtain multi-task metastasis detection networks. Last, we show that leveraging existing high-quality datasets can significantly boost performance on new target tasks and that catastrophic forgetting can be effectively mitigated using regularization.
    BR-SNIS: Bias Reduced Self-Normalized Importance Sampling. (arXiv:2207.06364v1 [stat.ML])
    Importance Sampling (IS) is a method for approximating expectations under a target distribution using independent samples from a proposal distribution and the associated importance weights. In many applications, the target distribution is known only up to a normalization constant, in which case self-normalized IS (SNIS) can be used. While the use of self-normalization can have a positive effect on the dispersion of the estimator, it introduces bias. In this work, we propose a new method, BR-SNIS, whose complexity is essentially the same as that of SNIS and which significantly reduces bias without increasing the variance. This method is a wrapper in the sense that it uses the same proposal samples and importance weights as SNIS, but makes clever use of iterated sampling--importance resampling (ISIR) to form a bias-reduced version of the estimator. We furnish the proposed algorithm with rigorous theoretical results, including new bias, variance and high-probability bounds, and these are illustrated by numerical examples.
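    For reference, plain SNIS is only a few lines; the bias that BR-SNIS targets comes from the ratio of the two weight sums below. This is a generic numpy sketch, assuming vectorized log-density callables for the target and proposal.

```python
import numpy as np

def snis(f, target_logpdf, proposal_logpdf, proposal_sample, n=10_000, seed=0):
    """Self-normalized IS estimate of E_target[f(X)] when the target density
    is known only up to a normalizing constant."""
    rng = np.random.default_rng(seed)
    x = proposal_sample(rng, n)
    logw = target_logpdf(x) - proposal_logpdf(x)   # log unnormalized weights
    w = np.exp(logw - logw.max())                  # numerically stabilized weights
    return np.sum(w * f(x)) / np.sum(w)            # the ratio is what introduces bias
```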
    Data-driven Control of Agent-based Models: an Equation/Variable-free Machine Learning Approach. (arXiv:2207.05779v1 [math.DS])
    We present an Equation/Variable-free machine learning (EVFML) framework for the control of the collective dynamics of complex/multiscale systems modelled via microscopic/agent-based simulators. The approach obviates the need for construction of surrogate, reduced-order models. The proposed implementation consists of three steps: (A) from high-dimensional agent-based simulations, machine learning (in particular, non-linear manifold learning via Diffusion Maps (DMs)) helps identify a set of coarse-grained variables that parametrize the low-dimensional manifold on which the emergent/collective dynamics evolve; the out-of-sample extension and pre-image problems, i.e. the construction of non-linear mappings from the high-dimensional input space to the low-dimensional manifold and back, are solved by coupling DMs with the Nystrom extension and Geometric Harmonics, respectively; (B) having identified the manifold and its coordinates, we exploit the Equation-free approach to perform numerical bifurcation analysis of the emergent dynamics; and (C) based on the previous steps, we design data-driven embedded wash-out controllers that drive the agent-based simulators to their intrinsic, imprecisely known, emergent open-loop unstable steady states, thus demonstrating that the scheme is robust against numerical approximation errors and modelling uncertainty. The efficiency of the framework is illustrated by controlling emergent unstable (i) traveling waves of a deterministic agent-based model of traffic dynamics, and (ii) equilibria of a stochastic financial market agent model with mimesis.
    Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers. (arXiv:2107.12636v3 [cs.CV] UPDATED)
    Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques to improve their cross-domain performance remains unexplored and unclear. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone only brings limited improvements, as it does not guarantee domain-invariant sequence features in the transformer for prediction. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequence of both domains. DQFA reduces the domain discrepancy in global feature representations and object relations when deployed in the transformer encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence from both domains, which reduces the domain gaps in local and instance-level feature representations in the transformer encoder and decoder, respectively. In addition, a novel bipartite matching consistency loss is proposed to enhance the feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code has been made available at: https://github.com/encounter1997/SFA.
    Interactive Machine Learning: A State of the Art Review. (arXiv:2207.06196v1 [cs.LG])
    Machine learning has proved useful in many software disciplines, including computer vision, speech and audio processing, natural language processing, robotics and other fields. However, its applicability has been significantly hampered by its black-box nature and significant resource consumption: performance is achieved at the expense of enormous computational resources, often compromising the robustness and trustworthiness of the model. Recent research has identified a lack of interactivity as the prime source of these machine learning problems. Consequently, interactive machine learning (iML) has attracted increased attention from researchers on account of its human-in-the-loop modality and relatively efficient resource utilization. A state-of-the-art review of interactive machine learning therefore plays a vital role in easing the effort toward building human-centred models. In this paper, we provide a comprehensive analysis of the state of the art of iML. We analyze salient research works using a mixed taxonomy that is both merit-oriented and application/task-oriented. We use a bottom-up clustering approach to generate a taxonomy of iML research works. Research on adversarial black-box attacks and corresponding iML-based defense systems, exploratory machine learning, resource-constrained learning, and iML performance evaluation is analyzed under its corresponding theme in our merit-oriented taxonomy. We have further classified these research works into technical and sectoral categories. Finally, research opportunities that we believe are promising for future work in iML are discussed thoroughly.
    Multiple Kernel Clustering with Dual Noise Minimization. (arXiv:2207.06041v1 [cs.LG])
    Clustering is a representative unsupervised method widely applied in multi-modal and multi-view scenarios. Multiple kernel clustering (MKC) aims to group data by integrating complementary information from base kernels. As a representative, late fusion MKC first decomposes the kernels into orthogonal partition matrices, then learns a consensus one from them, and has recently achieved promising performance. However, these methods fail to consider the noise inside the partition matrix, preventing further improvement of clustering performance. We discover that this noise can be disassembled into two separable parts, i.e. N-noise and C-noise (null space noise and column space noise). In this paper, we rigorously define dual noise and propose a novel parameter-free MKC algorithm by minimizing it. To solve the resultant optimization problem, we design an efficient two-step iterative strategy. To the best of our knowledge, this is the first investigation of dual noise within the partition in the kernel space. We observe that dual noise pollutes the block diagonal structures and incurs the degeneration of clustering performance, and that C-noise exhibits stronger destruction than N-noise. Owing to our efficient mechanism for minimizing dual noise, the proposed algorithm surpasses recent methods by large margins.
    Long Term Fairness for Minority Groups via Performative Distributionally Robust Optimization. (arXiv:2207.05777v1 [cs.LG])
    Fairness researchers in machine learning (ML) have coalesced around several fairness criteria which provide formal definitions of what it means for an ML model to be fair. However, these criteria have some serious limitations. We identify four key shortcomings of these formal fairness criteria, and aim to help address them by extending performative prediction to include a distributionally robust objective.
    Estimating Test Performance for AI Medical Devices under Distribution Shift with Conformal Prediction. (arXiv:2207.05796v1 [cs.LG])
    Estimating the test performance of software AI-based medical devices under distribution shifts is crucial for evaluating the safety, efficiency, and usability prior to clinical deployment. Due to the nature of regulated medical device software and the difficulty in acquiring large amounts of labeled medical datasets, we consider the task of predicting the test accuracy of an arbitrary black-box model on an unlabeled target domain without modification to the original training process or any distributional assumptions of the original source data (i.e. we treat the model as a "black-box" and only use the predicted output responses). We propose a "black-box" test estimation technique based on conformal prediction and evaluate it against other methods on three medical imaging datasets (mammography, dermatology, and histopathology) under several clinically relevant types of distribution shift (institution, hardware scanner, atlas, hospital). We hope that by promoting practical and effective estimation techniques for black-box models, manufacturers of medical devices will develop more standardized and realistic evaluation procedures to improve the robustness and trustworthiness of clinical AI tools.
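    One plausible flavor of such a "black-box" estimate is sketched below. This is an illustrative proxy, not the paper's estimator: calibrate a nonconformity threshold on labeled source data, then use the fraction of target predictions that clear it as an accuracy surrogate; only the model's output probabilities are consumed.

```python
import numpy as np

def conformal_accuracy_proxy(cal_probs, cal_labels, tgt_probs, alpha=0.1):
    """cal_probs: (n, K) predicted probabilities on a labeled source
    calibration set; tgt_probs: (m, K) on the unlabeled target domain."""
    # nonconformity = 1 - probability assigned to the true class
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    thresh = np.quantile(scores, 1.0 - alpha)
    # fraction of target points whose top-class nonconformity clears the threshold
    return np.mean(1.0 - tgt_probs.max(axis=1) <= thresh)
```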
    Radar Image Reconstruction from Raw ADC Data using Parametric Variational Autoencoder with Domain Adaptation. (arXiv:2207.06379v1 [cs.CV])
    This paper presents a parametric variational autoencoder-based human target detection and localization framework working directly with the raw analog-to-digital converter data from a frequency modulated continuous wave radar. We propose a parametrically constrained variational autoencoder, with residual and skip connections, capable of generating clustered and localized target detections on the range-angle image. Furthermore, to circumvent the problem of training the proposed neural network on all possible scenarios using real radar data, we propose domain adaptation strategies whereby we first train the neural network using ray tracing based model data and then adapt the network to work on real sensor data. This strategy ensures better generalization and scalability of the proposed neural network even though it is trained with limited radar data. We demonstrate the superior detection and localization performance of our proposed solution compared to the conventional signal processing pipeline and an earlier state-of-the-art deep U-Net architecture with range-Doppler images as inputs.
    Revealing Unfair Models by Mining Interpretable Evidence. (arXiv:2207.05811v1 [cs.LG])
    The popularity of machine learning has increased the risk of unfair models being deployed in high-stakes applications, such as the justice system, drug/vaccination design, and medical diagnosis. Although there are effective methods to train fair models from scratch, how to automatically reveal and explain the unfairness of a trained model remains a challenging task. Revealing the unfairness of machine learning models in an interpretable fashion is a critical step towards fair and trustworthy AI. In this paper, we systematically tackle the novel task of revealing unfair models by mining interpretable evidence (RUMIE). The key idea is to find solid evidence in the form of a group of data instances discriminated most by the model. To make the evidence interpretable, we also find a set of human-understandable key attributes and decision rules that characterize the discriminated data instances and distinguish them from other, non-discriminated data. As demonstrated by extensive experiments on many real-world datasets, our method finds highly interpretable and solid evidence that effectively reveals the unfairness of trained models. Moreover, it is much more scalable than all of the baseline methods.
    Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS. (arXiv:2207.06000v1 [cs.CL])
    Expressive text-to-speech has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still be interested in controlling speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder which models the semantic relationship between a text description embedding and a speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.
    Shape-Aware Masking for Inpainting in Medical Imaging. (arXiv:2207.05787v1 [eess.IV])
    Inpainting has recently been proposed as a successful deep learning technique for unsupervised medical image model discovery. The masks used for inpainting are generally independent of the dataset and are not tailored to different classes of anatomy. In this work, we introduce a method for generating shape-aware masks for inpainting, which aims at learning the statistical shape prior. We hypothesize that although the variation of masks improves the generalizability of inpainting models, the shape of the masks should follow the topology of the organs of interest. Hence, we propose an unsupervised guided masking approach based on an off-the-shelf inpainting model and a superpixel over-segmentation algorithm to generate a wide range of shape-dependent masks. Experimental results on abdominal MR image reconstruction show the superiority of our proposed masking method over standard methods using square-shaped masks or a dataset of irregular-shape masks.
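    The superpixel step is easy to sketch with scikit-image. This is a simplified, unguided version: the paper additionally guides mask selection with an off-the-shelf inpainting model, which is omitted here, and a single-channel MR slice is assumed.

```python
import numpy as np
from skimage.segmentation import slic

def shape_aware_mask(image: np.ndarray, n_segments: int = 200,
                     n_pick: int = 5, seed: int = 0) -> np.ndarray:
    """Over-segment a grayscale slice into superpixels and union a few
    randomly chosen ones into an inpainting mask, so mask shapes follow
    local anatomy instead of being square."""
    rng = np.random.default_rng(seed)
    labels = slic(image, n_segments=n_segments, compactness=10,
                  channel_axis=None)  # channel_axis=None: single-channel input
    chosen = rng.choice(np.unique(labels), size=n_pick, replace=False)
    return np.isin(labels, chosen)    # boolean mask of pixels to inpaint
```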
    Probing the Robustness of Independent Mechanism Analysis for Representation Learning. (arXiv:2207.06137v1 [stat.ML])
    One aim of representation learning is to recover the original latent code that generated the data, a task which requires additional information or inductive biases. A recently proposed approach termed Independent Mechanism Analysis (IMA) postulates that each latent source should influence the observed mixtures independently, complementing standard nonlinear independent component analysis, and taking inspiration from the principle of independent causal mechanisms. While it was shown in theory and experiments that IMA helps recovering the true latents, the method's performance was so far only characterized when the modeling assumptions are exactly satisfied. Here, we test the method's robustness to violations of the underlying assumptions. We find that the benefits of IMA-based regularization for recovering the true sources extend to mixing functions with various degrees of violation of the IMA principle, while standard regularizers do not provide the same merits. Moreover, we show that unregularized maximum likelihood recovers mixing functions which systematically deviate from the IMA principle, and provide an argument elucidating the benefits of IMA-based regularization.
    Logistics, Graphs, and Transformers: Towards improving Travel Time Estimation. (arXiv:2207.05835v1 [cs.LG])
    The problem of travel time estimation is widely considered the fundamental challenge of modern logistics. The complex nature of the interconnections between the spatial aspects of roads and the temporal dynamics of ground transport still leaves room for experimentation. Moreover, the total volume of currently accumulated data encourages the construction of learning models with the potential to significantly outperform earlier solutions. In order to address the problem of travel time estimation, we propose a new method based on the transformer architecture, TransTTE.
    On the Robustness of Bayesian Neural Networks to Adversarial Attacks. (arXiv:2207.06154v1 [cs.LG])
    Vulnerability to adversarial attacks is one of the principal hurdles to the adoption of deep learning in safety-critical applications. Despite significant efforts, both practical and theoretical, training deep learning models robust to adversarial attacks is still an open problem. In this paper, we analyse the geometry of adversarial attacks in the large-data, overparameterized limit for Bayesian Neural Networks (BNNs). We show that, in the limit, vulnerability to gradient-based attacks arises as a result of degeneracy in the data distribution, i.e., when the data lies on a lower-dimensional submanifold of the ambient space. As a direct consequence, we demonstrate that in this limit BNN posteriors are robust to gradient-based adversarial attacks. Crucially, we prove that the expected gradient of the loss with respect to the BNN posterior distribution is vanishing, even when each neural network sampled from the posterior is vulnerable to gradient-based attacks. Experimental results on the MNIST, Fashion MNIST, and half moons datasets, representing the finite data regime, with BNNs trained with Hamiltonian Monte Carlo and Variational Inference, support this line of argument, showing that BNNs can display both high accuracy on clean data and robustness to gradient-based as well as gradient-free adversarial attacks.
    Understanding Unfairness in Fraud Detection through Model and Data Bias Interactions. (arXiv:2207.06273v1 [cs.LG])
    In recent years, machine learning algorithms have become ubiquitous in a multitude of high-stakes decision-making applications. The unparalleled ability of machine learning algorithms to learn patterns from data also enables them to incorporate biases embedded within. A biased model can then make decisions that disproportionately harm certain groups in society -- limiting their access to financial services, for example. The awareness of this problem has given rise to the field of Fair ML, which focuses on studying, measuring, and mitigating unfairness in algorithmic prediction, with respect to a set of protected groups (e.g., race or gender). However, the underlying causes for algorithmic unfairness still remain elusive, with researchers divided between blaming either the ML algorithms or the data they are trained on. In this work, we maintain that algorithmic unfairness stems from interactions between models and biases in the data, rather than from isolated contributions of either of them. To this end, we propose a taxonomy to characterize data bias and we study a set of hypotheses regarding the fairness-accuracy trade-offs that fairness-blind ML algorithms exhibit under different data bias settings. On our real-world account-opening fraud use case, we find that each setting entails specific trade-offs, affecting fairness in expected value and variance -- the latter often going unnoticed. Moreover, we show how algorithms compare differently in terms of accuracy and fairness, depending on the biases affecting the data. Finally, we note that under specific data bias conditions, simple pre-processing interventions can successfully balance group-wise error rates, while the same techniques fail in more complex settings.
    QT-Routenet: Improved GNN generalization to larger 5G networks by fine-tuning predictions from queueing theory. (arXiv:2207.06336v1 [cs.NI])
    In order to promote the use of machine learning in 5G, the International Telecommunication Union (ITU) proposed in 2021 the second edition of the ITU AI/ML in 5G challenge, with over 1600 participants from 82 countries. This work details the second place solution overall, which is also the winning solution of the Graph Neural Networking Challenge 2021. We tackle the problem of generalization when applying a model to a 5G network that may have longer paths and larger link capacities than the ones observed in training. To achieve this, we propose to first extract robust features related to Queueing Theory (QT), and then fine-tune the analytical baseline prediction using a modification of the Routenet Graph Neural Network (GNN) model. The proposed solution generalizes much better than simply using Routenet, and manages to reduce the analytical baseline's 10.42 mean absolute percent error to 1.45 (1.27 with an ensemble). This suggests that making small changes to an approximate model that is known to be robust can be an effective way to improve accuracy without compromising generalization.
    Open set learning with augmented category by exploiting unlabelled data (open-LACU). (arXiv:2002.01368v4 [stat.ML] UPDATED)
    Considering the nature of unlabelled data, it is common for partially labelled training datasets to contain samples that belong to novel categories. Although these so-called observed novel categories exist in the training data, they do not belong to any of the training labels. In contrast, open-sets define novel categories as those unobserved during training but present during testing. This research is the first to generalize between observed and unobserved novel categories within a new learning policy called open-set learning with augmented category by exploiting unlabelled data, or open-LACU. This study conducts a high-level review of novelty detection so as to differentiate between research fields that concern observed novel categories and those that concern unobserved novel categories. Open-LACU is then introduced as a synthesis of the relevant fields to maintain the advantages of each within a single learning policy. Currently, we are finalising the first open-LACU network, which will be combined with this preprint and sent for publication.
    Machine Learning Assisted Approach for Security-Constrained Unit Commitment. (arXiv:2111.09824v2 [eess.SY] UPDATED)
    Security-constrained unit commitment (SCUC) is solved for power system day-ahead generation scheduling, which is a large-scale mixed-integer linear programming problem and is very computationally intensive. Model reduction of SCUC may bring significant time savings. In this work, a novel approach is proposed to effectively utilize machine learning (ML) to reduce the problem size of SCUC. An ML model using logistic regression (LR) algorithm is proposed and trained with historical nodal demand profiles and the respective commitment schedules. The ML outputs are processed and analyzed to reduce variables and constraints in SCUC. The proposed approach is validated on several standard test systems including IEEE 24-bus system, IEEE 73-bus system, IEEE 118-bus system, synthetic South Carolina 500-bus system and Polish 2383-bus system. Simulation results demonstrate that the use of the prediction from the proposed LR model in SCUC model reduction can substantially reduce the computing time while maintaining solution quality.
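    A hedged sketch of the variable-fixing step follows. All shapes, stand-in labels, and the 0.9/0.1 confidence thresholds are illustrative assumptions; the actual study trains on historical nodal demand profiles paired with commitment schedules.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_hist = rng.random((500, 24))                     # stand-in daily demand profiles
y_hist = (X_hist.mean(axis=1) > 0.5).astype(int)   # stand-in on/off commitments

# One classifier per (generator, hour) pair; a single one is shown here.
clf = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)

p_on = clf.predict_proba(rng.random((1, 24)))[0, 1]
if p_on > 0.9:
    pass  # fix this unit's binary variable to 1 in the reduced SCUC MILP
elif p_on < 0.1:
    pass  # fix it to 0; otherwise leave the variable free for the solver
```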
    Policy Optimization with Sparse Global Contrastive Explanations. (arXiv:2207.06269v1 [cs.LG])
    We develop a Reinforcement Learning (RL) framework for improving an existing behavior policy via sparse, user-interpretable changes. Our goal is to make minimal changes while gaining as much benefit as possible. We define a minimal change as having a sparse, global contrastive explanation between the original and proposed policy. We improve the current policy with the constraint of keeping that global contrastive explanation short. We demonstrate our framework with a discrete MDP and a continuous 2D navigation domain.
    Contextual Bandits with Smooth Regret: Efficient Learning in Continuous Action Spaces. (arXiv:2207.05849v1 [cs.LG])
    Designing efficient general-purpose contextual bandit algorithms that work with large -- or even continuous -- action spaces would facilitate application to important scenarios such as information retrieval, recommendation systems, and continuous control. While obtaining standard regret guarantees can be hopeless, alternative regret notions have been proposed to tackle the large action setting. We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives. We design a statistically and computationally efficient algorithm -- for the proposed smooth regret -- that works with general function approximation under standard supervised oracles. We also present an adaptive algorithm that automatically adapts to any smoothness level. Our algorithms can be used to recover the previous minimax/Pareto optimal guarantees under the standard regret, e.g., in bandit problems with multiple best arms and Lipschitz/H{\"o}lder bandits. We conduct large-scale empirical evaluations demonstrating the efficacy of our proposed algorithms.
    How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. (arXiv:2102.08921v2 [cs.LG] UPDATED)
    Devising domain- and model-agnostic evaluation metrics for generative models is an important and as yet unresolved problem. Most existing metrics, which were tailored solely to the image synthesis setup, exhibit a limited capacity for diagnosing the different modes of failure of generative models across broader application domains. In this paper, we introduce a 3-dimensional evaluation metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity. We introduce generalization as an additional, independent dimension (to the fidelity-diversity trade-off) that quantifies the extent to which a model copies training data -- a crucial performance indicator when modeling sensitive data with requirements on privacy. The three metric components correspond to (interpretable) probabilistic quantities, and are estimated via sample-level binary classification. The sample-level nature of our metric inspires a novel use case which we call model auditing, wherein we judge the quality of individual samples generated by a (black-box) model, discarding low-quality samples and hence improving the overall model performance in a post-hoc manner.
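    In the same spirit (though not the paper's exact estimators), a support-based fidelity check can be sketched with nearest-neighbor distances: a synthetic sample counts as high-fidelity if it lies within the typical real-to-real neighbor radius.

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_like(real: np.ndarray, synth: np.ndarray,
                   alpha: float = 0.95) -> float:
    """Fraction of synthetic samples falling inside an alpha-quantile
    nearest-neighbor radius of the real data (illustrative proxy only)."""
    d_rr = cdist(real, real)
    np.fill_diagonal(d_rr, np.inf)                 # exclude self-distances
    radius = np.quantile(d_rr.min(axis=1), alpha)  # typical real NN distance
    d_sr = cdist(synth, real).min(axis=1)          # synth-to-nearest-real
    return float(np.mean(d_sr <= radius))
```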
    Rotting Infinitely Many-armed Bandits. (arXiv:2201.12975v2 [cs.LG] UPDATED)
    We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho=o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T,\sqrt{T}\})$ worst-case regret lower bound where $T$ is the horizon time. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T,\sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T,T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
    D-CBRS: Accounting For Intra-Class Diversity in Continual Learning. (arXiv:2207.05897v1 [cs.LG])
    Continual learning -- accumulating knowledge from a sequence of learning experiences -- is an important yet challenging problem. In this paradigm, the model's performance for previously encountered instances may substantially drop as additional data are seen. When dealing with class-imbalanced data, forgetting is further exacerbated. Prior work has proposed replay-based approaches which aim at reducing forgetting by intelligently storing instances for future replay. Although Class-Balancing Reservoir Sampling (CBRS) has been successful in dealing with imbalanced data, the intra-class diversity has not been accounted for, implicitly assuming that each instance of a class is equally informative. We present Diverse-CBRS (D-CBRS), an algorithm that allows us to consider within class diversity when storing instances in the memory. Our results show that D-CBRS outperforms state-of-the-art memory management continual learning algorithms on data sets with considerable intra-class diversity.
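    For context, CBRS itself fits in a few lines. The sketch below uses uniform within-class eviction; D-CBRS would replace that with a diversity-aware criterion (e.g. dropping the instance closest to its class neighbors in feature space), which is omitted here.

```python
import random
from collections import defaultdict

class ClassBalancingReservoir:
    """Fixed-size replay memory whose class counts stay as balanced as the
    stream allows (CBRS sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.memory = defaultdict(list)  # class label -> stored instances
        self.seen = defaultdict(int)     # per-class counts seen in the stream

    def add(self, x, y) -> None:
        self.seen[y] += 1
        if sum(len(v) for v in self.memory.values()) < self.capacity:
            self.memory[y].append(x)
            return
        largest = max(self.memory, key=lambda c: len(self.memory[c]))
        if y != largest:
            # make room by evicting from the currently largest class
            victims = self.memory[largest]
            victims.pop(random.randrange(len(victims)))
            self.memory[y].append(x)
        else:
            # incoming class is already largest: reservoir-sample within it
            j = random.randrange(self.seen[y])
            if j < len(self.memory[y]):
                self.memory[y][j] = x
```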
    Hindsight Learning for MDPs with Exogenous Inputs. (arXiv:2207.06272v1 [cs.LG])
    We develop a reinforcement learning (RL) framework for applications that deal with sequential decisions and exogenous uncertainty, such as resource allocation and inventory management. In these applications, the uncertainty is only due to exogenous variables like future demands. A popular approach is to predict the exogenous variables using historical data and then plan with the predictions. However, this indirect approach requires high-fidelity modeling of the exogenous process to guarantee good downstream decision-making, which can be impractical when the exogenous process is complex. In this work we propose an alternative approach based on hindsight learning which sidesteps modeling the exogenous process. Our key insight is that, unlike Sim2Real RL, we can revisit past decisions in the historical data and derive counterfactual consequences for other actions in these applications. Our framework uses hindsight-optimal actions as the policy training signal and has strong theoretical guarantees on decision-making performance. We develop an algorithm using our framework to allocate compute resources for real-world Microsoft Azure workloads. The results show our approach learns better policies than domain-specific heuristics and Sim2Real RL baselines.
    Stochastic Functional Analysis and Multilevel Vector Field Anomaly Detection. (arXiv:2207.06229v1 [stat.ML])
    Massive vector field datasets are common in multi-spectral optical and radar sensors and modern multimodal MRI data, among many other areas of application. In this paper we develop a novel stochastic functional analysis approach for detecting anomalies based on the covariance structure of nominal stochastic behavior across a domain with multi-band vector field data. An optimal vector field Karhunen-Loeve (KL) expansion is applied to such random field data. A series of multilevel orthogonal functional subspaces is constructed from the geometry of the domain, adapted from the KL expansion. Detection is achieved by examining the projection of the random field on the multilevel basis. The anomalies can be quantified in suitable normed spaces based on local and global information. In addition, reliable hypothesis tests are formed with controllable distributions that do not require prior assumptions on probability distributions of the data. Only the covariance function is needed, which makes for significantly simpler estimates. Furthermore this approach allows stochastic vector-based fusion of anomalies without any loss of information. The method is applied to the important problem of deforestation and degradation in the Amazon forest. This is a complex non-monotonic process, as forests can degrade and recover. This particular problem is further compounded by the presence of clouds that are hard to remove with current masking algorithms. Using multi-spectral satellite data from Sentinel 2, the multilevel filter is constructed and anomalies are treated as deviations from the initial state of the forest. Forest anomalies are quantified with robust hypothesis tests and distinguished from false variations such as cloud cover. Our approach shows the advantage of using multiple bands of data in a vectorized complex, leading to better anomaly detection beyond the capabilities of scalar-based methods.
    (Nearly) Optimal Private Linear Regression via Adaptive Clipping. (arXiv:2207.04686v2 [cs.LG] UPDATED)
    We study the problem of differentially private linear regression where each data point is sampled from a fixed sub-Gaussian style distribution. We propose and analyze a one-pass mini-batch stochastic gradient descent method (DP-AMBSSGD) where points in each iteration are sampled without replacement. Noise is added for DP but the noise standard deviation is estimated online. Compared to existing $(\epsilon, \delta)$-DP techniques which have sub-optimal error bounds, DP-AMBSSGD is able to provide nearly optimal error bounds in terms of key parameters like dimensionality $d$, number of points $N$, and the standard deviation $\sigma$ of the noise in observations. For example, when the $d$-dimensional covariates are sampled i.i.d. from the normal distribution, then the excess error of DP-AMBSSGD due to privacy is $\frac{\sigma^2 d}{N}(1+\frac{d}{\epsilon^2 N})$, i.e., the error is meaningful when number of samples $N= \Omega(d \log d)$ which is the standard operative regime for linear regression. In contrast, error bounds for existing efficient methods in this setting are: $\mathcal{O}\big(\frac{d^3}{\epsilon^2 N^2}\big)$, even for $\sigma=0$. That is, for constant $\epsilon$, the existing techniques require $N=\Omega(d\sqrt{d})$ to provide a non-trivial result.
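    A generic DP-SGD style step helps fix ideas. This is a sketch only: DP-AMBSSGD samples mini-batches without replacement and estimates the noise standard deviation online, whereas `clip` and `noise_mult` are fixed, illustrative constants here.

```python
import numpy as np

def private_step(w, per_sample_grads, clip, noise_mult, lr, rng):
    """One private step: clip each per-sample gradient to norm `clip`,
    average, add calibrated Gaussian noise, and descend."""
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    mean_g = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip / len(per_sample_grads),
                       size=w.shape)
    return w - lr * (mean_g + noise)
```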
    Contextual Bandits with Large Action Spaces: Made Practical. (arXiv:2207.05836v1 [cs.LG])
    A central problem in sequential decision making is to develop algorithms that are practical and computationally efficient, yet support the use of flexible, general-purpose models. Focusing on the contextual bandit problem, recent progress provides provably efficient algorithms with strong empirical performance when the number of possible alternatives ("actions") is small, but guarantees for decision making in large, continuous action spaces have remained elusive, leading to a significant gap between theory and practice. We present the first efficient, general-purpose algorithm for contextual bandits with continuous, linearly structured action spaces. Our algorithm makes use of computational oracles for (i) supervised learning, and (ii) optimization over the action space, and achieves sample complexity, runtime, and memory independent of the size of the action space. In addition, it is simple and practical. We perform a large-scale empirical evaluation, and show that our approach typically enjoys superior performance and efficiency compared to standard baselines.
    Conformal prediction for time series. (arXiv:2010.09107v13 [stat.ME] UPDATED)
    We develop a general framework for constructing distribution-free prediction intervals for time series. Theoretically, we establish explicit bounds on conditional and marginal coverage gaps of estimated prediction intervals, which asymptotically converge to zero under additional assumptions. We obtain similar bounds on the size of set differences between oracle and estimated prediction intervals. Methodologically, we introduce a computationally efficient algorithm called EnbPI that wraps around ensemble predictors, which is closely related to conformal prediction (CP) but does not require data exchangeability. EnbPI avoids data-splitting and is computationally efficient by avoiding retraining and thus scalable to sequentially producing prediction intervals. We perform extensive simulation and real-data analyses to demonstrate its effectiveness compared with existing methods.
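    The interval construction at the heart of EnbPI can be sketched as follows. This is simplified: the residuals should be out-of-bag (each computed by ensemble members that did not train on that point), and the residual window slides forward as new ground truth arrives.

```python
import numpy as np

def enbpi_interval(models, X_t, residuals, alpha=0.1):
    """Ensemble point prediction widened by the (1 - alpha) quantile of past
    absolute residuals; no retraining and no exchangeability assumption.
    `models` are any fitted regressors with a .predict method."""
    preds = np.stack([m.predict(X_t) for m in models])  # (B, n) ensemble predictions
    center = preds.mean(axis=0)
    q = np.quantile(np.abs(residuals), 1.0 - alpha)
    return center - q, center + q
```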
    TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels. (arXiv:2207.06343v1 [cs.LG])
    State-of-the-art federated learning methods can perform far worse than their centralized counterparts when clients have dissimilar data distributions. For neural networks, even when centralized SGD easily finds a solution that is simultaneously performant for all clients, current federated optimization methods fail to converge to a comparable solution. We show that this performance disparity can largely be attributed to optimization challenges presented by nonconvexity. Specifically, we find that the early layers of the network do learn useful features, but the final layers fail to make use of them. That is, federated optimization applied to this non-convex problem distorts the learning of the final layers. Leveraging this observation, we propose a Train-Convexify-Train (TCT) procedure to sidestep this issue: first, learn features using off-the-shelf methods (e.g., FedAvg); then, optimize a convexified problem obtained from the network's empirical neural tangent kernel approximation. Our technique yields accuracy improvements of up to +36% on FMNIST and +37% on CIFAR10 when clients have dissimilar data.
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v1 [cs.LG])
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.  ( 2 min )
    Towards understanding how momentum improves generalization in deep learning. (arXiv:2207.05931v1 [cs.LG])
Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well-understood that using momentum can lead to a faster convergence rate in various settings, it has also been observed that momentum yields higher generalization. Prior work argues that momentum stabilizes the SGD noise during training and that this leads to higher generalization. In this paper, we adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial in datasets where the examples share some feature but differ in their margin. Contrary to GD, which memorizes the small-margin data, GD+M still learns the feature in these data thanks to its historical gradients. Lastly, we empirically validate our theoretical findings.  ( 2 min )
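For reference, the heavy-ball (GD+M) update being analyzed is a two-line modification of GD; the buffer of historical gradients is the mechanism the paper credits with preserving the shared feature on small-margin examples.

```python
# Heavy-ball update: the buffer is an exponential average of past gradients.
def gd_momentum_step(w, grad, buf, lr=0.1, beta=0.9):
    buf = beta * buf + grad   # historical gradients keep the shared feature alive
    return w - lr * buf, buf  # step along the accumulated direction
```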
    BR-SNIS: Bias Reduced Self-Normalized Importance Sampling. (arXiv:2207.06364v1 [stat.ML])
    Importance Sampling (IS) is a method for approximating expectations under a target distribution using independent samples from a proposal distribution and the associated importance weights. In many applications, the target distribution is known only up to a normalization constant, in which case self-normalized IS (SNIS) can be used. While the use of self-normalization can have a positive effect on the dispersion of the estimator, it introduces bias. In this work, we propose a new method, BR-SNIS, whose complexity is essentially the same as that of SNIS and which significantly reduces bias without increasing the variance. This method is a wrapper in the sense that it uses the same proposal samples and importance weights as SNIS, but makes clever use of iterated sampling--importance resampling (ISIR) to form a bias-reduced version of the estimator. We furnish the proposed algorithm with rigorous theoretical results, including new bias, variance and high-probability bounds, and these are illustrated by numerical examples.  ( 2 min )
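For context, a minimal sketch of the baseline SNIS estimator that BR-SNIS wraps; the self-normalization step is exactly where the bias enters.

```python
import numpy as np

def snis(f, log_target_unnorm, log_proposal, samples):
    """Self-normalized IS estimate of E_target[f] from proposal samples."""
    log_w = log_target_unnorm(samples) - log_proposal(samples)
    w = np.exp(log_w - log_w.max())   # stabilize before exponentiating
    w /= w.sum()                      # self-normalization: the source of the bias
    return np.sum(w * f(samples))
```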
    Information-theoretic Inducing Point Placement for High-throughput Bayesian Optimisation. (arXiv:2206.02437v2 [cs.LG] UPDATED)
    Sparse Gaussian Processes are a key component of high-throughput Bayesian optimisation (BO) loops -- an increasingly common setting where evaluation budgets are large and highly parallelised. By using representative subsets of the available data to build approximate posteriors, sparse models dramatically reduce the computational costs of surrogate modelling by relying on a small set of pseudo-observations, the so-called inducing points, in lieu of the full data set. However, current approaches to design inducing points are not appropriate within BO loops as they seek to reduce global uncertainty in the objective function. Thus, the high-fidelity modelling of promising and data-dense regions required for precise optimisation is sacrificed and computational resources are instead wasted on modelling areas of the space already known to be sub-optimal. Inspired by entropy-based BO methods, we propose a novel inducing point design that uses a principled information-theoretic criterion to select inducing points. By choosing inducing points to maximally reduce both global uncertainty and uncertainty in the maximum value of the objective function, we build surrogate models able to support high-precision high-throughput BO.  ( 2 min )
    Unsupervised tree boosting for learning probability distributions. (arXiv:2101.11083v6 [stat.ME] UPDATED)
We propose an unsupervised tree boosting algorithm for inferring the underlying sampling distribution of an i.i.d. sample based on fitting additive tree ensembles in a fashion analogous to supervised tree boosting. Integral to the algorithm is a new notion of "addition" on probability distributions that leads to a coherent notion of "residualization", i.e., subtracting a probability distribution from an observation to remove the distributional structure from the sampling distribution of the latter. We show that these notions arise naturally for univariate distributions through cumulative distribution function (CDF) transforms and compositions due to several "group-like" properties of univariate CDFs. While the traditional multivariate CDF does not preserve these properties, a new definition of multivariate CDF can restore these properties, thereby allowing the notions of "addition" and "residualization" to be formulated for multivariate settings as well. This then gives rise to the unsupervised boosting algorithm based on forward-stagewise fitting of an additive tree ensemble, which sequentially reduces the Kullback-Leibler divergence from the truth. The algorithm allows analytic evaluation of the fitted density and outputs a generative model that can be readily sampled from. We enhance the algorithm with scale-dependent shrinkage and a two-stage strategy that separately fits the marginals and the copula. The algorithm then performs competitively to state-of-the-art deep-learning approaches in multivariate density estimation on multiple benchmark datasets.  ( 3 min )
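A toy 1-D illustration of the residualization idea (my reading, not the authors' code): pushing data through a fitted CDF leaves approximately uniform residuals when the fit is good, and structured residuals otherwise, which the next tree in the ensemble can then model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=1000)               # observed sample
fitted = stats.norm(loc=x.mean(), scale=x.std())  # current distribution estimate
residuals = fitted.cdf(x)                         # "subtract" the fit via its CDF
# If the fit were exact, residuals would be Uniform(0,1); leftover structure
# is what the next stage of the ensemble would model.
print(stats.kstest(residuals, "uniform"))
```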
    Video Coding Using Learned Latent GAN Compression. (arXiv:2207.04324v2 [eess.IV] UPDATED)
    We propose in this paper a new paradigm for facial video compression. We leverage the generative capacity of GANs such as StyleGAN to represent and compress a video, including intra and inter compression. Each frame is inverted in the latent space of StyleGAN, from which the optimal compression is learned. To do so, a diffeomorphic latent representation is learned using a normalizing flows model, where an entropy model can be optimized for image coding. In addition, we propose a new perceptual loss that is more efficient than other counterparts. Finally, an entropy model for video inter coding with residual is also learned in the previously constructed latent representation. Our method (SGANC) is simple, faster to train, and achieves better results for image and video coding compared to state-of-the-art codecs such as VTM, AV1, and recent deep learning techniques. In particular, it drastically minimizes perceptual distortion at low bit rates.
    Open set learning with augmented category by exploiting unlabelled data (open-LACU). (arXiv:2002.01368v4 [stat.ML] UPDATED)
    Considering the nature of unlabelled data, it is common for partially labelled training datasets to contain samples that belong to novel categories. Although these so-called observed novel categories exist in the training data, they do not belong to any of the training labels. In contrast, open-sets define novel categories as those unobserved during during training, but present during testing. This research is the first to generalize between observed and unobserved novel categories within a new learning policy called open-set learning with augmented category by exploiting unlabeled data or open-LACU. This study conducts a high-level review on novelty detection so to differentiate between research fields that concern observed novel categories, and the research fields that concern unobserved novel categories. Open-LACU is then introduced as a synthesis of the relevant fields to maintain the advantages of each within a single learning policy. Currently, we are finalising the first open-LACU network which will be combined with this pre-print to be sent for publication.
    FedShuffle: Recipes for Better Use of Local Work in Federated Learning. (arXiv:2204.13169v2 [cs.LG] UPDATED)
    The practice of applying several local updates before aggregation across clients has been empirically shown to be a successful approach to overcoming the communication bottleneck in Federated Learning (FL). In this work, we propose a general recipe, FedShuffle, that better utilizes the local updates in FL, especially in the heterogeneous regime. Unlike many prior works, FedShuffle does not assume any uniformity in the number of updates per device. Our FedShuffle recipe comprises four simple-yet-powerful ingredients: 1) local shuffling of the data, 2) adjustment of the local learning rates, 3) update weighting, and 4) momentum variance reduction (Cutkosky and Orabona, 2019). We present a comprehensive theoretical analysis of FedShuffle and show that both theoretically and empirically, our approach does not suffer from the objective function mismatch that is present in FL methods which assume homogeneous updates in heterogeneous FL setups, e.g., FedAvg (McMahan et al., 2017). In addition, by combining the ingredients above, FedShuffle improves upon FedNova (Wang et al., 2020), which was previously proposed to solve this mismatch. We also show that FedShuffle with momentum variance reduction can improve upon non-local methods under a Hessian similarity assumption. Finally, through experiments on synthetic and real-world datasets, we illustrate how each of the four ingredients used in FedShuffle helps improve the use of local updates in FL.
    How to Train Your Wide Neural Network Without Backprop: An Input-Weight Alignment Perspective. (arXiv:2106.08453v2 [cs.LG] UPDATED)
    Recent works have examined theoretical and empirical properties of wide neural networks trained in the Neural Tangent Kernel (NTK) regime. Given that biological neural networks are much wider than their artificial counterparts, we consider NTK regime wide neural networks as a possible model of biological neural networks. Leveraging NTK theory, we show theoretically that gradient descent drives layerwise weight updates that are aligned with their input activity correlations weighted by error, and demonstrate empirically that the result also holds in finite-width wide networks. The alignment result allows us to formulate a family of biologically-motivated, backpropagation-free learning rules that are theoretically equivalent to backpropagation in infinite-width networks. We test these learning rules on benchmark problems in feedforward and recurrent neural networks and demonstrate, in wide networks, comparable performance to backpropagation. The proposed rules are particularly effective in low data regimes, which are common in biological learning settings.  ( 2 min )
    Online Active Regression. (arXiv:2207.05945v1 [cs.LG])
    Active regression considers a linear regression problem where the learner receives a large number of data points but can only observe a small number of labels. Since online algorithms can deal with incremental training data and take advantage of low computational cost, we consider an online extension of the active regression problem: the learner receives data points one by one and immediately decides whether it should collect the corresponding labels. The goal is to efficiently maintain the regression of received data points with a small budget of label queries. We propose novel algorithms for this problem under $\ell_p$ loss where $p\in[1,2]$. To achieve a $(1+\epsilon)$-approximate solution, our proposed algorithms only require $\tilde{\mathcal{O}}(\epsilon^{-2} d \log(n\kappa))$ queries of labels, where $n$ is the number of data points and $\kappa$ is a quantity, called the condition number, of the data points. The numerical results verify our theoretical results and show that our methods have comparable performance with offline active regression algorithms.  ( 2 min )
    Goal-Oriented Sensitivity Analysis of Hyperparameters in Deep Learning. (arXiv:2207.06216v1 [stat.ML])
    Tackling new machine learning problems with neural networks always means optimizing numerous hyperparameters that define their structure and strongly impact their performances. In this work, we study the use of goal-oriented sensitivity analysis, based on the Hilbert-Schmidt Independence Criterion (HSIC), for hyperparameter analysis and optimization. Hyperparameters live in spaces that are often complex and awkward. They can be of different natures (categorical, discrete, boolean, continuous), interact, and have inter-dependencies. All this makes it non-trivial to perform classical sensitivity analysis. We alleviate these difficulties to obtain a robust analysis index that is able to quantify hyperparameters' relative impact on a neural network's final error. This valuable tool allows us to better understand hyperparameters and to make hyperparameter optimization more interpretable. We illustrate the benefits of this knowledge in the context of hyperparameter optimization and derive an HSIC-based optimization algorithm that we apply on MNIST and Cifar, classical machine learning data sets, but also on the approximation of Runge function and Bateman equations solution, of interest for scientific machine learning. This method yields neural networks that are both competitive and cost-effective.  ( 2 min )
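For concreteness, here is a standard biased empirical HSIC estimator, the quantity the paper's sensitivity index builds on, sketched between one hyperparameter's sampled values and the resulting validation errors; the kernel choice and bandwidth are assumptions.

```python
import numpy as np

def rbf_gram(v, gamma=1.0):
    return np.exp(-gamma * (v[:, None] - v[None, :]) ** 2)

def hsic(x, y, gamma=1.0):
    """Biased empirical HSIC between two 1-D samples of equal length."""
    n = len(x)
    K, L = rbf_gram(x, gamma), rbf_gram(y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# e.g. hsic(sampled_learning_rates, validation_errors) scores how strongly
# that hyperparameter's value is associated with the final error.
```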
    Shrinkage Estimation of Higher Order Bochner Integrals. (arXiv:2207.06357v1 [math.ST])
We consider shrinkage estimation of higher order Hilbert space valued Bochner integrals in a non-parametric setting. We propose estimators that shrink the $U$-statistic estimator of the Bochner integral towards a pre-specified target element in the Hilbert space. Depending on the degeneracy of the kernel of the $U$-statistic, we construct consistent shrinkage estimators with fast rates of convergence, and develop oracle inequalities comparing the risks of the $U$-statistic estimator and its shrinkage version. Surprisingly, we show that the shrinkage estimator designed by assuming complete degeneracy of the kernel of the $U$-statistic is a consistent estimator even when the kernel is not completely degenerate. This work subsumes and improves upon Krikamol et al., 2016, JMLR and Zhou et al., 2019, JMVA, which only handle mean element and covariance operator estimation in a reproducing kernel Hilbert space. We also specialize our results to normal mean estimation and show that for $d\ge 3$, the proposed estimator strictly improves upon the sample mean in terms of the mean squared error.  ( 2 min )
    Long Term Fairness for Minority Groups via Performative Distributionally Robust Optimization. (arXiv:2207.05777v1 [cs.LG])
    Fairness researchers in machine learning (ML) have coalesced around several fairness criteria which provide formal definitions of what it means for an ML model to be fair. However, these criteria have some serious limitations. We identify four key shortcomings of these formal fairness criteria, and aim to help to address them by extending performative prediction to include a distributionally robust objective.  ( 2 min )
    Contextual Decision Trees. (arXiv:2207.06355v1 [stat.ML])
Focusing on Random Forests, we propose a multi-armed contextual bandit recommendation framework for feature-based selection of a single shallow tree of the learned ensemble. The trained system, which works on top of the Random Forest, dynamically identifies a base predictor that is responsible for providing the final output. In this way, we obtain local interpretations by observing the rules of the recommended tree. The experiments carried out reveal that our dynamic method is superior to an independently fitted CART decision tree and comparable to the whole black-box Random Forest in terms of predictive performance.  ( 2 min )
    Constraint-Based Causal Structure Learning from Undersampled Graphs. (arXiv:2205.09235v2 [stat.ML] UPDATED)
Graphical structures estimated by causal learning algorithms from time series data can provide highly misleading causal information if the causal timescale of the generating process fails to match the measurement timescale of the data. Although this problem has been recently recognized, practitioners have limited resources to respond to it, and so must continue using models that they know are likely misleading. Existing methods either (a) require that the difference between causal and measurement timescales is known; or (b) can handle only a very small number of random variables when the timescale difference is unknown; or (c) apply only to pairs of variables, though with fewer assumptions about prior knowledge; or (d) return impractically many solutions. This paper addresses all four challenges. We combine constraint programming with both theoretical insights into the problem structure and prior information about admissible causal interactions. The resulting system provides a practical approach that scales to significantly larger sets (>100) of random variables, does not require precise knowledge of the timescale difference, supports edge misidentification and parametric connection strengths, and can provide the optimum choice among many possible solutions. The cumulative impact of these improvements is a gain of multiple orders of magnitude in speed and informativeness.  ( 3 min )
    Learning Bellman Complete Representations for Offline Policy Evaluation. (arXiv:2207.05837v1 [cs.LG])
We study representation learning for Offline Reinforcement Learning (RL), focusing on the important task of Offline Policy Evaluation (OPE). Recent work shows that, in contrast to supervised learning, realizability of the Q-function is not enough for learning it. Two sufficient conditions for sample-efficient OPE are Bellman completeness and coverage. Prior work often assumes that representations satisfying these conditions are given, with results being mostly theoretical in nature. In this work, we propose BCRL, which directly learns from data an approximately linear Bellman complete representation with good coverage. With this learned representation, we perform OPE using Least Square Policy Evaluation (LSPE) with linear functions in our learned representation. We present an end-to-end theoretical analysis, showing that our two-stage algorithm enjoys polynomial sample complexity provided some representation in the rich class considered is linear Bellman complete. Empirically, we extensively evaluate our algorithm on challenging, image-based continuous control tasks from the Deepmind Control Suite. We show our representation enables better OPE compared to previous representation learning methods developed for off-policy RL (e.g., CURL, SPR). BCRL achieves competitive OPE error with the state-of-the-art method Fitted Q-Evaluation (FQE), and beats FQE when evaluating beyond the initial state distribution. Our ablations show that both the Bellman completeness and coverage components of our method are crucial.  ( 3 min )
    A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP. (arXiv:2207.06147v1 [cs.LG])
    As an important framework for safe Reinforcement Learning, the Constrained Markov Decision Process (CMDP) has been extensively studied in the recent literature. However, despite the rich results under various on-policy learning settings, there still lacks some essential understanding of the offline CMDP problems, in terms of both the algorithm design and the information theoretic sample complexity lower bound. In this paper, we focus on solving the CMDP problems where only offline data are available. By adopting the concept of the single-policy concentrability coefficient $C^*$, we establish an $\Omega\left(\frac{\min\left\{|\mathcal{S}||\mathcal{A}|,|\mathcal{S}|+I\right\} C^*}{(1-\gamma)^3\epsilon^2}\right)$ sample complexity lower bound for the offline CMDP problem, where $I$ stands for the number of constraints. By introducing a simple but novel deviation control mechanism, we propose a near-optimal primal-dual learning algorithm called DPDL. This algorithm provably guarantees zero constraint violation and its sample complexity matches the above lower bound except for an $\tilde{\mathcal{O}}((1-\gamma)^{-1})$ factor. Comprehensive discussion on how to deal with the unknown constant $C^*$ and the potential asynchronous structure on the offline dataset are also included.  ( 2 min )
    Surrogate Likelihoods for Variational Annealed Importance Sampling. (arXiv:2112.12194v2 [stat.ML] UPDATED)
    Variational inference is a powerful paradigm for approximate Bayesian inference with a number of appealing properties, including support for model learning and data subsampling. By contrast MCMC methods like Hamiltonian Monte Carlo do not share these properties but remain attractive since, contrary to parametric methods, MCMC is asymptotically unbiased. For these reasons researchers have sought to combine the strengths of both classes of algorithms, with recent approaches coming closer to realizing this vision in practice. However, supporting data subsampling in these hybrid methods can be a challenge, a shortcoming that we address by introducing a surrogate likelihood that can be learned jointly with other variational parameters. We argue theoretically that the resulting algorithm permits the user to make an intuitive trade-off between inference fidelity and computational cost. In an extensive empirical comparison we show that our method performs well in practice and that it is well-suited for black-box inference in probabilistic programming frameworks.  ( 2 min )
    Probing the Robustness of Independent Mechanism Analysis for Representation Learning. (arXiv:2207.06137v1 [stat.ML])
    One aim of representation learning is to recover the original latent code that generated the data, a task which requires additional information or inductive biases. A recently proposed approach termed Independent Mechanism Analysis (IMA) postulates that each latent source should influence the observed mixtures independently, complementing standard nonlinear independent component analysis, and taking inspiration from the principle of independent causal mechanisms. While it was shown in theory and experiments that IMA helps recovering the true latents, the method's performance was so far only characterized when the modeling assumptions are exactly satisfied. Here, we test the method's robustness to violations of the underlying assumptions. We find that the benefits of IMA-based regularization for recovering the true sources extend to mixing functions with various degrees of violation of the IMA principle, while standard regularizers do not provide the same merits. Moreover, we show that unregularized maximum likelihood recovers mixing functions which systematically deviate from the IMA principle, and provide an argument elucidating the benefits of IMA-based regularization.  ( 2 min )
    Learning Approximately Optimal Contracts. (arXiv:1811.06736v2 [cs.GT] UPDATED)
In principal-agent models, a principal offers a contract to an agent to perform a certain task. The agent exerts a level of effort that maximizes her utility. The principal is oblivious to the agent's chosen level of effort, and conditions her wage only on possible outcomes. In this work, we consider a model in which the principal is unaware of the agent's utility and action space: she sequentially offers contracts to identical agents, and observes the resulting outcomes. We present an algorithm for learning the optimal contract under mild assumptions. We bound the number of samples needed for the principal to obtain a contract that is within $\epsilon$ of her optimal net profit for every $\epsilon>0$. Our results are robust even when considering risk-averse agents. Furthermore, we show that when there are only two possible outcomes or the agent is risk-neutral, the algorithm's outcome approximates the optimal contract described in the classical theory.  ( 2 min )
    Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling. (arXiv:2207.04588v2 [stat.ML] UPDATED)
    Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. When training cross-study replicable prediction models, it is critical to decide between merging and treating the studies separately. We study boosting algorithms in the presence of potential heterogeneity in predictor-outcome relationships across studies and compare two multi-study learning strategies: 1) merging all the studies and training a single model, and 2) multi-study ensembling, which involves training a separate model on each study and ensembling the resulting predictions. In the regression setting, we provide theoretical guidelines based on an analytical transition point to determine whether it is more beneficial to merge or to ensemble for boosting with linear learners. In addition, we characterize a bias-variance decomposition of estimation error for boosting with component-wise linear learners. We verify the theoretical transition point result in simulation and illustrate how it can guide the decision on merging vs. ensembling in an application to breast cancer gene expression data.  ( 2 min )
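A toy illustration (mine, not the paper's analysis) of the two strategies on synthetic heterogeneous studies: merging all studies into one training set versus averaging per-study boosted models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
studies = []
for k in range(3):                                  # 3 studies, shifted coefficients
    X = rng.normal(size=(200, 5))
    beta = np.ones(5) + 0.5 * rng.normal(size=5)    # cross-study heterogeneity
    studies.append((X, X @ beta + rng.normal(size=200)))

X_test = rng.normal(size=(100, 5))
y_test = X_test @ np.ones(5)                        # target: the shared signal

# Strategy 1: merge all studies, train one model
Xm = np.vstack([s[0] for s in studies])
ym = np.concatenate([s[1] for s in studies])
merged = GradientBoostingRegressor(random_state=0).fit(Xm, ym).predict(X_test)

# Strategy 2: ensemble of per-study models
ens = np.mean([GradientBoostingRegressor(random_state=0).fit(X, y).predict(X_test)
               for X, y in studies], axis=0)
print(np.mean((merged - y_test) ** 2), np.mean((ens - y_test) ** 2))
```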
    Jackknife Variability Estimation For Randomized Matrix Computations. (arXiv:2207.06342v1 [math.NA])
    Randomized algorithms based on sketching have become a workhorse tool in low-rank matrix approximation. To use these algorithms safely in applications, they should be coupled with diagnostics to assess the quality of approximation. To meet this need, this paper proposes a jackknife resampling method to estimate the variability of the output of a randomized matrix computation. The variability estimate can recognize that a computation requires additional data or that the computation is intrinsically unstable. As examples, the paper studies jackknife estimates for two randomized low-rank matrix approximation algorithms. In each case, the operation count for the jackknife estimate is independent of the dimensions of the target matrix. In numerical experiments, the estimator accurately assesses variability and also provides an order-of-magnitude estimate of the mean-square error.  ( 2 min )
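A naive sketch of the jackknife idea applied to a randomized range finder (the paper's estimators avoid this brute-force leave-one-out recomputation): delete one sketch column at a time and report the spread of the resulting approximations as the variability estimate.

```python
import numpy as np

def jackknife_variability(A, k=10, p=5, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Y = A @ rng.standard_normal((n, k + p))        # Gaussian sketch of A's range
    replicates = []
    for j in range(k + p):                         # leave one sketch column out
        Q, _ = np.linalg.qr(np.delete(Y, j, axis=1))
        replicates.append(Q @ (Q.T @ A))           # low-rank approximation
    Abar = np.mean(replicates, axis=0)
    return np.sqrt(np.mean([np.linalg.norm(R - Abar, "fro") ** 2
                            for R in replicates]))
```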
    Employing Feature Selection Algorithms to Determine the Immune State of Mice with Rheumatoid Arthritis. (arXiv:2207.05882v1 [stat.ML])
    The immune response is a dynamic process by which the body determines whether an antigen is self or nonself. The state of this dynamic process is defined by the relative balance and population of inflammatory and regulatory actors which comprise this decision making process. The goal of immunotherapy as applied to, e.g. Rheumatoid Arthritis (RA), then, is to bias the immune state in favor of the regulatory actors - thereby shutting down autoimmune pathways in the response. While there are several known approaches to immunotherapy, the effectiveness of the therapy will depend on how this intervention alters the evolution of this state. Unfortunately, this process is determined not only by the dynamics of the process, but the state of the system at the time of intervention - a state which is difficult if not impossible to determine prior to application of the therapy.  ( 2 min )
    Contextual Bandits with Smooth Regret: Efficient Learning in Continuous Action Spaces. (arXiv:2207.05849v1 [cs.LG])
    Designing efficient general-purpose contextual bandit algorithms that work with large -- or even continuous -- action spaces would facilitate application to important scenarios such as information retrieval, recommendation systems, and continuous control. While obtaining standard regret guarantees can be hopeless, alternative regret notions have been proposed to tackle the large action setting. We propose a smooth regret notion for contextual bandits, which dominates previously proposed alternatives. We design a statistically and computationally efficient algorithm -- for the proposed smooth regret -- that works with general function approximation under standard supervised oracles. We also present an adaptive algorithm that automatically adapts to any smoothness level. Our algorithms can be used to recover the previous minimax/Pareto optimal guarantees under the standard regret, e.g., in bandit problems with multiple best arms and Lipschitz/H{\"o}lder bandits. We conduct large-scale empirical evaluations demonstrating the efficacy of our proposed algorithms.  ( 2 min )

  • Open

    [R] How to learn imbalanced data arising from multiple domains?
Hello everyone! Happy to share our new work on learning from multi-domain imbalanced data. This work was recently accepted at ECCV 2022. Data imbalance is ubiquitous and inherent in the real world. Existing methods for dealing with imbalanced data/long-tailed distributions only consider a single domain, that is, the data originates from the same domain; however, natural data can originate from distinct domains, where a minority class in one domain could have abundant instances in other domains. Effectively utilizing data from different domains is likely to improve the performance of long-tail learning over all domains. This paper extends the paradigm of the traditional imbalanced classification problem, generalizing it from a single domain to multiple domains. We formulate the problem of …  ( 89 min )
    [P] Introducing BentoML 1.0 - A faster way to ship your models to production
Hi everyone! I'm excited to share some news from the BentoML team. When we first open sourced the BentoML project in 2019 and shared it with the community, our vision was to create an open platform that simplifies machine learning model serving and provides a solid foundation for ML teams to operate ML at production scale. And after years of working together with our community towards that goal, we’re thrilled to announce the general availability of BentoML 1.0! What's new in BentoML 1.0? Simplified model packaging and management, both locally and in a centralized model repository for teams. A Python-first architecture that scales with powerful optimizations, including parallel inference, adaptive batching, and support for accelerated runtimes. Introducing Yatai for BentoML: a production-first ML platform on Kubernetes. To learn more: Introducing BentoML 1.0 Blog post: https://modelserving.com/blog/introducing-bentoml-10 BentoML Tutorial: https://docs.bentoml.org/en/latest/tutorial.html Github Page: https://github.com/bentoml/BentoML Documentation: https://docs.bentoml.org/ submitted by /u/chaoyu [link] [comments]  ( 88 min )
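For readers who want a feel for the 1.0 API, here is a minimal service sketch based on my reading of the linked tutorial; treat the model tag and names as illustrative.

```python
# service.py -- minimal BentoML 1.0-style service; assumes a model was saved
# earlier with bentoml.sklearn.save_model("iris_clf", trained_model).
import bentoml
from bentoml.io import NumpyNdarray

runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array):
    return runner.predict.run(input_array)
```

Per the tutorial, this would then be served locally with `bentoml serve service.py:svc`.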
    [N] Andrej Karpathy is leaving Tesla
    Twitter thread: https://twitter.com/karpathy/status/1547332300186066944 submitted by /u/EffectSizeQueen [link] [comments]  ( 92 min )
    [D] I made a site for collaborative image labeling
    I recently launched https://mekabytes.com. The idea is to treat datasets like subreddits where users can come together to build the stuff they want to see. For the datasets there is a github-style landing page with a README to help give guidance on the goals, what images the dataset wants, and any labeling guidelines. There is also a reddit-style comment system where you can reference specific annotations. The idea with that is to provide feedback to help people learn. The coolest part (IMO) is the versioning system. All annotations are versioned and approved by a moderator, gating data quality kind of like a code review. This versioning allows the dataset to be rolled back to any point in time which will help reproduce research even as the dataset continues to evolve. The dataset releases will be open under a creative commons license (BY-NC-SA). To help cover hosting the releases are downloadable for $5 + $1/GB. Basically you can use it for research, personal projects, and share freely once you have it. There is still a ton of stuff to do and I don't even have my first user yet! I've been using it for the last week or so and cleaning up the UX. You can actually annotate decently on mobile. Right now it supports classification and object detection (bounding boxes). I hope to add a free text field in the near future after some niceties like pagination and comment notifications. I would love some feedback if you have any! submitted by /u/tacixat [link] [comments]  ( 88 min )
    30% of Google's Reddit Emotions Dataset is Mislabeled [D]
Last year, Google released their Reddit Emotions dataset: a collection of 58K Reddit comments human-labeled according to 27 emotions. I analyzed the dataset... and found that 30% of it is mislabeled! Some of the errors: *aggressively tells friend I love them\* – mislabeled as ANGER Yay, cold McDonald's. My favorite. – mislabeled as LOVE Hard to be sad these days when I got this guy with me – mislabeled as SADNESS Nobody has the money to. What a joke – mislabeled as JOY I wrote a blog about it here, with more examples and my main two suggestions for how to fix Google's data annotation methodology. submitted by /u/BB4evaTB12 [link] [comments]  ( 92 min )
    [D] How are People Doing “Fair” Few-Shot Training/Evaluation
After reading through a lot of the popular non-Meta-Learning few-shot literature (Prototypical Nets, Matching Nets, etc.) and then looking at other papers/GitHub repos, I'm not totally sure how to build a "fair" training and evaluation setup. Let's take CIFAR-100 (ignoring CIFAR-FS for now). To set up a few-shot dataset split, I'd take the 100 classes and split them into train/val/test 60/20/40 such that each split has non-overlapping classes - pretty straightforward. But now, I still have 600 examples per class in all splits. Before generating random 5-way-5-shot episodes during training, what's the fair way to generate Support and Query Sets? Are people first creating another split of the trainset so that the Support set only contains 5 examples per class (60*5=300 total examples) and the rest is in the Query set? If not something like that, then the support set is going to contain a lot of examples to learn from rather than a few. Some methods also directly classify the trainset's support images for pre-training, assuming that the number of classes overall is known beforehand. But then, to do the same on the validation and test sets, I guess they replace the FC layer. Finally, when choosing a pre-trained model to start with, it seems absolutely necessary to choose a significantly different domain for evaluation (e.g. an ImageNet pre-trained ResNet evaluated on CIFAR-FS is bad). tldr; it seems like there are a lot of small differences in experimental setups for few-shot settings - what's the best way to be fair for training/evaluation? Also, maybe I'm just totally missing something :) submitted by /u/rivew [link] [comments]  ( 89 min )
    [N] [CFP] Order Up! A workshop on higher-order optimization in ML
Hello all! Since NeurIPS 2022 workshop decisions were recently released, we are proud to announce our 2022 workshop focused on higher-order optimization in machine learning! An (under construction) homepage can be found here. Topics include: Higher-order optimizers, Adaptive gradient methods, Quasi-Newton techniques, and many more! The workshop will run for one day in-person at NeurIPS 2022. There will be dedicated poster and spotlight sessions, including a dedicated junior researcher poster session with an aim to connect junior researchers to more senior ones. We also feature 5 plenary talks from researchers, namely Amir Gholami, Coralia Cartis, Donald Goldfarb, Frank E. Curtis, and Madeleine Udell. We aim to provide each submission with 3 reviews. Paper submission will open soon, and can be found at this link. I am happy to answer any questions, so feel free to DM or comment! Thanks. submitted by /u/order-up-workshop [link] [comments]  ( 88 min )
    [P] Build a Machine Translation System with Forte
TLDR: This tutorial allows you to build a machine translation system with no glue code using Forte, an open source ML workflow builder. Forte makes it easy to compose any NLP pipeline, regardless of the heterogeneity of data and processes, as a modular and easily editable system. It allows users to break down complex problems into composable pipelines and enables inter-operations across tasks through a unified data format. This tutorial includes: 1 — How to read data from the source: creating a simple NLP pipeline; maintaining and storing the input data. 2 — How to process data in the pipeline: performing sentence segmentation; annotating and querying the data; translating the input text with a pre-trained model; managing multiple data objects. 3 — How to handle ne…  ( 106 min )
    [D] Ensemble regression model - based on models trained on different feature spaces
    What is the best method for constructing an ensemble regression model from numerous KNN regression models that were trained on slightly different feature spaces? I can't only use the features that they have in common. submitted by /u/Rafaelkoll [link] [comments]  ( 87 min )
    [D] When will Neurips 2022 reviews be released?
I can't recall what day reviews have been released the last couple of years. I know that the review period is closed, and so it's only a matter of time; just wondering if anyone has any idea? submitted by /u/AbjectDrink3276 [link] [comments]  ( 88 min )
    [News] Jupyter Notebook competition - 2 weeks left to enter!
Are you passionate about #coding, #DataScience or #EarthObservation? Don't miss out on the chance to showcase your skills and develop new Jupyter Notebooks using #Copernicus data, whilst also being in with a chance of winning cash prizes! Sign up before 31 July at: https://notebook.wekeo.eu/ submitted by /u/EUMETSAT [link] [comments]  ( 87 min )
    [R] Inner Monologue: Embodied Reasoning through Planning with Language Models
    submitted by /u/red75prime [link] [comments]  ( 87 min )
    [D] How best to handle a column that can hold multiple, unbounded number of values?
Say I have an email dataset. Two of its columns are "sender" and "recipients". Now, the "sender" column will only hold one value in each row. However, "recipients" can hold anywhere from 1 to 100 values, or theoretically even more. In such a scenario, one-hot encoding is not a tractable solution, and neither is creating a new row for each unique recipient. So, how best to handle this situation? submitted by /u/ResearcherNo4728 [link] [comments]  ( 89 min )
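One common answer, sketched here: treat "recipients" as a set-valued feature and multi-hot encode it, so the representation grows with the number of distinct recipients rather than with the number of combinations; a sparse matrix keeps this tractable. The sample rows are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

rows = [["alice@x.com"],
        ["bob@x.com", "carol@x.com"],
        ["alice@x.com", "dan@x.com"]]

mlb = MultiLabelBinarizer(sparse_output=True)  # sparse matters with many recipients
X = mlb.fit_transform(rows)                    # (n_rows, n_unique_recipients)
print(mlb.classes_)
print(X.toarray())                             # each row is a multi-hot vector
```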
    [R] So someone actually peer-reviewed this and thought "yeah, looks good"?
    It looks like chronic kidney disease diagnosis has been solved in this paper: https://ieeexplore.ieee.org/document/8693581 I mean no disrespect to the authors, but this publication makes me slightly doubt the peer-review system. Or I am just such an amateur, that I am not seeing the brilliance behind this paper, which is also possible. Have a read through it yourselves submitted by /u/fanconic [link] [comments]  ( 97 min )
    [D] Labeling novel view synthesis for object detection
Hey all, I've been following the exciting progress of NeRFs, and it led me to wonder whether there is research on generating novel 2D views from a 3D representation and labeling those examples. I find works on image classification under the Novel View Synthesis topic, but for object detection I just can't find anything. Wouldn't it be possible to label 2D training images, construct a 3D representation, and use it to generate novel 2D views with corresponding labelings? I see this as highly useful for the object detection domain, where labeling often requires a lot of manual work, leading to small datasets and non-robust object representations. Please note if I'm missing something here. submitted by /u/TemppaHemppa [link] [comments]  ( 88 min )
    [D] tranfer learning with freezing vs unfreezing
Hi, I have been trying to test self-supervised representation learning on a vision task, in more detail, testing BYOL on CIFAR-10. I found that the trick is to throw away the last layer, put in a new layer matching the output shape, and keep the backbone network frozen during finetuning. I know that a bad last layer can harm the backbone network during finetuning, because the network is highly sensitive to even small changes in parameter space. But I tried finetuning without freezing, and it shows better final performance (82% -> 90% test accuracy). So why did they freeze the backbone network and show the results of that experiment? How can I explain this phenomenon? Thank you for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 88 min )
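The two regimes being compared, sketched in PyTorch (assuming a ResNet-style backbone with a `.fc` head; adapt the attribute name to your model): linear evaluation with a frozen backbone versus full fine-tuning. The standard SSL benchmark protocol freezes the backbone to measure representation quality alone, which is why papers report it even when full fine-tuning scores higher.

```python
import torch.nn as nn

def prepare(backbone, num_classes=10, freeze=True):
    for p in backbone.parameters():
        p.requires_grad = not freeze   # freeze=True => linear evaluation
    # replace the old head; the fresh layer is trainable either way
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
    return backbone
```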
    Why do Transformers scale so well? [D]
    When you hear people talk about large models, they're usually talking about transformers. What about this architecture has allowed it to be scaled? Have people tried making really large CNNs or RNNs (or just regular MLPs) before? submitted by /u/Adolphins [link] [comments]  ( 92 min )
  • Open

    How does SimSwap (1 image Face Swap tech) work without training?
    SimSwap (https://github.com/neuralchen/SimSwap) is basically a framework that carries out face-swapping in a similar way deepfake technology does with a source and a target video. However, for the source, only one image is required. Not sure how this would work since 1 image isn't enough for actual training. Is this simply face mapping? I feel like the output is a bit too sophisticated for that. submitted by /u/thr0away89 [link] [comments]  ( 86 min )
    Not of This World | Cinematic 4K 24 FPS (FILM)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 86 min )
    Live developer workshop on how to generate and use synthetic text
    submitted by /u/Repeat-or [link] [comments]  ( 86 min )
Hello fellow researchers, I'm in a bit of a pickle and require your help. Could one of you get in contact with me for a quick interview, at any time, regarding the risk of value destruction through the use of artificial intelligence and machine learning? You will have complete anonymity. Thank you
    submitted by /u/Normal-Opportunity33 [link] [comments]  ( 86 min )
    AI Dream 45 - Exploring the Perfect endless Garden
    submitted by /u/LordPewPew777 [link] [comments]  ( 86 min )
    is there any "image to text" ai?
    for example that writes a description of an image or an article based on it. submitted by /u/jose3001 [link] [comments]  ( 86 min )
    Deepmind PLATO: Disappointed expectations and their relevance for physics
    submitted by /u/much_successes [link] [comments]  ( 85 min )
    Colossal-AI, A Unified Deep Learning System for Big Models, Seamlessly Accelerates Large Models at Low Costs with Hugging Face​
    In recent years, the outstanding performance of model scaling has led to an escalation in the size of pre-trained models. Unfortunately, training and even simply fine-tuning large AI models are usually unaffordable, requiring tens or hundreds of GPUs. Existing deep learning frameworks like PyTorch and Tensorflow may not offer a satisfactory solution for very large AI models. Furthermore, advanced knowledge of AI systems is typically required for sophisticated configurations and optimization of specific models. Therefore, many AI users, such as engineers from small and medium-sized enterprises, can’t help but feel overwhelmed by the emergence of large AI models. Accelerate Large Model OPT with Low Cost About Open Pretrained Transformer (OPT) Meta recently released Open Pretrained Transformer (OPT), a 175-Billion parameter AI language model. To encourage AI democratization in the community, Meta has released both the code and trained model weights, which stimulates AI programmers to perform various downstream tasks and application deployments. We will now demonstrate fine-tuning Casual Language Modelling with pre-training weights of the OPT model provided by Hugging Face Hub. Configure with Colossal-AI It is very simple to use the powerful features of Colossal-AI. Users only need a simple configuration file, and are not required to alter their training logic to equip models with their desired features (e.g. mixed-precision training, gradient accumulation, multi-dimensional parallel training, and memory redundancy elimination). Continue reading | Github submitted by /u/ai-lover [link] [comments]  ( 87 min )
    Looking for an AI that i can feed stories to and it will produce a similar story.
    submitted by /u/lxknvlk [link] [comments]  ( 86 min )
    AI vs Datascience
Are AI and Data Science blooming fields in the US? I'd like to pursue my master's in the US in either AI or Data Science. Which stream is better, and why? submitted by /u/Consistent-Flow-414 [link] [comments]  ( 86 min )
    Made a synopsis of my idea of anime "Alter Ego"
    submitted by /u/Mrhelloistaken [link] [comments]  ( 85 min )
    CVPR 2022 Best Paper Honorable Mention: Dual-Shutter Optical Vibration Sensing
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 86 min )
    Artificial Intelligence companies honoured at World’s Top 50 Innovators 2022 in London
    submitted by /u/chelsea_bear [link] [comments]  ( 86 min )
    Using AI to create recipes from the only food in the house (copy.ai GPT3 powered)
I wanted to see how creative GPT-3 could get in creating a recipe for a meal with all I could find in the house. I was... kind of disappointed. The prompt: a recipe for a meal with lemons, hot dog sausages, ice cream and coconut. The AI-generated recipe: This recipe is so easy, it's almost too good to be true. What you'll need: - 1 lemon - 5 hot dog sausages (If you use vegetarian sausages, the meal will be completely vegan!) - 1 scoop of ice cream (We recommend coconut flavor, it goes so well with the lemons and sausages!) - a pinch of salt (optional) Preparation: Peel the lemon and cut it into slices. Put the hot dog sausages in a pan and fry them until they are browned on both sides. You can add some salt to taste if you like! Add the lemon slices to the pan and cook them until they are golden brown as well. Make sure that they don't burn! This would ruin your meal! You can add some salt again if needed (but try not to add too much). Finally, take everything out of the pan, put it on a plate and top it with ice cream! submitted by /u/No_Condition4115 [link] [comments]  ( 88 min )
    Are all my efforts in vain
    I’ve spent thousands of hours and many years building up my skills for the sole purpose of getting a job as a concept artist. Was all that in vain? With AI we have the tools to create artwork in a fraction of the time it takes a human and they will only get better. I am extremely excited for the future of the industry, but I need to know how much of my life I’ve wasted. I’m having a bit of an existential crisis. submitted by /u/giantpokimanestatue [link] [comments]  ( 90 min )
    Realistic Synthetic Video Avatars (text to video)
I've been looking into Synthetic Media, specifically AI spokespeople and AI-generated video avatars, which, whilst maybe not as exciting as DALL-E, still have some powerful applications. I've found the examples below. Wondering if anybody has come across any useful Git pages or Colab notebooks in this domain. I can't seem to find detail on the specific models being used; I'm assuming they're GAN models. I'd like to be able to explore further without having to pay $3 per minute of generated video and being capped at 10 minutes a month. https://www.colossyan.com/ Movio - AI Spokesperson Video Creator https://talkingavatar.la/ https://www.rephrase.ai/ https://aistudios.com/ https://synthesys.io/ Create - adam2eve.ai https://www.deepword.co/ submitted by /u/No_Condition4115 [link] [comments]  ( 86 min )
What Does Artificial Intelligence Mean? How Does AI Work?
    submitted by /u/Maruf2014 [link] [comments]  ( 86 min )
9 Best Artificial Intelligence Books for Beginners to Experts to Read in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 84 min )
  • Open

    Full Lecture Now Available on YouTube - Stanford CS25 l Transformers United - Decision Transformer: Reinforcement Learning via Sequence Modeling: Aditya Grover of UCLA
    In this seminar Aditya introduces a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. Watch on YouTube. submitted by /u/Stanford_Online [link] [comments]  ( 86 min )
    "Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents", Huang et al 2022 {G}
    submitted by /u/gwern [link] [comments]  ( 86 min )
    "Inner Monologue: Embodied Reasoning through Planning with Language Models", Huang et al 2022 {G} (extending SayCan PaLM robotics with feedback)
    submitted by /u/gwern [link] [comments]  ( 86 min )
    Hebbian learning is enough for AGI
Impact Maximization via Hebbian Learning is an approach to AGI and ASI. The approach posits three main points. 1) Making an impact is the objective function of all life forms, and so the objective function of AGI is to maximise impact. Living things have more impact potential than non-living things, and highly intelligent beings have even more potential for impact. Anything we do is a kind of impact, whether it is self-preservation, procreation, meme propagation, DDAO, or other derived objectives, and so maximising impact is what an AGI system should do. 2) Impact maximization can happen this way – if an agent relaxes (suspends output action) while perceiving something impactful, and acts when it perceives a lack of impact/novelty/interest, so as to bring about a change in the environment, it…  ( 89 min )
    Hindsight experience replay in Vectorised environments
Hi there, I've been using StableBaselines3 with multiprocessing (SubprocVecEnv); however, when I add my HER replay buffer it all breaks down. I get the error: ValueError: could not broadcast input array from shape (4,5) into shape (5,), where 4 is the number of environments and 5 is the action size. Thanks for any advice :) submitted by /u/SuperDuperDooken [link] [comments]  ( 87 min )
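For reference, a single-environment HER setup of the kind the SB3 docs show; as far as I know, the 1.x HerReplayBuffer assumed a single non-vectorized goal env, so dropping SubprocVecEnv is the usual workaround (check the docs/issues for the version you run). The env ID is illustrative.

```python
import gym
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make("FetchReach-v1")   # any dict-observation goal-conditioned env
model = SAC(
    "MultiInputPolicy",           # handles obs/achieved_goal/desired_goal dicts
    env,                          # note: a single env, not a SubprocVecEnv
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
)
model.learn(10_000)
```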
  • Open

    The Business Impact of Robotic Process Automation
    In this interview, I spoke with Husan Mahey, author of “Robotic Process Automation with Automation Anywhere,” where he outlines step-by-step the process for setting up automation in a business setting. Robotic Process Automation is a tool that allows users to automate repetitive tasks that would normally be done by a human. These sorts of tedious… Read More »The Business Impact of Robotic Process Automation The post The Business Impact of Robotic Process Automation appeared first on Data Science Central.  ( 20 min )
  • Open

    Rewriting Image Captions for Visual Question Answering Data Creation
    Posted by Soravit Beer Changpinyo and Doron Kukliansky‎, Senior Software Engineers, Google Research Visual Question Answering (VQA) is a useful machine learning (ML) task that requires a model to answer a visual question about an image. What makes it challenging is its multi-task and open-ended nature; it involves solving multiple technical research questions in computer vision and natural language understanding simultaneously. Yet, progress on this task would enable a wide range of applications, from assisting the blind and the visually-impaired or communicating with robots to enhancing the user’s visual experience with external knowledge. Effective and robust VQA systems cannot exist without high-quality, semantically and stylistically diverse large-scale training data of image-questio…  ( 21 min )
  • Open

    Reality check !
Hello experts, I am trying to make a small-scale neural cryptography application. I would like to know (a) if it is feasible to demonstrate this (proof of concept) using my home system, and (b) whether it will require pro coding standards; I am an intermediate coder. Thanks in anticipation. submitted by /u/Ashamed-Association3 [link] [comments]  ( 86 min )
    Master thesis Neural Networks
    I need to write a MSc thesis for Faculty of Computer Science related to Neural Networks. I am interested in Finance/Economics. At the beginning I started to read about stock return prediction/portfolio selection, but because everyone is doing it, I would like to research something different. What else Economics/Finance related can I write a thesis about? submitted by /u/AnyJello605 [link] [comments]  ( 87 min )
  • Open

    Artificial intelligence in medical diagnosis: methods, algorithms and applications
    Artificial intelligence (AI) has become synonymous with assistance and efficiency. From a technology that was looked at with mistrust as…  ( 10 min )
  • Open

    Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations. (arXiv:2104.04797v5 [cs.DC] UPDATED)
    Machine learning (ML)-based steering can improve the performance of ensemble-based simulations by allowing for online selection of more scientifically meaningful computations. We present DeepDriveMD, a framework for ML-driven steering of scientific simulations that we have used to achieve orders-of-magnitude improvements in molecular dynamics (MD) performance via effective coupling of ML and HPC on large parallel computers. We discuss the design of DeepDriveMD and characterize its performance. We demonstrate that DeepDriveMD can achieve between 100-1000x acceleration for protein folding simulations relative to other methods, as measured by the amount of simulated time performed, while covering the same conformational landscape as quantified by the states sampled during a simulation. Experiments are performed on leadership-class platforms on up to 1020 nodes. The results establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios, that supports diverse MD simulation and ML back-ends, and which enables new scientific insights by improving the length and time scales accessible with current computing capacity.  ( 3 min )
    Autoencoding Conditional GAN for Portfolio Allocation Diversification. (arXiv:2207.05701v1 [q-fin.PM])
Over the decades, the Markowitz framework has been used extensively in portfolio analysis, though it puts too much emphasis on the analysis of market uncertainty rather than on trend prediction. Generative adversarial networks (GANs) and conditional GANs (CGANs) have been explored to generate financial time series and extract features that can help portfolio analysis. The limitation of the CGAN framework lies in putting too much emphasis on generating series rather than on keeping the features that can help the generator. In this paper, we introduce an autoencoding CGAN (ACGAN) based on deep generative models that learns the internal trend of historical data while modeling market uncertainty and future trends. We evaluate the model on several real-world datasets from both the US and European markets, and show that the proposed ACGAN model leads to better portfolio allocation and generates series that are closer to the true data compared to the existing Markowitz and CGAN approaches.  ( 2 min )
    AGBoost: Attention-based Modification of Gradient Boosting Machine. (arXiv:2207.05724v1 [cs.LG])
A new attention-based model for the gradient boosting machine (GBM), called AGBoost (attention-based gradient boosting), is proposed for solving regression problems. The main idea behind the proposed AGBoost model is to assign attention weights with trainable parameters to iterations of GBM, under the condition that decision trees are the base learners in GBM. Attention weights are determined by applying properties of decision trees and by using Huber's contamination model, which provides an interesting linear dependence between the trainable parameters of the attention and the attention weights. This peculiarity allows us to train the attention weights by solving a standard quadratic optimization problem with linear constraints. The attention weights also depend on the discount factor as a tuning parameter, which determines how much the impact of a weight decreases with the number of iterations. Numerical experiments performed with two types of base learners, original decision trees and extremely randomized trees, on various regression datasets illustrate the proposed model.  ( 2 min )
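A loose sketch of the idea (my reading, not the authors' quadratic-programming formulation): learn nonnegative weights over per-iteration GBM predictions on held-out data, instead of weighting iterations uniformly.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
stage = np.stack(list(gbm.staged_predict(X_val)), axis=1)  # (n_val, 50)
w, _ = nnls(stage, y_val)      # nonnegative "attention" over boosting iterations
pred_val = stage @ w           # attention-weighted prediction
```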
    PAC Reinforcement Learning for Predictive State Representations. (arXiv:2207.05738v1 [cs.LG])
In this paper we study online Reinforcement Learning (RL) in partially observable dynamical systems. We focus on the Predictive State Representations (PSRs) model, an expressive model that captures other well-known models such as Partially Observable Markov Decision Processes (POMDPs). PSRs represent states using a set of predictions of future observations and are defined entirely in terms of observable quantities. We develop a novel model-based algorithm for PSRs that can learn a near-optimal policy with sample complexity scaling polynomially in all the relevant parameters of the system. Our algorithm naturally works with function approximation to extend to systems with potentially large state and observation spaces. We show that, given a realizable model class, the sample complexity of learning the near-optimal policy scales polynomially with the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces. Notably, ours is the first work to show polynomial sample complexities for competing with the globally optimal policy in PSRs. Finally, we demonstrate how our general theorem can be directly used to derive sample complexity bounds for special models, including $m$-step weakly revealing and $m$-step decodable tabular POMDPs, POMDPs with low-rank latent transition, and POMDPs with linear emission and latent transition.  ( 2 min )
    Improved Batching Strategy For Irregular Time-Series ODE. (arXiv:2207.05708v1 [cs.LG])
Irregular time series data are prevalent in the real world and are challenging to model with a simple recurrent neural network (RNN). Hence, a model combining ordinary differential equations (ODE) and RNNs was proposed (ODE-RNN) to model irregular time series with higher accuracy, but it suffers from high computational cost. In this paper, we propose an improvement in the runtime of ODE-RNNs by using a more efficient batching strategy. Our experiments show that the new models reduce the runtime of ODE-RNN significantly, by factors ranging from 2x up to 49x depending on the irregularity of the data, while maintaining comparable accuracy. Hence, our model can scale favorably for modeling larger irregular datasets.  ( 2 min )
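The abstract does not spell out the strategy, but one plausible way batching can cut ODE solver cost is to group sequences with similar observation grids, so the union of time points per batch stays small. A hedged sketch of that idea, with illustrative names:

```python
from collections import defaultdict

def bucket_by_grid(sequences, batch_size=32):
    """sequences: list of (times, values) pairs with irregular `times`.

    Groups sequences by a crude similarity key (number of observations)
    so each batch's union time grid, which the ODE solver must step
    through, stays small."""
    buckets = defaultdict(list)
    for times, values in sequences:
        buckets[len(times)].append((times, values))
    batches = []
    for group in buckets.values():
        for i in range(0, len(group), batch_size):
            batch = group[i:i + batch_size]
            union = sorted({t for times, _ in batch for t in times})
            batches.append((batch, union))  # solver integrates over `union` only
    return batches
```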
    Machine Learning model for gas-liquid interface reconstruction in CFD numerical simulations. (arXiv:2207.05684v1 [physics.flu-dyn])
The volume of fluid (VoF) method is widely used in multi-phase flow simulations to track and locate the interface between two immiscible fluids. A major bottleneck of the VoF method is the interface reconstruction step, due to its high computational cost and low accuracy on unstructured grids. We propose a machine-learning-enhanced VoF method based on Graph Neural Networks (GNN) to accelerate interface reconstruction on general unstructured meshes. We first develop a methodology to generate a synthetic dataset based on paraboloid surfaces discretized on unstructured meshes. We then train a GNN-based model and perform generalization tests. Our results demonstrate the efficiency of a GNN-based approach for interface reconstruction in multi-phase flow simulations in an industrial context.  ( 2 min )
    Bayesian Experimental Design for Computed Tomography with the Linearised Deep Image Prior. (arXiv:2207.05714v1 [cs.CV])
    We investigate adaptive design based on a single sparse pilot scan for generating effective scanning strategies for computed tomography reconstruction. We propose a novel approach using the linearised deep image prior. It allows incorporating information from the pilot measurements into the angle selection criteria, while maintaining the tractability of a conjugate Gaussian-linear model. On a synthetically generated dataset with preferential directions, linearised DIP design allows reducing the number of scans by up to 30% relative to an equidistant angle baseline.  ( 2 min )
    HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle. (arXiv:2207.05477v1 [cs.DC])
Accurate protein structure prediction can significantly accelerate the development of the life sciences. The accuracy of AlphaFold2, a frontier end-to-end structure prediction system, is already close to that of experimental determination techniques. Due to the complex model architecture and large memory consumption, it requires substantial computational resources and time to implement the training and inference of AlphaFold2 from scratch. The cost of running the original AlphaFold2 is prohibitive for most individuals and institutions, so reducing this cost could accelerate the development of the life sciences. We implement AlphaFold2 using PaddlePaddle, namely HelixFold, to improve training and inference speed and reduce memory consumption. The performance is improved by operator fusion, tensor fusion, and hybrid parallelism computation, while the memory is optimized through recomputation, BFloat16, and in-place memory read/write. Compared with the original AlphaFold2 (implemented in Jax) and OpenFold (implemented in PyTorch), HelixFold needs only 7.5 days to complete the full end-to-end training, and only 5.3 days when using hybrid parallelism, while both AlphaFold2 and OpenFold take about 11 days; HelixFold thus roughly halves the training time. We verified that HelixFold's accuracy is on par with AlphaFold2 on the CASP14 and CAMEO datasets. HelixFold's code is available on GitHub for free download: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold, and we also provide stable web services on https://paddlehelix.baidu.com/app/drug/protein/forecast.  ( 3 min )
    A Machine Learning Data Fusion Model for Soil Moisture Retrieval. (arXiv:2206.09649v2 [physics.ao-ph] UPDATED)
We develop a deep-learning-based convolutional-regression model that estimates the volumetric soil moisture content in the top ~5 cm of soil. Input predictors include Sentinel-1 (active radar), Sentinel-2 (optical imagery), and SMAP (passive radar), as well as geophysical variables from SoilGrids and modelled soil moisture fields from GLDAS. The model was trained and evaluated on data from ~1300 in-situ sensors globally over the period 2015-2021 and obtained an average per-sensor correlation of 0.727 and ubRMSE of 0.054, and can be used to produce a soil moisture map at a nominal 320 m resolution. These results are benchmarked against 13 other soil moisture works at different locations, and an ablation study was used to identify important predictors.  ( 2 min )
    Using Interpretable Machine Learning to Predict Maternal and Fetal Outcomes. (arXiv:2207.05322v1 [cs.LG])
Most pregnancies and births result in a good outcome, but complications are not uncommon, and when they do occur, they can be associated with serious implications for mothers and babies. Predictive modeling has the potential to improve outcomes through better understanding of risk factors, heightened surveillance, and more timely and appropriate interventions, thereby helping obstetricians deliver better care. For three types of complications we identify and study the most important risk factors using the Explainable Boosting Machine (EBM), a glass-box model, in order to gain intelligibility: (i) Severe Maternal Morbidity (SMM), (ii) shoulder dystocia, and (iii) preterm preeclampsia. While we use the interpretability of EBMs to reveal surprising insights into the features contributing to risk, our experiments show that EBMs match the accuracy of black-box ML methods such as deep neural nets and random forests.  ( 2 min )
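EBMs are available in the open-source `interpret` package; a minimal usage sketch on a stand-in public dataset (not the paper's clinical data) looks like this:

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in tabular dataset; the paper uses clinical obstetrics records.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ebm = ExplainableBoostingClassifier()
ebm.fit(X_tr, y_tr)
print("test accuracy:", ebm.score(X_te, y_te))

# The per-feature shape functions are what make the model a glass box:
# each feature's additive contribution to risk can be inspected directly.
global_explanation = ebm.explain_global()
```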
    RE-Tagger: A light-weight Real-Estate Image Classifier. (arXiv:2207.05696v1 [cs.CV])
Real-estate image tagging is one of the essential use cases for reducing the effort involved in manual annotation and enhancing the user experience. This paper proposes an end-to-end pipeline (referred to as RE-Tagger) for the real-estate image classification problem. We present a two-stage transfer learning approach using a custom InceptionV3 architecture to classify images into different categories (i.e., bedroom, bathroom, kitchen, balcony, hall, and others). Finally, we released the application as a REST API hosted as a web application running on a 2-core machine with 2 GB of RAM. The demo video is available here.  ( 2 min )
    Latent Variable Models for Bayesian Causal Discovery. (arXiv:2207.05723v1 [cs.LG])
    Learning predictors that do not rely on spurious correlations involves building causal representations. However, learning such a representation is very challenging. We, therefore, formulate the problem of learning a causal representation from high dimensional data and study causal recovery with synthetic data. This work introduces a latent variable decoder model, Decoder BCD, for Bayesian causal discovery and performs experiments in mildly supervised and unsupervised settings. We present a series of synthetic experiments to characterize important factors for causal discovery and show that using known intervention targets as labels helps in unsupervised Bayesian inference over structure and parameters of linear Gaussian additive noise latent structural causal models.  ( 2 min )
    EfficientLEAF: A Faster LEarnable Audio Frontend of Questionable Use. (arXiv:2207.05508v1 [cs.SD])
    In audio classification, differentiable auditory filterbanks with few parameters cover the middle ground between hard-coded spectrograms and raw audio. LEAF (arXiv:2101.08596), a Gabor-based filterbank combined with Per-Channel Energy Normalization (PCEN), has shown promising results, but is computationally expensive. With inhomogeneous convolution kernel sizes and strides, and by replacing PCEN with better parallelizable operations, we can reach similar results more efficiently. In experiments on six audio classification tasks, our frontend matches the accuracy of LEAF at 3% of the cost, but both fail to consistently outperform a fixed mel filterbank. The quest for learnable audio frontends is not solved.  ( 2 min )
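LEAF's learnable front end boils down to a bank of Gabor filters: sinusoids windowed by Gaussians whose center frequencies and bandwidths are the trainable parameters. A minimal fixed (non-learned) version in NumPy, with illustrative values:

```python
import numpy as np

def gabor_filter(center_hz, bandwidth_hz, sr=16000, size=401):
    """One Gabor filter: a cosine at `center_hz` under a Gaussian window."""
    t = (np.arange(size) - size // 2) / sr
    sigma = 1.0 / (2 * np.pi * bandwidth_hz)   # time-domain width from bandwidth
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    return envelope * np.cos(2 * np.pi * center_hz * t)

# 40 filters with geometrically spaced centers (mel-like coverage)
centers = np.geomspace(60, 7800, 40)
bank = np.stack([gabor_filter(c, 0.2 * c) for c in centers])

audio = np.random.randn(16000)                 # 1 s of dummy audio
features = np.stack([np.convolve(audio, f, mode="same") for f in bank])
```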
    Investigating the Impact of Independent Rule Fitnesses in a Learning Classifier System. (arXiv:2207.05582v1 [cs.LG])
Achieving at least some level of explainability requires complex analyses for many machine learning systems, such as common black-box models. We recently proposed a new rule-based learning system, SupRB, to construct compact, interpretable and transparent models by utilizing separate optimizers for the model selection tasks concerning rule discovery and rule set composition. This allows users to specifically tailor their model structure to fulfil use-case-specific explainability requirements. From an optimization perspective, this allows us to define clearer goals, and we find that -- in contrast to many state-of-the-art systems -- this allows us to keep rule fitnesses independent. In this paper we investigate this system's performance thoroughly on a set of regression problems and compare it against XCSF, a prominent rule-based learning system. We find the overall results of SupRB's evaluation comparable to XCSF's, while allowing easier control of model structure and showing a substantially smaller sensitivity to random seeds and data splits. This increased control can aid in subsequently providing explanations for both the training and the final structure of the model.  ( 2 min )
    Utilizing Excess Resources in Training Neural Networks. (arXiv:2207.05532v1 [cs.LG])
In this work, we suggest Kernel Filtering Linear Overparameterization (KFLO), where a linear cascade of filtering layers is used during training to improve network performance at test time. We implement this cascade in a kernel filtering fashion, which prevents the trained architecture from becoming unnecessarily deeper. This also allows using our approach with almost any network architecture, and lets us combine the filtering layers into a single layer at test time. Thus, our approach does not add computational complexity during inference. We demonstrate the advantage of KFLO on various network models and datasets in supervised learning.  ( 2 min )
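The collapse trick rests on the fact that a composition of linear maps is itself linear. A minimal sketch with fully connected layers (KFLO applies the same idea to filtering/convolution kernels; the class and names here are illustrative):

```python
import torch
import torch.nn as nn

class LinearCascade(nn.Module):
    """Overparameterized at training time; folds into one layer for inference."""
    def __init__(self, dim, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                    for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:   # training-time cascade
            x = layer(x)
        return x

    def collapse(self):
        # Multiply the weight matrices once; inference then costs one layer.
        W = self.layers[0].weight
        for layer in self.layers[1:]:
            W = layer.weight @ W
        merged = nn.Linear(W.shape[1], W.shape[0], bias=False)
        merged.weight = nn.Parameter(W.detach())
        return merged

m = LinearCascade(8)
x = torch.randn(2, 8)
assert torch.allclose(m(x), m.collapse()(x), atol=1e-5)
```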
    Long Short-Term Memory to predict 3D Amino acids Positions in GPCR Molecular Dynamics. (arXiv:2207.05682v1 [q-bio.BM])
G-Protein Coupled Receptors (GPCRs) are a large family of eukaryotic cell transmembrane proteins responsible for numerous biological processes. From a practical viewpoint, around 34% of the drugs approved by the US Food and Drug Administration target these receptors. They can be analyzed through their simulated molecular dynamics, including the prediction of their behavior in the presence of drugs. In this paper, the capability of Long Short-Term Memory networks (LSTMs) to learn and predict the molecular dynamics trajectories of a receptor is evaluated. Several models were trained on the 3D positions of the amino acids of the receptor, considering different transformations of the amino acid positions, such as their centers of mass, their geometric centers, and the position of the α-carbon of each amino acid. The prediction error was evaluated by the mean absolute error (MAE) and root-mean-square deviation (RMSD). The LSTM models show robust performance, with results comparable to the state of the art in non-dynamic 3D predictions. The best MAE and RMSD values were found for the centers of mass of the amino acids, at 0.078 Å and 0.156 Å respectively. This work shows the potential of LSTMs to predict the molecular dynamics of GPCRs.  ( 2 min )
    Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets. (arXiv:2202.01671v2 [stat.ML] UPDATED)
    The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically assume known data alignment and compare such operators in a pointwise manner. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.  ( 2 min )
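As a concrete anchor, the log-Euclidean distance between two SPD matrices is just the Frobenius distance between their matrix logarithms; the LES distance is built on a lower bound of this metric. A small worked example (the eigendecomposition-based log is exact for symmetric matrices):

```python
import numpy as np

def spd_log(M):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(A, B):
    """d(A, B) = || log(A) - log(B) ||_F."""
    return np.linalg.norm(spd_log(A) - spd_log(B), ord="fro")

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
A = X @ X.T + 5 * np.eye(5)      # well-conditioned SPD matrix
B = A + 0.1 * np.eye(5)          # a nearby SPD matrix
print(log_euclidean_distance(A, B))
```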
    An Introduction to Lifelong Supervised Learning. (arXiv:2207.04354v2 [cs.LG] UPDATED)
This primer is an attempt to provide a detailed summary of the different facets of lifelong learning. We start with Chapter 2, which provides a high-level overview of lifelong learning systems. In this chapter, we discuss prominent scenarios in lifelong learning (Section 2.4), provide a high-level organization of different lifelong learning approaches (Section 2.5), enumerate the desiderata for an ideal lifelong learning system (Section 2.6), discuss how lifelong learning is related to other learning paradigms (Section 2.7), and describe common metrics used to evaluate lifelong learning systems (Section 2.8). This chapter is most useful for readers who are new to lifelong learning and want an introduction to the field without focusing on specific approaches or benchmarks. The remaining chapters focus on specific aspects (either learning algorithms or benchmarks) and are more useful for readers who are looking for specific approaches or benchmarks. Chapter 3 focuses on regularization-based approaches that do not assume access to any data from previous tasks. Chapter 4 discusses memory-based approaches that typically use a replay buffer or an episodic memory to save a subset of data across different tasks. Chapter 5 focuses on different architecture families (and their instantiations) that have been proposed for training lifelong learning systems. Following these different classes of learning algorithms, we discuss the commonly used evaluation benchmarks and metrics for lifelong learning (Chapter 6) and wrap up with a discussion of future challenges and important research directions in Chapter 7.
    Tracking Objects as Pixel-wise Distributions. (arXiv:2207.05518v1 [cs.CV])
Multi-object tracking (MOT) requires detecting and associating objects through frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea on a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections through frames based on the pixel-wise prediction. P3AFormer yields 81.2% in terms of MOTA on the MOT17 benchmark -- the first among all transformer networks to reach 80% MOTA in the literature. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks.
    Robustness and Personalization in Federated Learning: A Unified Approach via Regularization. (arXiv:2009.06303v3 [cs.LG] UPDATED)
    We present a class of methods for robust, personalized federated learning, called Fed+, that unifies many federated learning algorithms. The principal advantage of this class of methods is to better accommodate the real-world characteristics found in federated training, such as the lack of IID data across parties, the need for robustness to outliers or stragglers, and the requirement to perform well on party-specific datasets. We achieve this through a problem formulation that allows the central server to employ robust ways of aggregating the local models while keeping the structure of local computation intact. Without making any statistical assumption on the degree of heterogeneity of local data across parties, we provide convergence guarantees for Fed+ for convex and non-convex loss functions under different (robust) aggregation methods. The Fed+ theory is also equipped to handle heterogeneous computing environments including stragglers without additional assumptions; specifically, the convergence results cover the general setting where the number of local update steps across parties can vary. We demonstrate the benefits of Fed+ through extensive experiments across standard benchmark datasets.
    Autotelic Agents with Intrinsically Motivated Goal-Conditioned Reinforcement Learning: a Short Survey. (arXiv:2012.09830v7 [cs.LG] UPDATED)
Building autonomous machines that can explore open-ended environments, discover possible interactions and build repertoires of skills is a general objective of artificial intelligence. Developmental approaches argue that this can only be achieved by autotelic agents: intrinsically motivated learning agents that can learn to represent, generate, select and solve their own problems. In recent years, the convergence of developmental approaches with deep reinforcement learning (RL) methods has been leading to the emergence of a new field: developmental reinforcement learning. Developmental RL is concerned with the use of deep RL algorithms to tackle a developmental problem -- the intrinsically motivated acquisition of open-ended repertoires of skills. The self-generation of goals requires the learning of compact goal encodings as well as their associated goal-achievement functions. This raises new challenges compared to standard RL algorithms originally designed to tackle pre-defined sets of goals using external reward signals. The present paper introduces developmental RL and proposes a computational framework based on goal-conditioned RL to tackle the intrinsically motivated skills acquisition problem. It proceeds to present a typology of the various goal representations used in the literature, before reviewing existing methods to learn to represent and prioritize goals in autonomous systems. We finally close the paper by discussing some open challenges in the quest for intrinsically motivated skills acquisition.
    Wasserstein multivariate auto-regressive models for modeling distributional time series and its application in graph learning. (arXiv:2207.05442v1 [stat.ML])
We propose a new auto-regressive model for the statistical analysis of multivariate distributional time series. The data of interest consist of a collection of multiple series of probability measures supported over a bounded interval of the real line, indexed by distinct time instants. The probability measures are modelled as random objects in the Wasserstein space. We establish the auto-regressive model in the tangent space at the Lebesgue measure by first centering all the raw measures so that their Fréchet means become the Lebesgue measure. Using the theory of iterated random function systems, results on the existence, uniqueness and stationarity of the solution of such a model are provided. We also propose a consistent estimator for the model coefficient. In addition to the analysis of simulated data, the proposed model is illustrated with two real data sets made of observations of age distributions in different countries and of the bike sharing network in Paris. Finally, due to the positivity and boundedness constraints that we impose on the model coefficients, the proposed estimator learned under these constraints naturally has a sparse structure. The sparsity furthermore allows the application of the proposed model to learning a graph of temporal dependency from multivariate distributional time series.
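For measures on the line, the Wasserstein tangent-space construction is concrete: a measure is encoded by its quantile function, and the log map at the Lebesgue reference on [0, 1] is F^{-1}(u) - u. Below is a hedged sketch of fitting an order-1 auto-regression on such tangent vectors; the paper additionally centers Fréchet means and constrains the coefficients, and everything here is illustrative:

```python
import numpy as np

u = np.linspace(0.01, 0.99, 99)                  # quantile grid on (0, 1)

def log_map(samples):
    """Tangent vector of the empirical measure at the Lebesgue reference."""
    return np.quantile(samples, u) - u

rng = np.random.default_rng(0)
# A synthetic distributional time series: 100 measures on [0, 1]
series = [rng.beta(2 + 0.5 * np.sin(t), 2, size=500) for t in range(100)]
V = np.stack([log_map(s) for s in series])       # (T, len(u)) tangent vectors

X, Y = V[:-1], V[1:]                             # AR(1): Y_t ~ X_{t-1} @ A
A, *_ = np.linalg.lstsq(X, Y, rcond=None)        # least-squares coefficient
one_step = V[-1] @ A                             # forecast next tangent vector
forecast_quantiles = one_step + u                # exp map back to a quantile fn
```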
    Zero-Shot Machine Unlearning. (arXiv:2201.05629v2 [cs.LG] UPDATED)
Modern privacy regulations grant citizens the right to be forgotten by products, services and companies. In the case of machine learning (ML) applications, this necessitates deletion of data not only from storage archives but also from ML models. Due to an increasing need for regulatory compliance required for ML applications, machine unlearning is becoming an emerging research problem. Right-to-be-forgotten requests come in the form of removal of a certain set or class of data from an already trained ML model. Practical considerations preclude retraining the model from scratch minus the deleted data. The few existing studies use either the whole training data, a subset of the training data, or some metadata stored during training to update the model weights for unlearning. However, strict regulatory compliance requires time-bound deletion of data. Thus, in many cases, no data related to the training process or training samples may be accessible even for the unlearning purpose. We therefore ask the question: is it possible to achieve unlearning with zero training samples? In this paper, we introduce the novel problem of zero-shot machine unlearning that caters for the extreme but practical scenario where zero original data samples are available for use. We then propose two novel solutions for zero-shot machine unlearning based on (a) error minimizing-maximizing noise and (b) gated knowledge transfer. These methods remove the information of the forget data from the model while maintaining the model efficacy on the retain data. The zero-shot approach offers good protection against model inversion attacks and membership inference attacks. We introduce a new evaluation metric, the Anamnesis Index (AIN), to effectively measure the quality of the unlearning method. The experiments show promising results for unlearning in deep learning models on benchmark vision datasets.
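To make the first ingredient concrete, here is a hedged sketch of error-maximizing noise: a noise batch is optimized to maximize the model's loss against the forget-class label, and briefly fine-tuning on it then impairs that class. This is one plausible reading of the idea, not the paper's exact recipe; hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def error_maximizing_noise(model, forget_class, shape, steps=100, lr=0.1):
    """Learn an input-shaped noise tensor that maximizes the loss of
    `model` for the class to be forgotten (no training data needed)."""
    model.eval()
    noise = torch.randn(shape, requires_grad=True)
    target = torch.full((shape[0],), forget_class, dtype=torch.long)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -F.cross_entropy(model(noise), target)  # gradient *ascent* on loss
        loss.backward()
        opt.step()
    return noise.detach()
```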
    CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning. (arXiv:2205.15083v2 [cs.LG] UPDATED)
Graph similarity learning refers to calculating the similarity score between two graphs, which is required in many realistic applications, such as visual tracking, graph classification, and collaborative filtering. While most existing graph neural networks yield effective representations of a single graph, little effort has been made to jointly learn two graph representations and calculate their similarity score. In addition, existing unsupervised graph similarity learning methods are mainly clustering-based, which ignores the valuable information embodied in graph pairs. To this end, we propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning, in order to calculate the similarity between any two input graph objects. Specifically, we generate two augmented views for each graph in a pair. Then, we employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning. The former strengthens the consistency of node representations across the two views; the latter identifies node differences between different graphs. Finally, we transform node representations into graph-level representations via pooling operations for graph similarity computation. We have evaluated CGMN on eight real-world datasets, and the experimental results show that the proposed approach is superior to state-of-the-art methods in graph similarity learning downstream tasks.
    Physical Passive Patch Adversarial Attacks on Visual Odometry Systems. (arXiv:2207.05729v1 [cs.CV])
    Deep neural networks are known to be susceptible to adversarial perturbations -- small perturbations that alter the output of the network and exist under strict norm limitations. While such perturbations are usually discussed as tailored to a specific input, a universal perturbation can be constructed to alter the model's output on a set of inputs. Universal perturbations present a more realistic case of adversarial attacks, as awareness of the model's exact input is not required. In addition, the universal attack setting raises the subject of generalization to unseen data, where given a set of inputs, the universal perturbations aim to alter the model's output on out-of-sample data. In this work, we study physical passive patch adversarial attacks on visual odometry-based autonomous navigation systems. A visual odometry system aims to infer the relative camera motion between two corresponding viewpoints, and is frequently used by vision-based autonomous navigation systems to estimate their state. For such navigation systems, a patch adversarial perturbation poses a severe security issue, as it can be used to mislead a system onto some collision course. To the best of our knowledge, we show for the first time that the error margin of a visual odometry model can be significantly increased by deploying patch adversarial attacks in the scene. We provide evaluation on synthetic closed-loop drone navigation data and demonstrate that a comparable vulnerability exists in real data. A reference implementation of the proposed method and the reported experiments is provided at https://github.com/patchadversarialattacks/patchadversarialattacks.
    Asteroid Flyby Cycler Trajectory Design Using Deep Neural Networks. (arXiv:2111.11858v3 [astro-ph.IM] UPDATED)
Asteroid exploration has been attracting more attention in recent years. Nevertheless, we have visited only tens of asteroids, while we have discovered more than one million bodies. Since our current observations and knowledge are likely biased, it is essential to explore multiple asteroids directly to better understand the remains of planetary building materials. One mission design solution is utilizing asteroid flyby cycler trajectories with multiple Earth gravity assists. An asteroid flyby cycler trajectory design problem is a subclass of global trajectory optimization problems with multiple flybys, involving a trajectory optimization problem for a given flyby sequence and a combinatorial optimization problem to decide the sequence of flybys. As the number of flyby bodies grows, the computation time of this optimization problem grows rapidly. This paper presents a new method to design asteroid flyby cycler trajectories utilizing a surrogate model constructed by deep neural networks approximating trajectory optimization results. Since one of the bottlenecks of machine learning approaches is the computation time needed to generate massive trajectory databases, we propose an efficient database generation strategy that introduces pseudo-asteroids satisfying the Karush-Kuhn-Tucker conditions. The numerical results, applied to JAXA's DESTINY+ mission, show that the proposed method is practically applicable to space mission design and can significantly reduce the computational time for searching asteroid flyby sequences.
    Deep Metric Learning-Based Semi-Supervised Regression With Alternate Learning. (arXiv:2202.11388v2 [cs.CV] UPDATED)
This paper introduces a novel deep metric learning-based semi-supervised regression (DML-S2R) method for parameter estimation problems. The proposed DML-S2R method aims to mitigate the problem of an insufficient amount of labeled samples without collecting any additional samples with target values. To this end, it consists of two main steps: i) pairwise similarity modeling with scarce labeled data; and ii) triplet-based metric learning with abundant unlabeled data. The first step aims to model pairwise sample similarities by using a small number of labeled samples. This is achieved by estimating the target value differences of labeled samples with a Siamese neural network (SNN). The second step aims to learn a triplet-based metric space (in which similar samples are close to each other and dissimilar samples are far apart) when the number of labeled samples is insufficient. This is achieved by employing the SNN of the first step for triplet-based deep metric learning that exploits not only labeled samples but also unlabeled samples. For the end-to-end training of DML-S2R, we investigate an alternate learning strategy for the two steps. Due to this strategy, the information encoded in each step guides the learning phase of the other step. The experimental results confirm the success of DML-S2R compared to state-of-the-art semi-supervised regression methods. The code of the proposed method is publicly available at https://git.tu-berlin.de/rsim/DML-S2R.
    Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders. (arXiv:2203.12742v2 [cs.LG] UPDATED)
    Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing \emph{in silico} and \emph{in vitro} properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.
    Docent: A content-based recommendation system to discover contemporary art. (arXiv:2207.05648v1 [cs.LG])
Recommendation systems have been widely used in various domains such as music, films, and e-shopping. After mostly avoiding digitization, the art world has recently reached a technological turning point due to the pandemic, with online sales growing significantly and quantitative online data about artists and artworks becoming available. In this work, we present a content-based recommendation system for contemporary art relying on images of artworks and contextual metadata about artists. We gathered and annotated artworks with advanced, art-specific information to create a completely unique database that was used to train our models. With this information, we built a proximity graph between artworks. Similarly, we used NLP techniques to characterize the practices of the artists, and we extracted information from exhibitions and other event histories to create a proximity graph between artists. The power of graph analysis enables us to provide an artwork recommendation system based on a combination of visual and contextual information from artworks and artists. In an assessment by a team of art specialists, on average 75% of the recommended artworks were rated as meaningful when compared with their professional evaluations.
    From Spectral Graph Convolutions to Large Scale Graph Convolutional Networks. (arXiv:2207.05669v1 [cs.LG])
Graph Convolutional Networks (GCNs) have been shown to be a powerful concept that has been successfully applied to a large variety of tasks across many domains over the past years. In this work we study the theory that paved the way to the definition of GCNs, including related parts of classical graph theory. We also discuss and experimentally demonstrate key properties and limitations of GCNs, such as those caused by the statistical dependency of samples introduced by the edges of the graph, which causes estimates of the full gradient to be biased. Another limitation we discuss is the negative impact of minibatch sampling on model performance; as a consequence, gradients are computed on the whole dataset during parameter updates, undermining scalability to large graphs. To account for this, we research alternative methods which allow us to safely learn good parameters while sampling only a subset of the data per iteration. We reproduce the results reported in the work of Kipf et al. and propose an implementation inspired by SIGN, a sampling-free minibatch method. Finally, we compare the two implementations on a benchmark dataset, showing that they are comparable in terms of prediction accuracy for the task of semi-supervised node classification.
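The appeal of SIGN is that graph diffusion is applied to the features once, up front, so training reduces to a plain minibatch MLP with no neighbor sampling. A minimal sketch (the identity adjacency and sizes below are placeholders):

```python
import torch
import torch.nn as nn

def sign_features(adj, X, k=2):
    """Precompute [X, AX, A^2 X, ...] with a (normalized) adjacency `adj`."""
    feats, cur = [X], X
    for _ in range(k):
        cur = adj @ cur
        feats.append(cur)
    return torch.cat(feats, dim=1)            # (n_nodes, (k+1)*d)

n, d, classes = 1000, 16, 7
adj = torch.eye(n)                            # placeholder for D^-1/2 A D^-1/2
X = torch.randn(n, d)
Z = sign_features(adj, X)                     # done once, before training
mlp = nn.Sequential(nn.Linear(Z.shape[1], 64), nn.ReLU(), nn.Linear(64, classes))
logits = mlp(Z[:128])                         # minibatches need no neighbors
```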
    CompoundE: Knowledge Graph Embedding with Translation, Rotation and Scaling Compound Operations. (arXiv:2207.05324v1 [cs.AI])
Translation, rotation, and scaling are three commonly used geometric manipulation operations in image processing. Some of them have also been used successfully in developing effective knowledge graph embedding (KGE) models, such as TransE and RotatE. Inspired by this synergy, we propose a new KGE model that leverages all three operations. Since translation, rotation, and scaling operations are cascaded to form a compound operation, the new model is named CompoundE. By casting CompoundE in the framework of group theory, we show that quite a few scoring-function-based KGE models are special cases of CompoundE. CompoundE extends the simple distance-based relation to relation-dependent compound operations on head and/or tail entities. To demonstrate the effectiveness of CompoundE, we conduct experiments on three popular KG completion datasets. Experimental results show that CompoundE consistently achieves state-of-the-art performance.
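A hedged sketch of a compound translate-rotate-scale score on 2-D embedding blocks, in the spirit of CompoundE (the paper's exact parameterization and head/tail operator placement may differ; all names and values are illustrative):

```python
import numpy as np

def compound_score(h, t, theta, scale, shift):
    """h, t: (d, 2) entity embeddings seen as d two-dimensional blocks."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    transformed = scale * (h @ R.T) + shift   # scale o rotate o translate the head
    return -np.linalg.norm(transformed - t)   # higher score = more plausible triple

rng = np.random.default_rng(0)
h, t = rng.standard_normal((2, 8, 2))
print(compound_score(h, t, theta=0.3, scale=1.2, shift=0.05))
```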
    Modern Views of Machine Learning for Precision Psychiatry. (arXiv:2204.01607v2 [cs.LG] UPDATED)
    In light of the NIMH's Research Domain Criteria (RDoC), the advent of functional neuroimaging, novel technologies and methods provide new opportunities to develop precise and personalized prognosis and diagnosis of mental disorders. Machine learning (ML) and artificial intelligence (AI) technologies are playing an increasingly critical role in the new era of precision psychiatry. Combining ML/AI with neuromodulation technologies can potentially provide explainable solutions in clinical practice and effective therapeutic treatment. Advanced wearable and mobile technologies also call for the new role of ML/AI for digital phenotyping in mobile mental health. In this review, we provide a comprehensive review of the ML methodologies and applications by combining neuroimaging, neuromodulation, and advanced mobile technologies in psychiatry practice. Additionally, we review the role of ML in molecular phenotyping and cross-species biomarker identification in precision psychiatry. We further discuss explainable AI (XAI) and causality testing in a closed-human-in-the-loop manner, and highlight the ML potential in multimedia information extraction and multimodal data fusion. Finally, we discuss conceptual and practical challenges in precision psychiatry and highlight ML opportunities in future research.
    Uniform Manifold Approximation with Two-phase Optimization. (arXiv:2205.00420v2 [cs.LG] UPDATED)
    We introduce Uniform Manifold Approximation with Two-phase Optimization (UMATO), a dimensionality reduction (DR) technique that improves UMAP to capture the global structure of high-dimensional data more accurately. In UMATO, optimization is divided into two phases so that the resulting embeddings can depict the global structure reliably while preserving the local structure with sufficient accuracy. In the first phase, hub points are identified and projected to construct a skeletal layout for the global structure. In the second phase, the remaining points are added to the embedding preserving the regional characteristics of local areas. Through quantitative experiments, we found that UMATO (1) outperformed widely used DR techniques in preserving the global structure while (2) producing competitive accuracy in representing the local structure. We also verified that UMATO is preferable in terms of robustness over diverse initialization methods, number of epochs, and subsampling techniques.
    Horizontal Federated Learning and Secure Distributed Training for Recommendation System with Intel SGX. (arXiv:2207.05079v1 [cs.LG])
With the advent of the big data era and the development of artificial intelligence and other technologies, data security and privacy protection have become more important. Recommendation systems have many applications in our society, but the model construction of recommendation systems is often inseparable from users' data. Especially for deep learning-based recommendation systems, due to the complexity of the model and the characteristics of deep learning itself, the training process not only requires a long training time and abundant computational resources but also needs a large amount of user data, which poses a considerable challenge in terms of data security and privacy protection. How to train a distributed recommendation system while ensuring data security has become an urgent problem to be solved. In this paper, we implement two schemes, Horizontal Federated Learning and Secure Distributed Training, based on Intel SGX (Software Guard Extensions), an implementation of a trusted execution environment, and the TensorFlow framework, to achieve secure, distributed recommendation-system learning schemes in different scenarios. We experiment on the classical Deep Learning Recommendation Model (DLRM), a neural network-based machine learning model designed for personalization and recommendation, and the results show that our implementation introduces essentially no loss in model performance. The training speed is within acceptable limits.
    TabSynDex: A Universal Metric for Robust Evaluation of Synthetic Tabular Data. (arXiv:2207.05295v1 [cs.LG])
Synthetic tabular data generation becomes crucial when real data is limited, expensive to collect, or simply cannot be used due to privacy concerns. However, producing good-quality synthetic data is challenging. Several probabilistic, statistical, and generative adversarial network (GAN)-based approaches have been presented for synthetic tabular data generation. Once generated, however, evaluating the quality of the synthetic data is quite challenging. Some traditional metrics have been used in the literature, but there is a lack of a common, robust, single metric. This makes it difficult to properly compare the effectiveness of different synthetic tabular data generation methods. In this paper we propose a new universal metric, TabSynDex, for the robust evaluation of synthetic data. TabSynDex assesses the similarity of synthetic data to real data through different component scores which evaluate the characteristics that are desirable for "high quality" synthetic data. Being a single-score metric, TabSynDex can also be used to observe and evaluate the training of neural network based approaches, which helps in obtaining insights that were not possible earlier. Further, we present several baseline models for comparative analysis of the proposed evaluation metric with existing generative models.
    Uncertainty-Aware Learning Against Label Noise on Imbalanced Datasets. (arXiv:2207.05471v1 [stat.ML])
Learning against label noise is a vital topic for guaranteeing reliable performance of deep neural networks. Recent research usually refers to dynamic noise modeling with model output probabilities and loss values, and then separates clean and noisy samples. These methods have gained notable success. However, unlike cherry-picked data, existing approaches often cannot perform well when facing imbalanced datasets, a common scenario in the real world. We thoroughly investigate this phenomenon and point out two major issues that hinder the performance, i.e., inter-class loss distribution discrepancy and misleading predictions due to uncertainty. The first issue is that existing methods often perform class-agnostic noise modeling. However, loss distributions show a significant discrepancy among classes under class imbalance, and class-agnostic noise modeling can easily confuse noisy samples with samples in minority classes. The second issue is that models may output misleading predictions due to epistemic uncertainty and aleatoric uncertainty, so existing methods that rely solely on the output probabilities may fail to distinguish confident samples. Inspired by our observations, we propose an Uncertainty-aware Label Correction framework (ULC) to handle label noise on imbalanced datasets. First, we perform epistemic-uncertainty-aware, class-specific noise modeling to identify trustworthy clean samples and refine/discard highly confident true/corrupted labels. Then, we introduce aleatoric uncertainty in the subsequent learning process to prevent noise accumulation in the label noise modeling process. We conduct experiments on several synthetic and real-world datasets. The results demonstrate the effectiveness of the proposed method, especially on imbalanced datasets.
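A minimal sketch of the class-specific part of the noise modeling, assuming per-sample losses are available: fit a two-component Gaussian mixture within each class and treat the low-loss mode as clean. The threshold and library choice are illustrative; the paper additionally folds in epistemic uncertainty.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def clean_mask(losses, labels, n_classes, tau=0.5):
    """Per-class GMM on losses: avoids minority classes looking 'noisy'
    just because their losses are higher than the majority's."""
    mask = np.zeros(len(losses), dtype=bool)
    for c in range(n_classes):
        idx = np.where(labels == c)[0]
        gmm = GaussianMixture(n_components=2, random_state=0)
        gmm.fit(losses[idx].reshape(-1, 1))
        clean_comp = gmm.means_.argmin()            # low-loss component
        p_clean = gmm.predict_proba(losses[idx].reshape(-1, 1))[:, clean_comp]
        mask[idx] = p_clean > tau
    return mask
```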
    Practical Attacks on Machine Learning: A Case Study on Adversarial Windows Malware. (arXiv:2207.05548v1 [cs.CR])
    While machine learning is vulnerable to adversarial examples, it still lacks systematic procedures and tools for evaluating its security in different application contexts. In this article, we discuss how to develop automated and scalable security evaluations of machine learning using practical attacks, reporting a use case on Windows malware detection.
    Efficient and Privacy Preserving Group Signature for Federated Learning. (arXiv:2207.05297v1 [cs.CR])
Federated Learning (FL) is a Machine Learning (ML) technique that aims to reduce threats to user data privacy. Training is done using the raw data on users' devices, called clients, and only the training results, called gradients, are sent to the server to be aggregated into an updated model. However, we cannot assume that the server can be trusted with private information, such as metadata related to the owner or source of the data, so hiding client information from the server helps reduce privacy-related attacks. Therefore, the privacy of the client's identity, along with the privacy of the client's data, is necessary to make such attacks more difficult. This paper proposes an efficient and privacy-preserving protocol for FL based on group signatures. A new group signature for federated learning, called GSFL, is designed to not only protect the privacy of the client's data and identity but also significantly reduce the computation and communication costs, considering the iterative process of federated learning. We show that GSFL outperforms existing approaches in terms of computation, communication, and signaling costs. Also, we show that the proposed protocol can handle various security attacks in the federated learning environment.
    Quantum Neural Network Classifiers: A Tutorial. (arXiv:2206.02806v2 [quant-ph] UPDATED)
Machine learning has achieved dramatic success over the past decade, with applications ranging from face recognition to natural language processing. Meanwhile, rapid progress has been made in the field of quantum computation, including the development of both powerful quantum algorithms and advanced quantum devices. The interplay between machine learning and quantum physics holds the intriguing potential for bringing practical applications to modern society. Here, we focus on quantum neural networks in the form of parameterized quantum circuits. We mainly discuss different structures and encoding strategies of quantum neural networks for supervised learning tasks, and benchmark their performance using Yao.jl, a quantum simulation package written in the Julia language. The codes are efficient, aiming to provide convenience for beginners in scientific work such as developing powerful variational quantum learning models and assisting the corresponding experimental demonstrations.
    A Baseline for Detecting Out-of-Distribution Examples in Image Captioning. (arXiv:2207.05418v1 [cs.CV])
Image captioning research achieved breakthroughs in recent years by developing neural models that can generate diverse and high-quality descriptions for images drawn from the same distribution as the training images. However, when facing out-of-distribution (OOD) images, such as corrupted images or images containing unknown objects, the models fail to generate relevant captions. In this paper, we consider the problem of OOD detection in image captioning. We formulate the problem and suggest an evaluation setup for assessing a model's performance on the task. Then, we analyze and show the effectiveness of the caption's likelihood score at detecting and rejecting OOD images, which implies that the relatedness between the input image and the generated caption is encapsulated within the score.
    Cognition in Dynamical Systems, Second Edition. (arXiv:1805.00787v2 [cs.MA] UPDATED)
    Cognition is the process of knowing. As carried out by a dynamical system, it is the process by which the system absorbs information into its state. A complex network of agents cognizes knowledge about its environment, internal dynamics and initial state by forming emergent, macro-level patterns. Such patterns require each agent to find its place while partially aware of the whole pattern. Such partial awareness can be achieved by separating the system dynamics into two parts by timescale: the propagation dynamics and the pattern dynamics. The fast propagation dynamics describe the spread of signals across the network. If they converge to a fixed point for any quasi-static state of the slow pattern dynamics, that fixed point represents an aggregate of macro-level information. On longer timescales, agents coordinate via positive feedback to form patterns, which are defined using closed walks in the graph of agents. Patterns can be coherent, in that every part of the pattern depends on every other part for context. Coherent patterns are acausal, in that (a) they cannot be predicted and (b) no part of the stored knowledge can be mapped to any part of the pattern, or vice versa. A cognitive network's knowledge is encoded or embodied by the selection of patterns which emerge. The theory of cognition summarized here can model autocatalytic reaction-diffusion systems, artificial neural networks, market economies and ant colony optimization, among many other real and virtual systems. This theory suggests a new understanding of complexity as a lattice of contexts rather than a single measure.
    Prediction of Maneuvering Status for Aerial Vehicles using Supervised Learning Methods. (arXiv:2206.10303v2 [cs.RO] UPDATED)
Aerial vehicles follow a guided approach based on latitude, longitude, and altitude. This information can be used for calculating the maneuvering status of aerial vehicles along the trajectory. This is a binary classification problem, and machine learning can be leveraged for solving such problems. In this paper we present a methodology for deriving the maneuvering status and its prediction using linear, distance-metric, discriminant analysis, and boosting ensemble supervised learning methods. In the results section we report various metrics that give a condensed comparison of the algorithms for predicting the maneuvering status.
    WeShort: Out-of-distribution Detection With Weak Shortcut structure. (arXiv:2207.05055v1 [cs.LG])
Neural networks have achieved impressive performance on data from the same distribution as the training set, but can produce overconfident, incorrect results for data these networks have never seen. Therefore, it is essential to detect whether inputs come from out-of-distribution (OOD) data in order to guarantee the safety of neural networks deployed in the real world. In this paper, we propose a simple and effective post-hoc technique, WeShort, to reduce the overconfidence of neural networks on OOD data. Our method is inspired by the observation of the internal residual structure, which shows the separation of OOD and in-distribution (ID) data in the shortcut layer. Our method is compatible with different OOD detection scores and can generalize well to different network architectures. We demonstrate our method on various OOD datasets to show its competitive performance and provide reasonable hypotheses to explain why our method works. On the ImageNet benchmark, WeShort achieves state-of-the-art performance on the false positive rate (FPR95) and the area under the receiver operating characteristic curve (AUROC) among the family of post-hoc methods.
    BASED-XAI: Breaking Ablation Studies Down for Explainable Artificial Intelligence. (arXiv:2207.05566v1 [cs.LG])
    Explainable artificial intelligence (XAI) methods lack ground truth. In its place, method developers have relied on axioms to determine desirable properties for their explanations' behavior. For high stakes uses of machine learning that require explainability, it is not sufficient to rely on axioms as the implementation, or its usage, can fail to live up to the ideal. As a result, there exists active research on validating the performance of XAI methods. The need for validation is especially magnified in domains with a reliance on XAI. A procedure frequently used to assess their utility, and to some extent their fidelity, is an ablation study. By perturbing the input variables in rank order of importance, the goal is to assess the sensitivity of the model's performance. Perturbing important variables should correlate with larger decreases in measures of model capability than perturbing less important features. While the intent is clear, the actual implementation details have not been studied rigorously for tabular data. Using five datasets, three XAI methods, four baselines, and three perturbations, we aim to show 1) how varying perturbations and adding simple guardrails can help to avoid potentially flawed conclusions, 2) how treatment of categorical variables is an important consideration in both post-hoc explainability and ablation studies, and 3) how to identify useful baselines for XAI methods and viable perturbations for ablation studies.
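Since the ablation protocol itself is the object of study, a minimal reference version helps fix ideas: perturb features in descending order of attributed importance and track the drop in accuracy. The sketch below uses a permutation perturbation and a scikit-learn-style `model.score`; both choices are illustrative, and choosing them carefully is exactly the paper's point.

```python
import numpy as np

def ablation_curve(model, X, y, importance, rng=np.random.default_rng(0)):
    """Perturb features most-important-first; a faithful attribution should
    make the accuracy curve decay fastest at the start."""
    order = np.argsort(importance)[::-1]         # most important first
    Xp = X.copy()
    accs = [model.score(Xp, y)]
    for j in order:
        Xp[:, j] = rng.permutation(Xp[:, j])     # permutation perturbation
        accs.append(model.score(Xp, y))
    return np.array(accs)
```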
    "That's so cute!": The CARE Dataset for Affective Response Detection. (arXiv:2201.11895v2 [cs.LG] UPDATED)
Social media plays an increasing role in our communication with friends and family, and in our consumption of information and entertainment. Hence, to design effective ranking functions for posts on social media, it would be useful to predict the affective response to a post (e.g., whether the user is likely to be humored, inspired, angered, or informed). Similar to work on emotion recognition (which focuses on the affect of the publisher of the post), the traditional approach to recognizing affective response would involve an expensive investment in human annotation of training data. We introduce CARE_db, a dataset of 230k social media posts annotated according to 7 affective responses using the Common Affective Response Expression (CARE) method. The CARE method is a means of leveraging the signal that is present in comments that are posted in response to a post, providing high-precision evidence about the affective response of the readers to the post without human annotation. Unlike human annotation, the annotation process we describe here can be iterated upon to expand the coverage of the method, particularly for new affective responses. We present experiments that demonstrate that the CARE annotations compare favorably with crowd-sourced annotations. Finally, we use CARE_db to train competitive BERT-based models for predicting affective response as well as emotion detection, demonstrating the utility of the dataset for related tasks.
    Using Machine Learning to Reduce Observational Biases When Detecting New Impacts on Mars. (arXiv:2207.05679v1 [cs.LG])
    The current inventory of recent (fresh) impacts on Mars shows a strong bias towards areas of low thermal inertia. These areas are generally visually bright, and impacts create dark scours and rays that make them easier to detect. It is expected that impacts occur at a similar rate in areas of higher thermal inertia, but those impacts are under-detected. This study investigates the use of a trained machine learning classifier to increase the detection of fresh impacts on Mars using CTX data. This approach discovered 69 new fresh impacts that have been confirmed with follow-up HiRISE images. We found that examining candidates partitioned by thermal inertia (TI) values, which is only possible due to the large number of machine learning candidates, helps reduce the observational bias and increase the number of known high-TI impacts.
    Dynamic Budget Throttling in Repeated Second-Price Auctions. (arXiv:2207.04690v2 [cs.GT] UPDATED)
    Throttling is one of the most popular budget control methods in today's online advertising markets. When a budget-constrained advertiser employs throttling, she can choose whether or not to participate in an auction after the advertising platform recommends a bid. This paper focuses on the dynamic budget throttling process in repeated second-price auctions from a theoretical view. An essential feature of the underlying problem is that the advertiser does not know the distribution of the highest competing bid upon entering the market. To model the difficulty of eliminating such uncertainty, we consider two different information structures. The advertiser could obtain the highest competing bid in each round with full-information feedback. Meanwhile, with partial information feedback, the advertiser could only have access to the highest competing bid in the auctions she participates in. We propose the OGD-CB algorithm, which involves simultaneous distribution learning and revenue optimization. In both settings, we demonstrate that this algorithm guarantees an $O(\sqrt{T\log T})$ regret with probability $1 - O(1/T)$ relative to the fluid adaptive throttling benchmark. By proving a lower bound of $\Omega(\sqrt{T})$ on the minimal regret for even the hindsight optimum, we establish the near optimality of our algorithm. Finally, we compare the fluid optimum of throttling to that of pacing, another widely adopted budget control method. The numerical relationship of these benchmarks sheds new light on the understanding of different online algorithms for revenue maximization under budget constraints.
    Shapley Computations Using Surrogate Model-Based Trees. (arXiv:2207.05214v1 [stat.ML])
    Shapley-related techniques have gained attention as both global and local interpretation tools because of their desirable properties. However, computing them via conditional expectations is expensive, and the approximation methods suggested in the literature have limitations. This paper proposes using a surrogate model-based tree to compute Shapley and SHAP values based on conditional expectation. Simulation studies show that the proposed algorithm improves accuracy, unifies global Shapley and SHAP interpretation, and offers a thresholding method to trade off running time and accuracy.
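    A hedged sketch of one plausible workflow in this spirit: fit a tree-based surrogate to a black-box model's predictions, then reuse the tree structure for fast SHAP computation via the `shap` library. The models and data are purely illustrative, and `shap.TreeExplainer` uses its own value estimation rather than the paper's exact conditional-expectation algorithm.

```python
# Surrogate-tree SHAP sketch (assumed workflow, not the paper's method).
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2])

# Black-box model, then a tree surrogate fit to its predictions.
black_box = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)
surrogate = GradientBoostingRegressor().fit(X, black_box.predict(X))

explainer = shap.TreeExplainer(surrogate)   # tree structure enables fast SHAP
shap_values = explainer.shap_values(X[:100])
print(shap_values.shape)                    # (100, 5): per-feature attributions
```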
    Benchmarking of eight recurrent neural network variants for breath phase and adventitious sound detection on a self-developed open-access lung sound database-HF_Lung_V1. (arXiv:2102.03049v3 [cs.SD] UPDATED)
    A reliable, remote, and continuous real-time respiratory sound monitor with automated respiratory sound analysis is urgently required in many clinical scenarios, such as monitoring the disease progression of coronavirus disease 2019, to replace conventional auscultation with a handheld stethoscope. However, a robust computerized respiratory sound analysis algorithm has not yet been validated in practical applications. In this study, we developed a lung sound database (HF_Lung_V1) comprising 9,765 audio files of lung sounds (duration of 15 s each), 34,095 inhalation labels, 18,349 exhalation labels, 13,883 continuous adventitious sound (CAS) labels (comprising 8,457 wheeze labels, 686 stridor labels, and 4,740 rhonchi labels), and 15,606 discontinuous adventitious sound labels (all crackles). We conducted benchmark tests of long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM (BiLSTM), bidirectional GRU (BiGRU), convolutional neural network (CNN)-LSTM, CNN-GRU, CNN-BiLSTM, and CNN-BiGRU models for breath phase detection and adventitious sound detection. We also compared the LSTM-based with the GRU-based models, unidirectional with bidirectional models, and models with and without a CNN. The results revealed that these models exhibited adequate performance in lung sound analysis. The GRU-based models outperformed the LSTM-based models in terms of F1 scores and areas under the receiver operating characteristic curve in most of the defined tasks. Furthermore, all bidirectional models outperformed their unidirectional counterparts. Finally, the addition of a CNN improved the accuracy of lung sound analysis, especially in the CAS detection tasks.
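    For readers unfamiliar with the CNN-RNN hybrids benchmarked here, the following PyTorch sketch shows the general shape of a CNN-BiGRU for frame-level detection; layer sizes, pooling, and the two-class head are our assumptions, not the paper's configuration.

```python
# Minimal CNN-BiGRU sketch for frame-level breath-phase detection.
# Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                     # per-frame features
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                     # pool frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # frame-wise labels

    def forward(self, spec):                          # spec: (B, 1, F, T)
        z = self.cnn(spec)                            # (B, 64, F//4, T)
        z = z.flatten(1, 2).transpose(1, 2)           # (B, T, 64 * F//4)
        out, _ = self.gru(z)
        return self.head(out)                         # (B, T, n_classes)

model = CNNBiGRU()
x = torch.randn(4, 1, 64, 300)    # 4 clips, 64 mel bins, 300 frames
print(model(x).shape)             # torch.Size([4, 300, 2])
```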
    Integrated multimodal artificial intelligence framework for healthcare applications. (arXiv:2202.12998v2 [cs.LG] UPDATED)
    Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on MIMIC-IV-MM, a multimodal clinical database (N=34,537 samples) containing 7,279 unique hospitalizations and 6,485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6-33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48-hour mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data type importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of our Holistic AI in Medicine (HAIM) framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.
    The Untold Impact of Learning Approaches on Software Fault-Proneness Predictions. (arXiv:2207.05710v1 [cs.SE])
    Software fault-proneness prediction is an active research area, with many factors affecting prediction performance extensively studied. However, the impact of the learning approach (i.e., the specifics of the data used for training and the target variable being predicted) on the prediction performance has not been studied, except for one initial work. This paper explores the effects of two learning approaches, useAllPredictAll and usePrePredictPost, on the performance of software fault-proneness prediction, both within-release and across-releases. The empirical results are based on data extracted from 64 releases of twelve open-source projects. Results show that the learning approach has a substantial, and typically unacknowledged, impact on the classification performance. Specifically, useAllPredictAll leads to significantly better performance than the usePrePredictPost learning approach, both within-release and across-releases. Furthermore, this paper uncovers that, for within-release predictions, this difference in classification performance is due to different levels of class imbalance in the two learning approaches. When class imbalance is addressed, the performance difference between the learning approaches is eliminated. Our findings imply that the learning approach should always be explicitly identified and its impact on software fault-proneness prediction considered. The paper concludes with a discussion of the potential consequences of our results for both research and practice.
    Exploring the Role of Task Transferability in Large-Scale Multi-Task Learning. (arXiv:2204.11117v2 [cs.CL] UPDATED)
    Recent work has found that multi-task training with a large number of diverse tasks can uniformly improve downstream performance on unseen target tasks. In contrast, literature on task transferability has established that the choice of intermediate tasks can heavily affect downstream task performance. In this work, we aim to disentangle the effect of scale and relatedness of tasks in multi-task representation learning. We find that, on average, increasing the scale of multi-task learning, in terms of the number of tasks, indeed results in better learned representations than smaller multi-task setups. However, if the target tasks are known ahead of time, then training on a smaller set of related tasks is competitive to the large-scale multi-task training at a reduced computational cost.
    A machine-learning-based tool for last closed magnetic flux surface reconstruction on tokamak. (arXiv:2207.05695v1 [physics.plasm-ph])
    Nuclear fusion power produced by tokamak devices is one of the most promising routes to a sustainable source of clean energy. A central challenge in tokamak research is to predict the last closed magnetic flux surface (LCFS), which is determined by the interaction of the actuator coils with the internal tokamak plasma. This task requires high-dimensional, high-frequency, high-fidelity, real-time tools, and is further complicated by the wide range of actuator-coil inputs that interact with the internal plasma state. In this work, we present a new machine learning model for reconstructing the LCFS that learns automatically from the experimental data of the Experimental Advanced Superconducting Tokamak (EAST). This architecture can check control strategy designs and can be integrated with the tokamak control system for real-time magnetic prediction. In the real-time modeling test, our approach achieves over 99% average similarity in LCFS reconstruction over the entire discharge process. In offline magnetic reconstruction, it reaches over 93% average similarity.
    Capturing Evolution Genes for Time Series Data. (arXiv:1905.05004v2 [cs.LG] UPDATED)
    The modeling of time series is becoming increasingly critical in a wide variety of applications. Overall, data evolves by following different patterns, which are generally caused by different user behaviors. Given a time series, we define the evolution gene to capture the latent user behaviors and to describe how those behaviors lead to the generation of the time series. In particular, we propose a uniform framework that recognizes the evolution genes of different segments by learning a classifier, and adopts an adversarial generator to implement the evolution gene by estimating the segments' distribution. Experimental results on a synthetic dataset and five real-world datasets show that our approach not only achieves good prediction results (e.g., +10.56% in F1 on average), but also provides explanations of those results.
    Equivariance versus Augmentation for Spherical Images. (arXiv:2202.03990v2 [cs.LG] UPDATED)
    We analyze the role of rotational equivariance in convolutional neural networks (CNNs) applied to spherical images. We compare the performance of the group equivariant networks known as S2CNNs and standard non-equivariant CNNs trained with an increasing amount of data augmentation. The chosen architectures can be considered baseline references for the respective design paradigms. Our models are trained and evaluated on single or multiple items from the MNIST or FashionMNIST dataset projected onto the sphere. For the task of image classification, which is inherently rotationally invariant, we find that by considerably increasing the amount of data augmentation and the size of the networks, it is possible for the standard CNNs to reach at least the same performance as the equivariant network. In contrast, for the inherently equivariant task of semantic segmentation, the non-equivariant networks are consistently outperformed by the equivariant networks with significantly fewer parameters. We also analyze and compare the inference latency and training times of the different networks, enabling detailed tradeoff considerations between equivariant architectures and data augmentation for practical problems. The equivariant spherical networks used in the experiments are available at https://github.com/JanEGerken/sem_seg_s2cnn .
    Federated Unlearning: How to Efficiently Erase a Client in FL?. (arXiv:2207.05521v1 [cs.LG])
    With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model forget some of its training data. We explore the problem of removing any client's contribution in federated learning (FL). During FL rounds, each client performs local training to learn a model that minimizes the empirical loss on their private data. We propose to perform unlearning at the client to be erased by reversing the learning process, i.e., training a model to \emph{maximize} the local empirical loss. In particular, we formulate the unlearning problem as a constrained maximization problem by restricting the model to an $\ell_2$-norm ball around a suitably chosen reference model, which helps retain some of the knowledge learnt from the other clients' data. This allows the client to use projected gradient descent to perform unlearning. The method requires neither global access to the data used for training nor a history of the parameter updates to be stored by the aggregator (server) or any of the clients. Experiments on the MNIST dataset show that the proposed unlearning method is efficient and effective.
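    A minimal sketch of the described client-side step, projected gradient *ascent* on the local loss within an $\ell_2$ ball around a reference model; the radius, learning rate, and toy model below are illustrative choices.

```python
# Sketch of unlearning by loss maximization with an L2-ball projection.
import torch

def project_to_ball(model, reference, radius):
    """Project model parameters onto an L2 ball around the reference."""
    with torch.no_grad():
        diff = torch.cat([(p - r).flatten()
                          for p, r in zip(model.parameters(),
                                          reference.parameters())])
        norm = diff.norm()
        if norm > radius:
            scale = radius / norm
            for p, r in zip(model.parameters(), reference.parameters()):
                p.copy_(r + scale * (p - r))

def unlearn_step(model, reference, batch, loss_fn, lr=0.01, radius=1.0):
    x, y = batch
    loss = loss_fn(model(x), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.add_(lr * p.grad)     # gradient *ascent*: maximize local loss
    project_to_ball(model, reference, radius)

net = torch.nn.Linear(10, 2)        # stand-in local model
ref = torch.nn.Linear(10, 2)        # stand-in reference model
batch = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
unlearn_step(net, ref, batch, torch.nn.CrossEntropyLoss())
```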
    PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision. (arXiv:2112.09290v2 [cs.CV] UPDATED)
    In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on various sizes of real-world data resulted in a keypoint AP increase of $+38.03$ ($44.43 \pm 0.17$ vs. $6.40$) for few-shot transfer (limited subsets of COCO-person train [2]), and an increase of $+1.47$ ($63.47 \pm 0.19$ vs. $62.00$) for abundant real data regimes, outperforming models trained with the same real data alone. We also found that our models outperformed those pre-trained with ImageNet with a keypoint AP increase of $+22.53$ ($44.43 \pm 0.17$ vs. $21.90$) for few-shot transfer and $+1.07$ ($63.47 \pm 0.19$ vs. $62.40$) for abundant real data regimes. This freely-available data generator should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.
    Pseudo value-based Deep Neural Networks for Multi-state Survival Analysis. (arXiv:2207.05291v1 [cs.LG])
    Multi-state survival analysis (MSA) uses multi-state models for the analysis of time-to-event data. In medical applications, MSA can provide insights about the complex disease progression in patients. A key challenge in MSA is the accurate subject-specific prediction of multi-state model quantities such as transition probability and state occupation probability in the presence of censoring. Traditional multi-state methods such as Aalen-Johansen (AJ) estimators and Cox-based methods are respectively limited by Markov and proportional hazards assumptions and are infeasible for making subject-specific predictions. Neural ordinary differential equations for MSA relax these assumptions but are computationally expensive and do not directly model the transition probabilities. To address these limitations, we propose a new class of pseudo-value-based deep learning models for multi-state survival analysis, where we show that pseudo values - designed to handle censoring - can be a natural replacement for estimating the multi-state model quantities when derived from a consistent estimator. In particular, we provide an algorithm to derive pseudo values from consistent estimators to directly predict the multi-state survival quantities from the subject's covariates. Empirical results on synthetic and real-world datasets show that our proposed models achieve state-of-the-art results under various censoring settings.
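    The standard jackknife pseudo-value construction that this line of work builds on can be stated in a few lines: for an estimator $\hat\theta$ computed on $n$ subjects, subject $i$'s pseudo-value is $n\hat\theta - (n-1)\hat\theta_{-i}$. The toy estimator below (a simple state-occupation proportion) stands in for the consistent multi-state estimators, such as Aalen-Johansen, used in the paper.

```python
# Jackknife pseudo-values with a toy state-occupation estimator.
import numpy as np

def pseudo_values(data, estimator):
    n = len(data)
    theta_full = estimator(data)
    return np.array([n * theta_full - (n - 1) * estimator(np.delete(data, i))
                     for i in range(n)])

# Toy example: probability of occupying state 1 at a fixed time.
occupied = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # 1 = in state at time t
pv = pseudo_values(occupied, np.mean)
print(pv)   # pseudo-values can then be regressed on covariates, e.g. by a DNN
```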
    PoeticTTS -- Controllable Poetry Reading for Literary Studies. (arXiv:2207.05549v1 [eess.AS])
    Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human-like naturalness, in order to enable literary scholars to systematically examine hypotheses on the interplay between text, spoken realisation, and the listener's perception of poems. To meet these special requirements for literary studies, we resynthesise poems by cloning prosodic values from a human reference recitation, and afterwards use fine-grained prosody control to manipulate the synthetic speech in a human-in-the-loop setting to alter the recitation w.r.t. specific phenomena. We find that finetuning our TTS model on poetry captures poetic intonation patterns to a large extent, which is beneficial for prosody cloning and manipulation, and we verify the success of our approach both in an objective evaluation and in human studies.
    Policy Diagnosis via Measuring Role Diversity in Cooperative Multi-agent RL. (arXiv:2207.05683v1 [cs.MA])
    Cooperative multi-agent reinforcement learning (MARL) is making rapid progress on tasks in grid worlds and real-world scenarios, in which agents are given different attributes and goals, resulting in different behavior throughout the multi-agent task. In this study, we quantify the agents' behavior differences and build their relationship with the policy performance via {\bf Role Diversity}, a metric to measure the characteristics of MARL tasks. We define role diversity from three perspectives: action-based, trajectory-based, and contribution-based, to fully measure a multi-agent task. Through theoretical analysis, we find that the error bound in MARL can be decomposed into three parts that have a strong relation to role diversity. The decomposed factors can significantly impact policy optimization in three popular directions: parameter sharing, communication mechanisms, and credit assignment. The main experimental platforms are based on the {\bf Multiagent Particle Environment (MPE)} and {\bf The StarCraft Multi-Agent Challenge (SMAC)}. Extensive experiments clearly show that role diversity can serve as a robust measurement of the characteristics of a multi-agent cooperation task and help diagnose whether the policy fits the current multi-agent system, leading to better policy performance.
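    As one concrete reading of the action-based perspective, the sketch below measures diversity as the average pairwise total-variation distance between agents' empirical action distributions; this is our simplification for illustration, not the paper's exact definition.

```python
# Illustrative action-based role diversity (our simplified definition).
import numpy as np

def action_based_role_diversity(action_counts):
    """action_counts: (n_agents, n_actions) histogram of actions taken."""
    dists = action_counts / action_counts.sum(axis=1, keepdims=True)
    n = len(dists)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 0.5 * np.abs(dists[i] - dists[j]).sum()  # TV distance
            pairs += 1
    return total / pairs

counts = np.array([[90, 5, 5], [10, 80, 10], [30, 30, 40]])
print(round(action_based_role_diversity(counts), 3))
```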
    Distributed Online System Identification for LTI Systems Using Reverse Experience Replay. (arXiv:2207.01062v1 [cs.LG] CROSS LISTED)
    Identification of linear time-invariant (LTI) systems plays an important role in control and reinforcement learning. Both asymptotic and finite-time offline system identification are well-studied in the literature. For online system identification, the idea of stochastic-gradient descent with reverse experience replay (SGD-RER) was recently proposed, where the data sequence is stored in several buffers and the stochastic-gradient descent (SGD) update performs backward in each buffer to break the time dependency between data points. Inspired by this work, we study distributed online system identification of LTI systems over a multi-agent network. We consider agents as identical LTI systems, and the network goal is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed variant of the SGD-RER algorithm, and theoretically characterize the improvement of the estimation error with respect to the network size. Our numerical experiments certify the reduction of estimation error as the network size grows.
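    A single-agent sketch of the underlying SGD-RER idea, estimating $A$ in $x_{t+1} = A x_t + w_t$ by replaying each buffer backward; buffer length and step size are illustrative, and the distributed DSGD-RER variant would additionally combine estimates across networked agents.

```python
# Minimal SGD with reverse experience replay for LTI system identification.
import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
T, buf_len, lr = 20_000, 50, 0.01

# Simulate the LTI trajectory x_{t+1} = A x_t + noise.
x = np.zeros((T + 1, 2))
for t in range(T):
    x[t + 1] = A_true @ x[t] + 0.1 * rng.normal(size=2)

A_hat = np.zeros((2, 2))
for start in range(0, T - buf_len, buf_len):
    for t in reversed(range(start, start + buf_len)):  # replay buffer backward
        err = A_hat @ x[t] - x[t + 1]
        A_hat -= lr * np.outer(err, x[t])              # SGD on squared error
print(np.round(A_hat, 2))   # should approach A_true
```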
    Accelerated Deep Lossless Image Coding with Unified Parallelized GPU Coding Architecture. (arXiv:2207.05152v1 [eess.IV])
    We propose Deep Lossless Image Coding (DLIC), a full resolution learned lossless image compression algorithm. Our algorithm is based on a neural network combined with an entropy encoder. The neural network performs a density estimation on each pixel of the source image. The density estimation is then used to code the target pixel, beating FLIF in terms of compression rate. Similar approaches have been attempted, but long run times make them infeasible for real world applications. We introduce a parallelized GPU based implementation, allowing for encoding and decoding of grayscale, 8-bit images in less than one second. Because DLIC uses a neural network to estimate the probabilities used for the entropy coder, DLIC can be trained on domain specific image data. We demonstrate this capability by adapting and training DLIC on Magnetic Resonance Imaging (MRI) images.
    On robust risk-based active-learning algorithms for enhanced decision support. (arXiv:2201.02555v2 [cs.LG] UPDATED)
    Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced risk-based active learning, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to expected value of perfect information (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning, and discriminative classification models. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance; with robustness to sampling bias dependent on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
    IMG-NILM: A Deep learning NILM approach using energy heatmaps. (arXiv:2207.05463v1 [cs.LG])
    Energy disaggregation estimates appliance-by-appliance electricity consumption from a single meter that measures the whole home's electricity demand. Compared with intrusive load monitoring, NILM (non-intrusive load monitoring) is low cost, easy to deploy, and flexible. In this paper, we propose a new method, coined IMG-NILM, that utilises convolutional neural networks (CNNs) to disaggregate electricity data represented as images. CNNs are proven to be efficient with images; hence, instead of the traditional representation of electricity data as time series, the data is transformed into heatmaps, with higher electricity readings portrayed as 'hotter' colours. The image representation is then used in a CNN to detect the signature of an appliance from aggregated data. IMG-NILM is flexible and shows consistent performance in disaggregating various types of appliances, including single- and multi-state appliances. It attains a test accuracy of up to 93% on the UK-DALE dataset within a single house, where a substantial number of appliances are present. In more challenging settings where electricity data is collected from different houses, IMG-NILM also attains a very good average accuracy of 85%.
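    The core transformation is easy to sketch: fold the 1-D aggregate power signal into a 2-D array and normalize it so higher readings map to 'hotter' values. The folding layout and normalization below are our assumptions about this preprocessing, not the paper's exact recipe.

```python
# Illustrative conversion of a power time series into a heatmap image.
import numpy as np

def series_to_heatmap(power, width):
    """Fold a 1-D power series row-by-row into a 2-D array in [0, 1]."""
    height = len(power) // width
    img = np.asarray(power[:height * width], float).reshape(height, width)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-9)   # 'hotter' = higher consumption

watts = 100 + 50 * np.sin(np.linspace(0, 20, 3600)) \
        + 400 * (np.random.rand(3600) > 0.97)   # base load + appliance spikes
heatmap = series_to_heatmap(watts, width=60)    # e.g. one row per minute
print(heatmap.shape)                            # (60, 60) image for the CNN
```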
    Fast Yet Effective Machine Unlearning. (arXiv:2111.08947v4 [cs.LG] UPDATED)
    Unlearning the data observed during the training of a machine learning (ML) model is an important task that can play a pivotal role in fortifying the privacy and security of ML-based applications. This paper raises the following questions: (i) can we unlearn a single or multiple classes of data from an ML model without looking at the full training data even once? (ii) can we make the process of unlearning fast and scalable to large datasets, and generalize it to different deep networks? We introduce a novel machine unlearning framework with error-maximizing noise generation and impair-repair based weight manipulation that offers an efficient solution to the above questions. An error-maximizing noise matrix is learned for the class to be unlearned using the original model. The noise matrix is used to manipulate the model weights to unlearn the targeted class of data. We introduce impair and repair steps for a controlled manipulation of the network weights. In the impair step, the noise matrix along with a very high learning rate is used to induce sharp unlearning in the model. Thereafter, the repair step is used to regain the overall performance. With very few update steps, we show excellent unlearning while substantially retaining the overall model accuracy. Unlearning multiple classes requires a similar number of update steps as for the single class, making our approach scalable to large problems. Our method is quite efficient in comparison to the existing methods, works for multi-class unlearning, doesn't put any constraints on the original optimization mechanism or network design, and works well in both small and large-scale vision tasks. This work is an important step towards fast and easy implementation of unlearning in deep networks. We will make the source code publicly available.
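    A condensed sketch of our reading of the impair-repair recipe: learn error-maximizing noise for the target class, take a high-learning-rate "impair" step on it, then a normal "repair" step on retained data. The stand-in linear model and all hyperparameters are illustrative, not the authors' code.

```python
# Sketch of error-maximizing noise plus impair/repair weight manipulation.
import torch
import torch.nn.functional as F

def learn_noise(model, target_class, shape, steps=50, lr=0.1):
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    y = torch.full((shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        loss = -F.cross_entropy(model(noise), y)   # maximize error on class
        opt.zero_grad(); loss.backward(); opt.step()
    return noise.detach()

model = torch.nn.Linear(20, 5)                     # stand-in classifier
noise = learn_noise(model, target_class=3, shape=(32, 20))

impair = torch.optim.SGD(model.parameters(), lr=0.5)   # very high LR
y_noise = torch.full((32,), 3, dtype=torch.long)
loss = F.cross_entropy(model(noise), y_noise)          # fit noise as class 3
impair.zero_grad(); loss.backward(); impair.step()     # sharp unlearning

repair = torch.optim.SGD(model.parameters(), lr=0.01)  # regain accuracy
x_keep, y_keep = torch.randn(64, 20), torch.randint(0, 3, (64,))
loss = F.cross_entropy(model(x_keep), y_keep)
repair.zero_grad(); loss.backward(); repair.step()
```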
    How Robust is your Fair Model? Exploring the Robustness of Diverse Fairness Strategies. (arXiv:2207.04581v2 [cs.LG] UPDATED)
    With the introduction of machine learning in high-stakes decision making, ensuring algorithmic fairness has become an increasingly important problem to solve. In response to this, many mathematical definitions of fairness have been proposed, and a variety of optimisation techniques have been developed, all designed to maximise a defined notion of fairness. However, fair solutions are reliant on the quality of the training data and can be highly sensitive to noise. Recent studies have shown that robustness (the ability of a model to perform well on unseen data) plays a significant role in the type of strategy that should be used when approaching a new problem and, hence, measuring the robustness of these strategies has become a fundamental problem. In this work, we therefore propose a new criterion to measure the robustness of various fairness optimisation strategies - the robustness ratio. We conduct multiple extensive experiments on five benchmark fairness datasets using three of the most popular fairness strategies with respect to four of the most popular definitions of fairness. Our experiments empirically show that fairness methods that rely on threshold optimisation are very sensitive to noise in all the evaluated datasets, despite mostly outperforming other methods. This is in contrast to the other two methods, which are less fair for low-noise scenarios but fairer for high-noise ones. To the best of our knowledge, we are the first to quantitatively evaluate the robustness of fairness optimisation strategies. This can potentially serve as a guideline in choosing the most suitable fairness strategy for various datasets.
    Building Korean Sign Language Augmentation (KoSLA) Corpus with Data Augmentation Technique. (arXiv:2207.05261v1 [cs.CL])
    We present an efficient framework of corpus construction for sign language translation. Aided by a simple but effective data augmentation technique, our method converts text into annotated forms with minimum information loss. Sign languages are composed of manual signals, non-manual signals, and iconic features. According to professional sign language interpreters, non-manual signals such as facial expressions and gestures play an important role in conveying exact meaning. By considering the linguistic features of sign language, our proposed framework is a first and unique attempt to build a multimodal sign language augmentation corpus (hereinafter referred to as the KoSLA corpus) containing both manual and non-manual modalities. The corpus we built demonstrates convincing results in the hospital context, showing improved performance with augmented datasets. To overcome data scarcity, we resorted to data augmentation techniques such as synonym replacement to boost the efficiency of our translation model and the available data, while maintaining the grammatical and semantic structures of sign language. For experimental support, we verify the effectiveness of the data augmentation technique and the usefulness of our corpus by performing a translation task between normal sentences and sign language annotations with two tokenizers. The results were convincing, showing that the BLEU scores with the KoSLA corpus were significantly better.
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v2 [cs.LG] UPDATED)
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.
    Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction. (arXiv:2203.11444v4 [cs.LG] UPDATED)
    Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
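    The basic operation behind such alignment, emitting a SMILES string rooted at a chosen atom, is directly available in RDKit; the paper's full procedure for aligning product and reactant roots is not reproduced in this sketch.

```python
# Emitting SMILES strings rooted at different atoms with RDKit.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
for root in range(3):
    smi = Chem.MolToSmiles(mol, rootedAtAtom=root, canonical=False)
    print(root, smi)   # same molecule, different string alignments
```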
    Efficient NLP Inference at the Edge via Elastic Pipelining. (arXiv:2207.05022v2 [cs.LG] UPDATED)
    Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases an app's memory footprint severalfold, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO of a few seconds, far exceeding the delay range acceptable to a user; pipelining layerwise model loading and execution does not hide IO either, due to the large skew between IO and computation delays. To this end, we propose WRX. Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, WRX reconciles the latency/memory tension via two novel techniques. First, model sharding: WRX manages model parameters as independently tunable shards and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer: WRX instantiates an IO/computation pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling in early stages; it judiciously selects, tunes, and assembles shards according to their importance for resource-elastic execution, which maximizes inference accuracy. Atop two commodity SoCs, we build WRX and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that WRX delivers high accuracies with 1--2 orders of magnitude lower memory, outperforming competitive baselines.
    Recent Developments in AI and USPTO Open Data. (arXiv:2207.05239v1 [cs.LG])
    The USPTO disseminates one of the largest publicly accessible repositories of scientific, technical, and commercial data worldwide. USPTO data has historically seen frequent use in fields such as patent analytics, economics, and prosecution & litigation tools. This article highlights an emerging class of use cases directed to the research, development, and application of artificial intelligence technology. Such use cases contemplate both the delivery of artificial intelligence capabilities for practical IP applications and the enablement of future state-of-the-art artificial intelligence research via USPTO data products. Examples from both within and beyond the USPTO are offered as case studies.
    Bi-fidelity Evolutionary Multiobjective Search for Adversarially Robust Deep Neural Architectures. (arXiv:2207.05321v1 [cs.LG])
    Deep neural networks have been found vulnerable to adversarial attacks, thus raising potential concerns in security-sensitive contexts. To address this problem, recent research has investigated the adversarial robustness of deep neural networks from the architectural point of view. However, searching for architectures of deep neural networks is computationally expensive, particularly when coupled with an adversarial training process. To meet this challenge, this paper proposes a bi-fidelity multiobjective neural architecture search (NAS) approach. First, we formulate the NAS problem for enhancing the adversarial robustness of deep neural networks as a multiobjective optimization problem. Specifically, in addition to a low-fidelity performance predictor as the first objective, we leverage an auxiliary objective whose value is the output of a surrogate model trained with high-fidelity evaluations. Second, we reduce the computational cost by combining three performance estimation methods, i.e., parameter sharing, low-fidelity evaluation, and a surrogate-based predictor. The effectiveness of the proposed approach is confirmed by extensive experiments conducted on the CIFAR-10, CIFAR-100 and SVHN datasets.
    A Dataset Perspective on Offline Reinforcement Learning. (arXiv:2111.04714v2 [cs.LG] UPDATED)
    The application of Reinforcement Learning (RL) in real-world environments can be expensive or risky due to sub-optimal policies during training. In Offline RL, this problem is avoided since interactions with an environment are prohibited. Policies are learned from a given dataset, which solely determines their performance. Despite this fact, how dataset characteristics influence Offline RL algorithms has hardly been investigated. The dataset characteristics are determined by the behavioral policy that samples the dataset. Therefore, we characterize behavioral policies as exploratory when they yield high expected information in their interaction with the Markov Decision Process (MDP) and as exploitative when they have high expected return. We implement two corresponding empirical measures for the datasets sampled by the behavioral policy in deterministic MDPs. The first empirical measure, SACo, is defined by the normalized number of unique state-action pairs and captures exploration. The second empirical measure, TQ, is defined by the normalized average trajectory return and captures exploitation. Empirical evaluations show the effectiveness of TQ and SACo. In large-scale experiments using our proposed measures, we show that the unconstrained off-policy Deep Q-Network family requires datasets with high SACo to find a good policy. Furthermore, experiments show that policy-constraint algorithms perform well on datasets with high TQ and SACo. Finally, the experiments show that purely dataset-constrained Behavioral Cloning performs competitively with the best Offline RL algorithms for datasets with high TQ.
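    Both measures are simple enough to state in code. The sketch below follows their verbal definitions; the normalization constants (the best observed return and the full state-action space size) are our simplifying assumptions, since the paper's exact normalization is not given in the abstract.

```python
# Toy implementations of the TQ and SACo dataset measures.
import numpy as np

def tq_saco(trajectories, n_states, n_actions):
    """trajectories: list of [(state, action, reward), ...] episodes."""
    returns = [sum(r for _, _, r in traj) for traj in trajectories]
    pairs = {(s, a) for traj in trajectories for s, a, _ in traj}
    tq = np.mean(returns) / max(returns)        # normalized average return
    saco = len(pairs) / (n_states * n_actions)  # normalized unique (s, a) pairs
    return tq, saco

data = [[(0, 1, 1.0), (1, 0, 0.0)], [(0, 0, 1.0), (2, 1, 1.0)]]
print(tq_saco(data, n_states=3, n_actions=2))   # (0.75, 0.666...)
```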
    Unsupervised learning of observation functions in state-space models by nonparametric moment methods. (arXiv:2207.05242v1 [stat.ML])
    We investigate the unsupervised learning of non-invertible observation functions in nonlinear state-space models. Assuming abundant data of the observation process along with the distribution of the state process, we introduce a nonparametric generalized moment method to estimate the observation function via constrained regression. The major challenge comes from the non-invertibility of the observation function and the lack of data pairs between the state and observation. We address the fundamental issue of identifiability from quadratic loss functionals and show that the function space of identifiability is the closure of a RKHS that is intrinsic to the state process. Numerical results show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, leading to convergent estimators. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.
    IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training. (arXiv:2207.05333v1 [cs.CV])
    Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods propose to adopt an off-the-shelf object detector to utilize additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting the model capacity. Inspired by the observation that texts incorporate incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets with a small extra computational cost.
    The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress. (arXiv:2207.05691v1 [cs.LG])
    The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, which contains audio-visual recordings of German football coaches labelled for the presence of humour; (ii) the Hume-Reaction dataset, in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities; and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, comprising audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained `in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for MuSe-Humor, a mean (over 7 classes) Pearson's correlation coefficient of .2801 for MuSe-Reaction, and Concordance Correlation Coefficients (CCC) of .4931 and .4761 for valence and arousal, respectively, in MuSe-Stress.
    Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent. (arXiv:2207.05705v1 [math.PR])
    The convergence of stochastic interacting particle systems in the mean-field limit to solutions to conservative stochastic partial differential equations is shown, with optimal rate of convergence. As a second main result, a quantitative central limit theorem for such SPDEs is derived, again with optimal rate of convergence. The results apply in particular to the convergence in the mean-field scaling of stochastic gradient descent dynamics in overparametrized, shallow neural networks to solutions to SPDEs. It is shown that the inclusion of fluctuations in the limiting SPDE improves the rate of convergence, and retains information about the fluctuations of stochastic gradient descent in the continuum limit.
    Denoising single images by feature ensemble revisited. (arXiv:2207.05176v1 [cs.CV])
    Image denoising is still a challenging issue in many computer vision sub-domains. Recent studies show that significant improvements are possible in a supervised setting. However, a few challenges, such as spatial fidelity and cartoon-like smoothing, remain unresolved or decisively overlooked. Our study proposes a simple yet efficient architecture for the denoising problem that addresses the aforementioned issues. The proposed architecture revisits the concept of modular concatenation, instead of long and deeper cascaded connections, to recover a cleaner approximation of the given image. We find that different modules can capture versatile representations, and that the concatenated representation creates a richer subspace for low-level image restoration. The proposed architecture has fewer parameters than most previous networks and still achieves significant improvements over the current state-of-the-art networks.
    A Data-Based Perspective on Transfer Learning. (arXiv:2207.05739v1 [cs.LG])
    It is commonly believed that in transfer learning including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness as well as detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer learning performance from ImageNet on a variety of target tasks. Code is available at https://github.com/MadryLab/data-transfer
    Histopathological Imaging Classification of Breast Tissue for Cancer Diagnosis Support Using Deep Learning Models. (arXiv:2207.05057v1 [eess.IV])
    Breast histopathology images stained with Hematoxylin and Eosin are considered the gold standard for cancer diagnosis. Based on the idea of dividing the whole-slide pathology image (WSI) into multiple patches, we slid a [512,512] window from left to right and from top to bottom, with each sliding step overlapping by 50%, to augment a dataset of 400 images gathered from the ICIAR 2018 Grand Challenge. We then used the EfficientNet model to classify the histopathological images of breast cancer into 4 types: Normal, Benign, Carcinoma, Invasive Carcinoma. EfficientNet is a recently developed model that uniformly scales the width, depth, and resolution of the network with a set of fixed scaling factors and is well suited to training on high-resolution images. The results give rather competitive classification performance, achieving 98% accuracy on the training set and 93% on the evaluation set.
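    The patch-extraction scheme described above amounts to a strided crop with a 256-pixel step; a minimal sketch, assuming a NumPy image array:

```python
# Extracting [512, 512] patches with 50% overlap.
import numpy as np

def extract_patches(wsi, size=512, overlap=0.5):
    step = int(size * (1 - overlap))              # 256-pixel stride
    h, w = wsi.shape[:2]
    return [wsi[y:y + size, x:x + size]
            for y in range(0, h - size + 1, step)
            for x in range(0, w - size + 1, step)]

image = np.zeros((2048, 1536, 3), dtype=np.uint8)  # stand-in WSI
patches = extract_patches(image)
print(len(patches), patches[0].shape)              # 35 patches of (512, 512, 3)
```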
    An Information-Theoretic Analysis for Transfer Learning: Error Bounds and Applications. (arXiv:2207.05377v1 [cs.IT])
    Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms, following a line of work initiated by Russo and Xu. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu||\mu')$ plays an important role in the characterizations, where $\mu$ and $\mu'$ denote the distributions of the training and testing data, respectively. Specifically, we provide generalization error upper bounds for the empirical risk minimization (ERM) algorithm where data from both distributions are available in the training phase. We further apply the analysis to approximated ERM methods such as the Gibbs algorithm and the stochastic gradient descent method. We then generalize the mutual information bound with $\phi$-divergence and Wasserstein distance. These generalizations lead to tighter bounds and can handle the case when $\mu$ is not absolutely continuous with respect to $\mu'$. Furthermore, we apply a new set of techniques to obtain an alternative upper bound which gives a fast (and optimal) learning rate for some learning problems. Finally, inspired by the derived bounds, we propose the InfoBoost algorithm, in which the importance weights for source and target data are adjusted adaptively in accordance with information measures. The empirical results show the effectiveness of the proposed algorithm.
    Reactive Exploration to Cope with Non-Stationarity in Lifelong Reinforcement Learning. (arXiv:2207.05742v1 [cs.LG])
    In lifelong learning, an agent learns throughout its entire life without resets, in a constantly changing environment, as we humans do. Consequently, lifelong learning comes with a plethora of research problems such as continual domain shifts, which result in non-stationary rewards and environment dynamics. These non-stationarities are difficult to detect and cope with due to their continuous nature. Therefore, exploration strategies and learning methods are required that are capable of tracking the steady domain shifts, and adapting to them. We propose Reactive Exploration to track and react to continual domain shifts in lifelong reinforcement learning, and to update the policy correspondingly. To this end, we conduct experiments in order to investigate different exploration strategies. We empirically show that representatives of the policy-gradient family are better suited for lifelong learning, as they adapt more quickly to distribution shifts than Q-learning. Thereby, policy-gradient methods profit the most from Reactive Exploration and show good results in lifelong learning with continual domain shifts. Our code is available at: https://github.com/ml-jku/reactive-exploration.
    SWIS: Self-Supervised Representation Learning For Writer Independent Offline Signature Verification. (arXiv:2202.13078v2 [cs.CV] UPDATED)
    Writer-independent offline signature verification is one of the most challenging tasks in pattern recognition, as there is often a scarcity of training data. To handle this data-scarcity problem, in this paper we propose a novel self-supervised learning (SSL) framework for writer-independent offline signature verification. To our knowledge, this is the first attempt to utilize a self-supervised setting for the signature verification task. The objective of self-supervised representation learning from the signature images is achieved by minimizing the cross-covariance between two random variables belonging to different feature directions and ensuring a positive cross-covariance between random variables denoting the same feature direction. This ensures that the features are decorrelated linearly and that redundant information is discarded. Through experimental results on different datasets, we obtained encouraging results.
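    A hedged sketch of such a decorrelation objective, in the Barlow Twins style: off-diagonal entries of the cross-covariance between two embedding views are pushed toward zero while diagonal entries are kept positive. The exact SWIS loss and weighting may differ.

```python
# Illustrative cross-covariance decorrelation loss for two embedding views.
import torch

def decorrelation_loss(z1, z2, off_diag_weight=5e-3):
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-9)   # standardize each dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-9)
    c = (z1.T @ z2) / z1.shape[0]                 # cross-covariance matrix
    on_diag = ((1 - torch.diagonal(c)) ** 2).sum()       # same direction: +1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # others: 0
    return on_diag + off_diag_weight * off_diag

z_a, z_b = torch.randn(128, 64), torch.randn(128, 64)   # two augmented views
print(decorrelation_loss(z_a, z_b).item())
```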
    Remote sensing and AI for building climate adaptation applications. (arXiv:2107.02693v2 [cs.LG] UPDATED)
    Urban areas are not only among the biggest contributors to climate change, but also, with their large populations, among the areas most vulnerable to its negative impacts. In this paper, we address some of the opportunities brought by satellite remote sensing imaging and artificial intelligence (AI) for measuring the climate adaptation of cities automatically. We propose a framework combining AI and simulation which may be useful for extracting indicators from remote-sensing images and may help with the predictive estimation of future states of these climate-adaptation-related indicators. When such models become more robust and are used in real-life applications, they may help decision makers and early responders to choose the best actions to sustain the well-being of society, natural resources and biodiversity. We underline that this is an open field and an ongoing area of research for many scientists, and therefore we offer an in-depth discussion of the challenges and limitations of data-driven methods and of predictive estimation models in general.
    DeepTx: Deep Learning Beamforming with Channel Prediction. (arXiv:2202.07998v3 [eess.SP] UPDATED)
    Machine learning algorithms have recently been considered for many tasks in the field of wireless communications. Previously, we have proposed the use of a deep fully convolutional neural network (CNN) for receiver processing and shown it to provide considerable performance gains. In this study, we focus on machine learning algorithms for the transmitter. In particular, we consider beamforming and propose a CNN which, for a given uplink channel estimate as input, outputs downlink channel information to be used for beamforming. The CNN is trained in a supervised manner considering both uplink and downlink transmissions with a loss function that is based on UE receiver performance. The main task of the neural network is to predict the channel evolution between uplink and downlink slots, but it can also learn to handle inefficiencies and errors in the whole chain, including the actual beamforming phase. The provided numerical experiments demonstrate the improved beamforming performance.
    Optimal Clustering with Noisy Queries via Multi-Armed Bandit. (arXiv:2207.05376v1 [cs.LG])
    Motivated by many applications, we study clustering with a faulty oracle. In this problem, there are $n$ items belonging to $k$ unknown clusters, and the algorithm is allowed to ask the oracle whether two items belong to the same cluster or not. However, the answer from the oracle is correct only with probability $\frac{1}{2}+\frac{\delta}{2}$. The goal is to recover the hidden clusters with minimum number of noisy queries. Previous works have shown that the problem can be solved with $O(\frac{nk\log n}{\delta^2} + \text{poly}(k,\frac{1}{\delta}, \log n))$ queries, while $\Omega(\frac{nk}{\delta^2})$ queries is known to be necessary. So, for any values of $k$ and $\delta$, there is still a non-trivial gap between upper and lower bounds. In this work, we obtain the first matching upper and lower bounds for a wide range of parameters. In particular, a new polynomial time algorithm with $O(\frac{n(k+\log n)}{\delta^2} + \text{poly}(k,\frac{1}{\delta}, \log n))$ queries is proposed. Moreover, we prove a new lower bound of $\Omega(\frac{n\log n}{\delta^2})$, which, combined with the existing $\Omega(\frac{nk}{\delta^2})$ bound, matches our upper bound up to an additive $\text{poly}(k,\frac{1}{\delta},\log n)$ term. To obtain the new results, our main ingredient is an interesting connection between our problem and multi-armed bandit, which might provide useful insights for other similar problems.
    Solving a directed percolation inverse problem. (arXiv:2201.12222v3 [cond-mat.dis-nn] UPDATED)
    We present a directed percolation inverse problem for diode networks: Given information about which pairs of nodes allow current to percolate from one to the other, can one find a configuration of diodes consistent with the observed currents? We implement a divide-and-concur iterative projection method for solving the problem and demonstrate the supremacy of our method over an exhaustive approach for nontrivial instances of the problem. We find that the problem is most difficult when some but not all of the percolation data are hidden, and that the most difficult networks to reconstruct generally are those for which the currents are most sensitive to the addition or removal of a single diode.
    End-to-end speech recognition modeling from de-identified data. (arXiv:2207.05469v1 [eess.AS])
    De-identification of data used for automatic speech recognition modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to a significant performance degradation in particular for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate the performance of this method on in-house data of medical conversations and observe a recovery of almost the entire performance degradation in the general word error rate while still maintaining a strong diarization performance. Our main focus is the improvement of recall and precision in the recognition of PII-related words. Depending on the PII category, between $50\% - 90\%$ of the performance degradation can be recovered using our proposed method.
    WheaCha: A Method for Explaining the Predictions of Models of Code. (arXiv:2102.04625v3 [cs.LG] UPDATED)
    Attribution methods have emerged as a popular approach to interpreting model predictions based on the relevance of input features. Although the feature importance ranking can provide insights into how models arrive at a prediction from a raw input, it does not give a clear-cut definition of the key features models use for the prediction. In this paper, we present a new method, called WheaCha, for explaining the predictions of code models. Although WheaCha employs the same mechanism of tracing model predictions back to the input features, it differs from all existing attribution methods in crucial ways. Specifically, WheaCha divides an input program into "wheat" (i.e., the defining features that are the reason for which models predict the label that they predict) and the rest, "chaff", for any prediction of a learned code model. We realize WheaCha in a tool, HuoYan, and use it to explain four prominent code models: code2vec, seq-GNN, GGNN, and CodeBERT. Results show that (1) HuoYan is efficient, taking on average under twenty seconds to compute the wheat for an input program in an end-to-end fashion (i.e., including model prediction time); (2) the wheat that all models use to predict input programs is made of simple syntactic or even lexical properties (i.e., identifier names); and (3) based on wheat, we present a novel approach to explaining the predictions of code models through the lens of training data.
    Learning with Noisy Labels by Efficient Transition Matrix Estimation to Combat Label Miscorrection. (arXiv:2111.14932v2 [cs.LG] UPDATED)
    Recent studies on learning with noisy labels have shown remarkable performance by exploiting a small clean dataset. In particular, model-agnostic meta-learning-based label correction methods further improve performance by correcting noisy labels on the fly. However, there is no safeguard against label miscorrection, resulting in unavoidable performance degradation. Moreover, every training step requires at least three back-propagations, significantly slowing down the training speed. To mitigate these issues, we propose a robust and efficient method that learns a label transition matrix on the fly. Employing the transition matrix makes the classifier skeptical about all the corrected samples, which alleviates the miscorrection issue. We also introduce a two-head architecture to efficiently estimate the label transition matrix every iteration within a single back-propagation, so that the estimated matrix closely follows the shifting noise distribution induced by label correction. Extensive experiments demonstrate that our approach achieves the best training efficiency while having comparable or better accuracy than existing methods.
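    A minimal sketch of training through a label transition matrix, the core device described above: the classifier's clean-label probabilities are mapped through T before the loss, which keeps it skeptical about possibly miscorrected labels. The two-head estimation of T from the paper is not reproduced; here T is a fixed assumed matrix.

```python
import torch
import torch.nn.functional as F

def transition_adjusted_loss(logits, noisy_labels, T):
    """T[i, j] = P(observed label j | true label i); rows sum to 1."""
    clean_probs = logits.softmax(dim=1)   # model's P(true class | x)
    noisy_probs = clean_probs @ T         # implied P(observed label | x)
    return F.nll_loss(noisy_probs.clamp_min(1e-8).log(), noisy_labels)

num_classes = 3
T = torch.full((num_classes, num_classes), 0.1)
T.fill_diagonal_(0.8)                     # assumed 20% symmetric label noise
logits = torch.randn(8, num_classes, requires_grad=True)
labels = torch.randint(num_classes, (8,))
transition_adjusted_loss(logits, labels, T).backward()
print(logits.grad.shape)
```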
    MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes. (arXiv:2205.09248v2 [cs.SD] UPDATED)
    We propose a mesh-based neural network (MESH2IR) to generate acoustic impulse responses (IRs) for indoor 3D scenes represented using a mesh. The IRs are used to create a high-quality sound experience in interactive applications and audio processing. Our method can handle input triangular meshes with arbitrary topologies (2K - 3M triangles). We present a novel training technique to train MESH2IR using energy decay relief and highlight its benefits. We also show that training MESH2IR on IRs preprocessed using our proposed technique significantly improves the accuracy of IR generation. We reduce the non-linearity in the mesh space by transforming 3D scene meshes to latent space using a graph convolution network. Our MESH2IR is more than 200 times faster than a geometric acoustic algorithm on a CPU and can generate more than 10,000 IRs per second on an NVIDIA GeForce RTX 2080 Ti GPU for a given furnished indoor 3D scene. Acoustic metrics are used to characterize the acoustic environment, and we show that the metrics of the IRs predicted by MESH2IR match the ground truth with less than 10% error. We also highlight the benefits of MESH2IR on audio and speech processing applications such as speech dereverberation and speech separation. To the best of our knowledge, ours is the first neural-network-based approach to predict IRs from a given 3D scene mesh in real-time.
    Brain-inspired Graph Spiking Neural Networks for Commonsense Knowledge Representation and Reasoning. (arXiv:2207.05561v1 [cs.NE])
    How neural networks in the human brain represent commonsense knowledge and complete related reasoning tasks is an important research topic in neuroscience, cognitive science, psychology, and artificial intelligence. Although traditional artificial neural networks that use fixed-length vectors to represent symbols have achieved good performance on some specific tasks, they remain black boxes that lack interpretability, far from how humans perceive the world. Inspired by the grandmother-cell hypothesis in neuroscience, this work investigates how population encoding and spike-timing-dependent plasticity (STDP) mechanisms can be integrated into the learning of spiking neural networks, and how a population of neurons can represent a symbol by guiding the completion of sequential firing between different neuron populations. The neuron populations of different communities together constitute the entire commonsense knowledge graph, forming a giant graph spiking neural network. Moreover, we introduce the reward-modulated spike-timing-dependent plasticity (R-STDP) mechanism to simulate the biological reinforcement learning process and complete the related reasoning tasks accordingly, achieving comparable accuracy and faster convergence than graph convolutional artificial neural networks. For neuroscience and cognitive science, this work provides a foundation of computational modeling for further exploration of the way the human brain represents commonsense knowledge. For artificial intelligence, it indicates a direction for realizing more robust and interpretable neural networks by constructing a commonsense knowledge representation and reasoning spiking neural network with solid biological plausibility.
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v3 [stat.ML] UPDATED)
    This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputations by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, which may differ from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.
    MAVIPER: Learning Decision Tree Policies for Interpretable Multi-Agent Reinforcement Learning. (arXiv:2205.12449v2 [cs.LG] UPDATED)
    Many recent breakthroughs in multi-agent reinforcement learning (MARL) require the use of deep neural networks, which are challenging for human experts to interpret and understand. On the other hand, existing work on interpretable reinforcement learning (RL) has shown promise in extracting more interpretable decision tree-based policies from neural networks, but only in the single-agent setting. To fill this gap, we propose the first set of algorithms that extract interpretable decision-tree policies from neural networks trained with MARL. The first algorithm, IVIPER, extends VIPER, a recent method for single-agent interpretable RL, to the multi-agent setting. We demonstrate that IVIPER learns high-quality decision-tree policies for each agent. To better capture coordination between agents, we propose a novel centralized decision-tree training algorithm, MAVIPER. MAVIPER jointly grows the trees of each agent by predicting the behavior of the other agents using their anticipated trees, and uses resampling to focus on states that are critical for its interactions with other agents. We show that both algorithms generally outperform the baselines and that MAVIPER-trained agents achieve better-coordinated performance than IVIPER-trained agents on three different multi-agent particle-world environments.
    Transferability-Guided Cross-Domain Cross-Task Transfer Learning. (arXiv:2207.05510v1 [cs.CV])
    We propose two novel transferability metrics, F-OTCE (Fast Optimal Transport based Conditional Entropy) and JC-OTCE (Joint Correspondence OTCE), to evaluate how much the source model (task) can benefit the learning of the target task and to learn more transferable representations for cross-domain cross-task transfer learning. Unlike the existing metric that requires evaluating the empirical transferability on auxiliary tasks, our metrics are auxiliary-free, such that they can be computed much more efficiently. Specifically, F-OTCE estimates transferability by first solving an Optimal Transport (OT) problem between source and target distributions, and then uses the optimal coupling to compute the Negative Conditional Entropy between source and target labels. It can also serve as a loss function to maximize the transferability of the source model before finetuning on the target task. Meanwhile, JC-OTCE improves the transferability robustness of F-OTCE by including label distances in the OT problem, though it may incur additional computation cost. Extensive experiments demonstrate that F-OTCE and JC-OTCE outperform state-of-the-art auxiliary-free metrics by 18.85% and 28.88%, respectively, in correlation coefficient with the ground-truth transfer accuracy. By eliminating the training cost of auxiliary tasks, the two metrics reduce the total computation time of the previous method from 43 minutes to 9.32s and 10.78s, respectively, for a pair of tasks. When used as a loss function, F-OTCE shows consistent improvements on the transfer accuracy of the source model in few-shot classification experiments, with up to 4.41% accuracy gain.
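    A numpy sketch of the F-OTCE recipe as described: solve an entropic OT problem between source and target features, push the coupling onto label pairs, and score transferability as the negative conditional entropy of target labels given source labels. The Sinkhorn solver, epsilon, and iteration counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    """Entropic OT plan between uniform marginals for cost matrix C."""
    n, m = C.shape
    a, b = np.full(n, 1 / n), np.full(m, 1 / m)
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]    # coupling pi, shape (n, m)

def f_otce(Xs, ys, Xt, yt):
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    pi = sinkhorn(C / C.max())            # normalize costs for stability
    joint = np.zeros((ys.max() + 1, yt.max() + 1))   # P(Y_s, Y_t) under pi
    np.add.at(joint,
              (ys[:, None].repeat(len(yt), 1), yt[None, :].repeat(len(ys), 0)),
              pi)
    cond = joint / np.clip(joint.sum(1, keepdims=True), 1e-12, None)
    h = -(joint * np.log(np.clip(cond, 1e-12, None))).sum()
    return -h                             # higher means more transferable

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(50, 8)), rng.integers(0, 3, 50)
Xt, yt = rng.normal(size=(40, 8)), rng.integers(0, 4, 40)
print(f_otce(Xs, ys, Xt, yt))
```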
    DDI Prediction via Heterogeneous Graph Attention Networks. (arXiv:2207.05672v1 [cs.LG])
    Polypharmacy, defined as the use of multiple drugs together, is a standard treatment method, especially for severe and chronic diseases. However, using multiple drugs together may cause interactions between drugs. A drug-drug interaction (DDI) occurs when the impact of one drug changes when it is combined with another. DDIs may obstruct, increase, or decrease the intended effect of either drug or, in the worst-case scenario, create adverse side effects. While it is critical to detect DDIs on time, it is time-consuming and expensive to identify them in clinical trials due to the short duration of trials and the many possible drug pairs to be considered for testing. As a result, computational methods are needed for predicting DDIs. In this paper, we present a novel heterogeneous graph attention model, HAN-DDI, to predict drug-drug interactions. We create a heterogeneous network of drugs with different biological entities. Then, we develop a heterogeneous graph attention network to learn DDIs using relations of drugs with other entities. It consists of an attention-based heterogeneous graph node encoder for obtaining drug node representations and a decoder for predicting drug-drug interactions. Further, we conduct comprehensive experiments to evaluate our model and to compare it with state-of-the-art models. Experimental results show that our proposed method, HAN-DDI, outperforms the baselines significantly and accurately predicts DDIs, even for new drugs.
    Large Language Models Can Be Strong Differentially Private Learners. (arXiv:2110.05679v3 [cs.LG] UPDATED)
    Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and attempts at straightforwardly applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained models; (2) hyperparameters that suit DP optimization; and (3) fine-tuning objectives aligned with the pretraining procedure. With these factors set right, we obtain private NLP models that outperform state-of-the-art private training approaches and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained models tends not to suffer from dimension-dependent performance degradation.
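    For readers unfamiliar with DP-SGD, the sketch below shows the textbook update (clip each per-example gradient, then add Gaussian noise) for logistic regression; the paper's memory-saving clipping technique for large Transformers is not reproduced here.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, C=1.0, sigma=1.0, rng=None):
    """One DP-SGD step for logistic regression on (X, y)."""
    rng = rng or np.random.default_rng(0)
    p = 1 / (1 + np.exp(-(X @ w)))
    per_example = (p - y)[:, None] * X                  # per-example gradients
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    clipped = per_example / np.maximum(1.0, norms / C)  # clip to l2 norm C
    noise = rng.normal(0, sigma * C, size=w.shape)      # Gaussian mechanism
    return w - lr * (clipped.sum(0) + noise) / len(X)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(32, 5)), rng.integers(0, 2, 32)
w = np.zeros(5)
for _ in range(10):
    w = dp_sgd_step(w, X, y, rng=rng)
print(w)
```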
    Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment. (arXiv:2203.15937v3 [eess.AS] UPDATED)
    Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced by an ensemble of the online model on-the-fly, which ensures that our model is robust to pseudo label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to the state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.
    Accelerated Reinforcement Learning for Temporal Logic Control Objectives. (arXiv:2205.04424v3 [cs.RO] UPDATED)
    This paper addresses the problem of learning control policies for mobile robots, modeled as unknown Markov Decision Processes (MDPs), that are tasked with temporal logic missions, such as sequencing, coverage, or surveillance. The MDP captures uncertainty in the workspace structure and the outcomes of control decisions. The control objective is to synthesize a control policy that maximizes the probability of accomplishing a high-level task, specified as a Linear Temporal Logic (LTL) formula. To address this problem, we propose a novel accelerated model-based reinforcement learning (RL) algorithm for LTL control objectives that is capable of learning control policies significantly faster than related approaches. Its sample-efficiency relies on biasing exploration towards directions that may contribute to task satisfaction. This is accomplished by leveraging an automaton representation of the LTL task as well as a continuously learned MDP model. Finally, we provide comparative experiments that demonstrate the sample efficiency of the proposed method against recent RL methods for LTL objectives.
    Improving the Robustness and Generalization of Deep Neural Network with Confidence Threshold Reduction. (arXiv:2206.00913v2 [cs.LG] UPDATED)
    Deep neural networks are easily attacked by imperceptible perturbations. At present, adversarial training (AT) is the most effective method for enhancing the robustness of a model against adversarial examples. However, because adversarial training solves a min-max optimization problem, robustness and generalization are in tension compared with natural training: improving the model's robustness decreases its generalization. To address this issue, in this paper, a new concept, namely the confidence threshold (CT), is introduced, and reducing the confidence threshold, known as confidence threshold reduction (CTR), is proven to improve both the generalization and robustness of the model. Specifically, to reduce the CT for natural training (i.e., for natural training with CTR), we propose a mask-guided divergence loss function (MDL) consisting of a cross-entropy loss term and an orthogonal term. Empirical and theoretical analysis demonstrates that the MDL loss improves the robustness and generalization of the model simultaneously for natural training. However, the robustness improvement of natural training with CTR is not comparable to that of adversarial training. Therefore, for adversarial training, we propose a standard deviation loss function (STD), which minimizes the difference in the probabilities of the wrong categories, to reduce the CT by being integrated into the loss function of adversarial training. Empirical and theoretical analysis demonstrates that the STD-based loss function can further improve the robustness of the adversarially trained model while keeping the natural accuracy unchanged or slightly improved.
    Differentiable Physics Simulations with Contacts: Do They Have Correct Gradients w.r.t. Position, Velocity and Control?. (arXiv:2207.05060v1 [cs.LG])
    In recent years, an increasing amount of work has focused on differentiable physics simulation and has produced a set of open source projects such as Tiny Differentiable Simulator, Nimble Physics, diffTaichi, Brax, Warp, Dojo and DiffCoSim. By making physics simulations end-to-end differentiable, we can perform gradient-based optimization and learning tasks. A majority of differentiable simulators consider collisions and contacts between objects, but they use different contact models for differentiability. In this paper, we overview four kinds of differentiable contact formulations - linear complementarity problems (LCP), convex optimization models, compliant models and position-based dynamics (PBD). We analyze and compare the gradients calculated by these models and show that the gradients are not always correct. We also demonstrate their ability to learn an optimal control strategy by comparing the learned strategies with the optimal strategy in an analytical form. The codebase to reproduce the experiment results is available at https://github.com/DesmondZhong/diff_sim_grads.
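    A toy version of the question in the title: simulate a 1-D elastic bounce with a simple velocity-flip contact rule and compare finite-difference gradients at two step sizes. Disagreement between the two signals a discretization-sensitive, potentially incorrect gradient; this is an illustrative setup, not one of the paper's benchmark simulators.

```python
import numpy as np

def simulate(v0, dt, T=1.0, e=1.0):
    """Ball from height 1 with initial velocity v0; elastic velocity-flip contact."""
    x, v, t = 1.0, v0, 0.0
    while t < T:
        v -= 9.81 * dt                   # gravity
        x += v * dt
        if x < 0.0:                      # penetration detected this step
            x, v = -x, -e * v            # reflect position, flip velocity
        t += dt
    return x

def fd_grad(f, v0, h=1e-4):
    return (f(v0 + h) - f(v0 - h)) / (2 * h)

for dt in (1e-2, 1e-4):
    g = fd_grad(lambda v: simulate(v, dt), v0=0.5)
    print(f"dt={dt:g}  d(final height)/d(v0) ~= {g:.4f}")
```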
    DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization. (arXiv:2207.05631v1 [cs.LG])
    Recent algorithms designed for reinforcement learning tasks focus on finding a single optimal solution. However, in many practical applications, it is important to develop reasonable agents with diverse strategies. In this paper, we propose Diversity-Guided Policy Optimization (DGPO), an on-policy framework for discovering multiple strategies for the same task. Our algorithm uses diversity objectives to guide a latent code conditioned policy to learn a set of diverse strategies in a single training procedure. Specifically, we formalize our algorithm as the combination of a diversity-constrained optimization problem and an extrinsic-reward-constrained optimization problem. We solve the constrained optimization as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently finds diverse strategies in a wide variety of reinforcement learning tasks. We further show that DGPO achieves a higher diversity score and has similar sample complexity and performance compared to other baselines.
    CANF-VC: Conditional Augmented Normalizing Flows for Video Compression. (arXiv:2207.05315v1 [cs.CV])
    This paper presents an end-to-end learning-based video compression system, termed CANF-VC, based on conditional augmented normalizing flows (ANF). Most learned video compression systems adopt the same hybrid-based coding architecture as the traditional codecs. Recent research on conditional coding has shown the sub-optimality of the hybrid-based coding and opens up opportunities for deep generative models to take a key role in creating new coding frameworks. CANF-VC represents a new attempt that leverages the conditional ANF to learn a video generative model for conditional inter-frame coding. We choose ANF because it is a special type of generative model, which includes variational autoencoder as a special case and is able to achieve better expressiveness. CANF-VC also extends the idea of conditional coding to motion coding, forming a purely conditional coding framework. Extensive experimental results on commonly used datasets confirm the superiority of CANF-VC to the state-of-the-art methods.
    Synergistic Self-supervised and Quantization Learning. (arXiv:2207.05432v1 [cs.CV])
    With the success of self-supervised learning (SSL), it has become a mainstream paradigm to fine-tune from self-supervised pretrained models to boost the performance on downstream tasks. However, we find that current SSL models suffer severe accuracy drops when performing low-bit quantization, prohibiting their deployment in resource-constrained applications. In this paper, we propose a method called synergistic self-supervised and quantization learning (SSQL) to pretrain quantization-friendly self-supervised models facilitating downstream deployment. SSQL contrasts the features of the quantized and full precision models in a self-supervised fashion, where the bit-width for the quantized model is randomly selected in each step. SSQL not only significantly improves the accuracy when quantized to lower bit-widths, but also boosts the accuracy of full precision models in most cases. By only training once, SSQL can then benefit various downstream tasks at different bit-widths simultaneously. Moreover, the bit-width flexibility is achieved without additional storage overhead, requiring only one copy of weights during training and inference. We theoretically analyze the optimization process of SSQL, and conduct exhaustive experiments on various benchmarks to further demonstrate the effectiveness of our method. Our code is available at https://github.com/megvii-research/SSQL-ECCV2022.
    Representation learning with function call graph transformations for malware open set recognition. (arXiv:2205.06918v3 [cs.CR] UPDATED)
    The open set recognition (OSR) problem has been a challenge in many machine learning (ML) applications, such as security. As new/unknown malware families emerge regularly, it is difficult to exhaust samples that cover all the classes for the training process in ML systems. An advanced malware classification system should classify the known classes correctly while remaining sensitive to the unknown class. In this paper, we introduce a self-supervised pre-training approach for the OSR problem in malware classification. We propose two transformations for the function call graph (FCG) based malware representations to facilitate the pretext task. We also present a statistical thresholding approach to find the optimal threshold for the unknown class. Moreover, experimental results indicate that our proposed pre-training process can improve the performance of different downstream loss functions for the OSR problem.
    Susceptibility of Continual Learning Against Adversarial Attacks. (arXiv:2207.05225v1 [cs.LG])
    Recent advances in continual (incremental or lifelong) learning have concentrated on preventing forgetting, which can have catastrophic consequences, but two outstanding challenges must be addressed. The first is evaluating the robustness of the proposed methods. The second, ensuring the security of learned tasks, remains largely unexplored. This paper presents a comprehensive study of the susceptibility of continually learned tasks (both current and previously learned) to adversarial attacks. Such vulnerability of tasks to adversarial attacks raises profound issues in data integrity and privacy. We consider the task incremental learning (Task-IL) scenario and explore three regularization-based experiments, three replay-based experiments, and one hybrid technique based on the replay and exemplar approach. We examine the robustness of these methods. In particular, we consider cases where we demonstrate that any class belonging to the current or previously learned tasks is prone to misclassification. Our observations highlight the potential limitations of existing Task-IL approaches. Our empirical study recommends that the research community consider the robustness of proposed continual learning approaches and invest extensive efforts in mitigating catastrophic forgetting.
    Simultaneously Learning Stochastic and Adversarial Bandits under the Position-Based Model. (arXiv:2207.05437v1 [cs.LG])
    Online learning to rank (OLTR) interactively learns to choose lists of items from a large collection based on certain click models that describe users' click behaviors. Most recent works for this problem focus on the stochastic environment where the item attractiveness is assumed to be invariant during the learning process. In many real-world scenarios, however, the environment could be dynamic or even arbitrarily changing. This work studies the OLTR problem in both stochastic and adversarial environments under the position-based model (PBM). We propose a method based on the follow-the-regularized-leader (FTRL) framework with Tsallis entropy and develop a new self-bounding constraint especially designed for PBM. We prove the proposed algorithm simultaneously achieves $O(\log{T})$ regret in the stochastic environment and $O(m\sqrt{nT})$ regret in the adversarial environment, where $T$ is the number of rounds, $n$ is the number of items and $m$ is the number of positions. We also provide a lower bound of order $\Omega(m\sqrt{nT})$ for adversarial PBM, which matches our upper bound and improves over the state-of-the-art lower bound. The experiments show that our algorithm could simultaneously learn in both stochastic and adversarial environments and is competitive compared to existing methods that are designed for a single environment.
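    A sketch of the FTRL weight computation with 1/2-Tsallis entropy that algorithms in this family build on (the paper's PBM-specific self-bounding constraint is not reproduced): given cumulative loss estimates L, the weights take the form w_i = 4/(eta*(L_i - x))^2, with the normalizer x < min(L) found by Newton's method.

```python
import numpy as np

def tsallis_weights(L, eta, iters=50):
    """FTRL arm probabilities for 1/2-Tsallis entropy regularization."""
    x = L.min() - 2.0 / eta                       # start below the root
    for _ in range(iters):                        # Newton root-finding
        w = 4.0 / (eta * (L - x)) ** 2
        f = w.sum() - 1.0
        fp = (8.0 / eta ** 2) * ((L - x) ** -3).sum()
        x -= f / fp
    return 4.0 / (eta * (L - x)) ** 2

L = np.array([3.0, 1.0, 4.0, 1.5])                # cumulative loss estimates
w = tsallis_weights(L, eta=0.5)
print(w, w.sum())                                 # favors low-loss arms, sums to 1
```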
    A Benchmark dataset for predictive maintenance. (arXiv:2207.05466v1 [cs.LG])
    The paper describes the Railway dataset, an outcome of a Predictive Maintenance project with an urban metro public transportation service in Porto, Portugal. The data was collected between 2020 and 2022 as part of a project that aimed to develop machine learning methods for online anomaly detection and failure prediction. By capturing several analog sensor signals (pressure, temperature, current consumption), digital signals (control signals, discrete signals), and GPS information (latitude, longitude, and speed), we provide a framework that can be easily used and extended for new machine learning methods. We believe this dataset has some interesting characteristics and can be a good benchmark for predictive maintenance models.
    Transformer Compressed Sensing via Global Image Tokens. (arXiv:2203.12861v3 [cs.CV] UPDATED)
    Convolutional neural networks (CNN) have demonstrated outstanding Compressed Sensing (CS) performance compared to traditional, hand-crafted methods. However, they are broadly limited in terms of generalisability, inductive bias and difficulty in modelling long-distance relationships. Transformer neural networks (TNN) overcome such issues by implementing an attention mechanism designed to capture dependencies between inputs. However, high-resolution tasks typically require vision Transformers (ViT) to decompose an image into patch-based tokens, limiting inputs to inherently local contexts. We propose a novel image decomposition that naturally embeds images into low-resolution inputs. These Kaleidoscope tokens (KD) provide a mechanism for global attention, at the same computational cost as a patch-based approach. To showcase this development, we replace CNN components in a well-known CS-MRI neural network with TNN blocks and demonstrate the improvements afforded by KD. We also propose an ensemble of image tokens, which enhances overall image quality and reduces model size. Supplementary material is available: https://github.com/uqmarlonbran/TCS.git
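    One plausible reading of the decomposition described above: instead of cutting the image into contiguous patches, take every f-th pixel to form f*f low-resolution sub-images, each spanning the full field of view. Whether this matches the authors' exact Kaleidoscope transform is an assumption; see the linked repository for their version.

```python
import numpy as np

def kaleidoscope_split(img, f):
    """(H, W) -> (f*f, H//f, W//f): strided sub-images, each full field of view."""
    H, W = img.shape
    assert H % f == 0 and W % f == 0
    return (img.reshape(H // f, f, W // f, f)
               .transpose(1, 3, 0, 2)
               .reshape(f * f, H // f, W // f))

def kaleidoscope_merge(tokens, f):
    n, h, w = tokens.shape
    return (tokens.reshape(f, f, h, w)
                  .transpose(2, 0, 3, 1)
                  .reshape(h * f, w * f))

img = np.arange(64.0).reshape(8, 8)
tokens = kaleidoscope_split(img, 2)
assert np.allclose(kaleidoscope_merge(tokens, 2), img)   # lossless round trip
print(tokens.shape)                                      # (4, 4, 4)
```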
    Improved Rates for Differentially Private Stochastic Convex Optimization with Heavy-Tailed Data. (arXiv:2106.01336v5 [cs.LG] UPDATED)
    We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy (DP). Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu \cite{WangXDX20}, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under concentrated DP for convex and strongly convex loss functions. Along the way, we derive new algorithms for private mean estimation of heavy-tailed distributions, under both pure and concentrated DP. Finally, we prove nearly-matching lower bounds for private stochastic convex optimization with strongly convex losses and mean estimation, showing new separations between pure and concentrated DP.
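    A sketch of the clip-then-perturb template that private mean estimation of heavy-tailed data builds on: truncate each sample to an l2 ball of radius R to bound sensitivity, average, and add noise calibrated to that sensitivity. The paper's refined estimators and the moment-based choice of R are not reproduced.

```python
import numpy as np

def private_clipped_mean(X, R, eps, delta, rng=None):
    """Clip samples to an l2 ball of radius R, then release a noisy mean."""
    rng = rng or np.random.default_rng(0)
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    clipped = X * np.minimum(1.0, R / norms)
    sensitivity = 2 * R / len(X)                  # replace-one l2 sensitivity
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped.mean(0) + rng.normal(0, sigma, X.shape[1])

rng = np.random.default_rng(1)
X = rng.standard_t(df=3, size=(2000, 2))          # heavy-tailed samples
print(private_clipped_mean(X, R=5.0, eps=1.0, delta=1e-5))
```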
    EAGAN: Efficient Two-stage Evolutionary Architecture Search for GANs. (arXiv:2111.15097v2 [cs.CV] UPDATED)
    Generative adversarial networks (GANs) have proven successful in image generation tasks. However, GAN training is inherently unstable. Although many works try to stabilize it by manually modifying GAN architecture, doing so requires much expertise. Neural architecture search (NAS) has become an attractive solution for searching GANs automatically. Early NAS-GANs search only the generator to reduce search complexity, but this leads to a sub-optimal GAN. Some recent works try to search both generator (G) and discriminator (D), but they suffer from the instability of GAN training. To alleviate the instability, we propose an efficient two-stage evolutionary algorithm-based NAS framework to search GANs, namely EAGAN. We decouple the search of G and D into two stages, where stage-1 searches G with a fixed D and adopts the many-to-one training strategy, and stage-2 searches D with the optimal G found in stage-1 and adopts the one-to-one training and weight-resetting strategies to enhance the stability of GAN training. Both stages use the non-dominated sorting method to produce Pareto-front architectures under multiple objectives (e.g., model size, Inception Score (IS), and Fr\'echet Inception Distance (FID)). EAGAN is applied to the unconditional image generation task and can efficiently finish the search on the CIFAR-10 dataset in 1.2 GPU days. Our searched GANs achieve competitive results (IS=8.81$\pm$0.10, FID=9.91) on the CIFAR-10 dataset and surpass prior NAS-GANs on the STL-10 dataset (IS=10.44$\pm$0.087, FID=22.18). Source code: https://github.com/marsggbo/EAGAN.
    Learning Continuous Grasping Function with a Dexterous Hand from Human Demonstrations. (arXiv:2207.05053v2 [cs.RO] UPDATED)
    We propose to learn to generate grasping motion for manipulation with a dexterous hand using implicit functions. With continuous time inputs, the model can generate a continuous and smooth grasping plan. We name the proposed model Continuous Grasping Function (CGF). CGF is learned via generative modeling with a Conditional Variational Autoencoder using 3D human demonstrations. We first convert the large-scale human-object interaction trajectories to robot demonstrations via motion retargeting, and then use these demonstrations to train CGF. During inference, we perform sampling with CGF to generate different grasping plans in the simulator and select the successful ones to transfer to the real robot. By training on diverse human data, our CGF allows generalization to manipulate multiple objects. Compared to previous planning algorithms, CGF is more efficient and achieves a significant improvement in success rate when transferred to grasping with the real Allegro Hand. Our project page is at https://jianglongye.com/cgf .
    A semi-supervised geometric-driven methodology for supervised fishing activity detection on multi-source AIS tracking messages. (arXiv:2207.05514v1 [cs.LG])
    Automatic Identification System (AIS) messages are useful for tracking vessel activity across oceans worldwide using radio links and satellite transceivers. Such data plays a significant role in tracking vessel activity and mapping mobility patterns such as those found in fishing. Accordingly, this paper proposes a geometric-driven semi-supervised approach for fishing activity detection from AIS data. Through the proposed methodology, we show how to explore the information included in the messages to extract features describing the geometry of the vessel route. To this end, we leverage the unsupervised nature of cluster analysis to label the trajectory geometry, highlighting changes in the vessel's moving pattern that tend to indicate fishing activity. The labels obtained by the proposed unsupervised approach are used to detect fishing activities, which we approach as a time-series classification task. In this context, we propose a solution using recurrent neural networks on AIS data streams, achieving roughly 87% overall $F$-score on the whole trajectories of 50 different unseen fishing vessels. Such results are accompanied by a broad benchmark study assessing the performance of different Recurrent Neural Network (RNN) architectures. In conclusion, this work contributes a thorough process that includes data preparation, labeling, data modeling, and model validation. Therefore, we present a novel solution for mobility pattern detection that relies upon unfolding the trajectory in time and observing its inherent geometry.
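    A sketch of the kind of geometric feature extraction such a methodology relies on: from a sequence of (lat, lon, time) fixes, compute per-segment speed and heading change, the sort of features whose clustering tends to flag fishing-like movement. The haversine formula and the feature set are standard choices, not the paper's exact pipeline.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def route_features(lat, lon, t):
    """Per-segment speed (km/h) and absolute heading change (degrees)."""
    d = haversine_km(lat[:-1], lon[:-1], lat[1:], lon[1:])
    speed = d / (np.diff(t) / 3600.0)
    heading = np.degrees(np.arctan2(np.diff(lon), np.diff(lat)))
    turn = np.abs(np.diff(heading, prepend=heading[0]))
    return np.column_stack([speed, turn])

lat = np.array([41.15, 41.16, 41.16, 41.17])      # toy AIS fixes near Porto
lon = np.array([-8.68, -8.67, -8.66, -8.66])
t = np.array([0.0, 600.0, 1200.0, 1800.0])        # seconds
print(route_features(lat, lon, t))
```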
    Contrastive Learning for Online Semi-Supervised General Continual Learning. (arXiv:2207.05615v1 [cs.LG])
    We study Online Continual Learning with missing labels and propose SemiCon, a new contrastive loss designed for partly labeled data. We demonstrate its efficiency by devising a memory-based method trained on an unlabeled data stream, where every data point added to memory is labeled using an oracle. Our approach outperforms existing semi-supervised methods when few labels are available, and obtains results similar to state-of-the-art supervised methods while using only 2.6% of labels on Split-CIFAR10 and 10% of labels on Split-CIFAR100.
    Insights into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement. (arXiv:2206.13310v2 [eess.AS] UPDATED)
    The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture, that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
    Multi-Model Federated Learning with Provable Guarantees. (arXiv:2207.04330v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a variant of distributed learning where edge devices collaborate to learn a model without sharing their data with the central server or each other. We refer to the process of training multiple independent models simultaneously in a federated setting using a common pool of clients as multi-model FL. In this work, we propose two variants of the popular FedAvg algorithm for multi-model FL, with provable convergence guarantees. We further show that for the same amount of computation, multi-model FL can have better performance than training each model separately. We supplement our theoretical results with experiments in strongly convex, convex, and non-convex settings.
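    A numpy sketch of one plausible multi-model FedAvg variant: each round, the common client pool is partitioned across the models, assigned clients run local steps, and each model averages its own clients' results. The random-partition assignment rule is an assumption; the paper analyzes specific variants with convergence guarantees.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=5):
    for _ in range(steps):                        # local least-squares steps
        w = w - lr * X.T @ (X @ w - y) / len(X)
    return w

def multi_model_fedavg(models, clients, rounds=20, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(rounds):
        shards = np.array_split(rng.permutation(len(clients)), len(models))
        for m, shard in enumerate(shards):        # one shard of clients per model
            updates = [local_update(models[m], *clients[c]) for c in shard]
            models[m] = np.mean(updates, axis=0)  # FedAvg aggregation
    return models

rng = np.random.default_rng(1)
clients = [(rng.normal(size=(30, 4)), rng.normal(size=30)) for _ in range(8)]
print(multi_model_fedavg([np.zeros(4), np.zeros(4)], clients))
```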
    Offline Equilibrium Finding. (arXiv:2207.05285v1 [cs.AI])
    Offline reinforcement learning (Offline RL) is an emerging field that has recently begun gaining attention across various application domains due to its ability to learn behavior from earlier collected datasets. Using logged data is imperative when further interaction with the environment is expensive (computationally or otherwise), unsafe, or entirely unfeasible. Offline RL proved very successful, paving a path to solving previously intractable real-world problems, and we aim to generalize this paradigm to a multi-agent or multiplayer-game setting. Very little research has been done in this area, as the progress is hindered by the lack of standardized datasets and meaningful benchmarks. In this work, we coin the term offline equilibrium finding (OEF) to describe this area and construct multiple datasets consisting of strategies collected across a wide range of games using several established methods. We also propose a benchmark method -- an amalgamation of a behavior-cloning and a model-based algorithm. Our two model-based algorithms -- OEF-PSRO and OEF-CFR -- are adaptations of the widely-used equilibrium finding algorithms Deep CFR and PSRO in the context of offline learning. In the empirical part, we evaluate the performance of the benchmark algorithms on the constructed datasets. We hope that our efforts may help to accelerate research in large-scale equilibrium finding. Datasets and code are available at https://github.com/SecurityGames/oef.
    Inner Monologue: Embodied Reasoning through Planning with Language Models. (arXiv:2207.05608v1 [cs.RO])
    Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.
    Label-Efficient Self-Supervised Speaker Verification With Information Maximization and Contrastive Learning. (arXiv:2207.05506v1 [eess.AS])
    State-of-the-art speaker verification systems are inherently dependent on some kind of human supervision as they are trained on massive amounts of labeled data. However, manually annotating utterances is slow, expensive and not scalable to the amount of data available today. In this study, we explore self-supervised learning for speaker verification by learning representations directly from raw audio. The objective is to produce robust speaker embeddings that have small intra-speaker and large inter-speaker variance. Our approach is based on recent information maximization learning frameworks and an intensive data augmentation pre-processing step. We evaluate the ability of these methods to work without contrastive samples before showing that they achieve better performance when combined with a contrastive loss. Furthermore, we conduct experiments to show that our method reaches competitive results compared to existing techniques and can get better performances compared to a supervised baseline when fine-tuned with a small portion of labeled data.
    Truly Sparse Neural Networks at Scale. (arXiv:2102.01732v2 [cs.LG] UPDATED)
    Recently, sparse training methods have started to be established as a de facto approach for training and inference efficiency in artificial neural networks. Yet, this efficiency exists only in theory. In practice, everyone uses a binary mask to simulate sparsity since the typical deep learning software and hardware are optimized for dense matrix operations. In this paper, we take an orthogonal approach, and we show that we can train truly sparse neural networks to harvest their full potential. To achieve this goal, we introduce three novel contributions, specially designed for sparse neural networks: (1) a parallel training algorithm and its corresponding sparse implementation from scratch, (2) an activation function with non-trainable parameters to favour the gradient flow, and (3) a hidden neurons importance metric to eliminate redundancies. Altogether, we are able to break the record and train the largest neural network ever trained in terms of representational power -- reaching the size of a bat brain. The results show that our approach has state-of-the-art performance while opening the path for an environmentally friendly artificial intelligence era.
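    A sketch of the "truly sparse" idea at inference time: store layer weights as scipy.sparse matrices so that memory and compute scale with the number of nonzeros rather than with the dense dimensions. The paper's parallel training algorithm, activation function, and neuron-importance metric are not shown.

```python
import numpy as np
from scipy import sparse

def sparse_mlp_forward(x, layers):
    for W in layers:                              # W: sparse (out, in) matrix
        x = np.maximum(W @ x, 0.0)                # ReLU
    return x

layers = [sparse.random(256, 512, density=0.02, random_state=0, format="csr"),
          sparse.random(10, 256, density=0.05, random_state=1, format="csr")]
x = np.random.default_rng(0).normal(size=512)
print(sparse_mlp_forward(x, layers).shape)        # (10,)
print([W.nnz for W in layers])                    # parameters actually stored
```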
    Propagating State Uncertainty Through Trajectory Forecasting. (arXiv:2110.03267v4 [cs.RO] UPDATED)
    Uncertainty pervades through the modern robotic autonomy stack, with nearly every component (e.g., sensors, detection, classification, tracking, behavior prediction) producing continuous or discrete probabilistic distributions. Trajectory forecasting, in particular, is surrounded by uncertainty as its inputs are produced by (noisy) upstream perception and its outputs are predictions that are often probabilistic for use in downstream planning. However, most trajectory forecasting methods do not account for upstream uncertainty, instead taking only the most-likely values. As a result, perceptual uncertainties are not propagated through forecasting and predictions are frequently overconfident. To address this, we present a novel method for incorporating perceptual state uncertainty in trajectory forecasting, a key component of which is a new statistical distance-based loss function which encourages predicting uncertainties that better match upstream perception. We evaluate our approach both in illustrative simulations and on large-scale, real-world data, demonstrating its efficacy in propagating perceptual state uncertainty through prediction and producing more calibrated predictions.
    High-dimensional Inference for Dynamic Treatment Effects. (arXiv:2110.04924v3 [stat.ME] UPDATED)
    This paper proposes a confidence interval construction for heterogeneous treatment effects in the context of multi-stage experiments with $N$ samples and high-dimensional, $d$, confounders. Our focus is on the case of $d\gg N$, but the results obtained also apply to low-dimensional cases. We showcase that the bias of regularized estimation, unavoidable in high-dimensional covariate spaces, is mitigated with a simple double-robust score. In this way, no additional bias removal is necessary, and we obtain root-$N$ inference results while allowing multi-stage interdependency of the treatments and covariates. The memoryless property is also not assumed; treatment can possibly depend on all previous treatment assignments and all previous multi-stage confounders. Our results rely on certain sparsity assumptions of the underlying dependencies. We discover new product rate conditions necessary for robust inference with dynamic treatments.
    Grounding Aleatoric Uncertainty in Unsupervised Environment Design. (arXiv:2207.05219v1 [cs.LG])
    Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
    Structure-Enhanced Pop Music Generation via Harmony-Aware Learning. (arXiv:2109.06441v2 [cs.SD] UPDATED)
    Pop music generation has long been an attractive topic for both musicians and scientists. However, automatically composing pop music with a satisfactory structure is still a challenging issue. In this paper, we propose to leverage harmony-aware learning for structure-enhanced pop music generation. On the one hand, one participant of harmony, the chord, represents the harmonic set of multiple notes, which is integrated closely with the spatial structure of music, the texture. On the other hand, the other participant of harmony, chord progression, usually accompanies the development of the music, which promotes the temporal structure of music, the form. Moreover, when chords evolve into a chord progression, the texture and form can be bridged by harmony naturally, which contributes to the joint learning of the two structures. Furthermore, we propose the Harmony-Aware Hierarchical Music Transformer (HAT), which can exploit the structure adaptively from the music and make musical tokens interact hierarchically to enhance the structure at multiple levels of musical elements. Experimental results reveal that, compared to existing methods, HAT has a much better understanding of structure and can also improve the quality of the generated music, especially in form and texture.
    Sliced-Wasserstein normalizing flows: beyond maximum likelihood training. (arXiv:2207.05468v1 [stat.ML])
    Despite their advantages, normalizing flows generally suffer from several shortcomings, including their tendency to generate unrealistic data (e.g., images) and their failure to detect out-of-distribution data. One reason for these deficiencies lies in the training strategy, which traditionally exploits only a maximum likelihood principle. This paper proposes a new training paradigm based on a hybrid objective function combining the maximum likelihood principle (MLE) and a sliced-Wasserstein distance. Results obtained on synthetic toy examples and real image data sets show better generative abilities in terms of both likelihood and visual aspects of the generated samples. Reciprocally, the proposed approach leads to a lower likelihood of out-of-distribution data, demonstrating a greater data fidelity of the resulting flows.
    Markovian Gaussian Process Variational Autoencoders. (arXiv:2207.05543v1 [cs.LG])
    Deep generative models are widely used for modelling high-dimensional time series, such as video animations, audio and climate data. Sequential variational autoencoders have been successfully considered for many applications, with many variant models relying on discrete-time methods and recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained traction, especially in the context of irregularly-sampled time series, where they can handle the data better than discrete-time methods. One such class are Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP), allowing inductive biases to be explicitly encoded via the kernel function and enabling interpretability of the latent space. However, a major limitation of GPVAEs is that they inherit the same cubic computational cost as GPs. In this work, we leverage the equivalent discrete state space representation of Markovian GPs to enable a linear-time GP solver via Kalman filtering and smoothing. We show via corrupt and missing frames tasks that our method performs favourably, especially on the latter where it outperforms RNN-based models.
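    A sketch of the linear-time idea for the simplest case: a GP with a Matern-1/2 (Ornstein-Uhlenbeck) kernel has an exact one-dimensional state-space form, so its posterior marginals follow from an ordinary Kalman filter in O(T) rather than an O(T^3) GP solve. Kernel hyperparameters and the noise level are illustrative assumptions.

```python
import numpy as np

def ou_kalman_filter(t, y, ell=1.0, var=1.0, noise=0.1):
    """Filtering marginals of a Matern-1/2 GP observed with Gaussian noise."""
    m, P = 0.0, var                               # stationary prior state
    means, covs = [], []
    for k in range(len(y)):
        if k > 0:                                 # predict through the OU SDE
            A = np.exp(-(t[k] - t[k - 1]) / ell)
            m, P = A * m, A * A * P + var * (1 - A * A)
        S = P + noise                             # innovation variance
        K = P / S                                 # Kalman gain
        m, P = m + K * (y[k] - m), (1 - K) * P    # update with y[k]
        means.append(m)
        covs.append(P)
    return np.array(means), np.array(covs)

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, 50))               # irregularly-sampled grid
y = np.sin(t) + 0.1 * rng.normal(size=50)
m, P = ou_kalman_filter(t, y)
print(m[:3], P[:3])
```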
    A Newton-CG based barrier method for finding a second-order stationary point of nonconvex conic optimization with complexity guarantees. (arXiv:2207.05697v1 [math.OC])
    In this paper we consider finding an approximate second-order stationary point (SOSP) of nonconvex conic optimization that minimizes a twice differentiable function over the intersection of an affine subspace and a convex cone. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier method for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of this problem. Our method is not only implementable, but also achieves an iteration complexity of ${\cal O}(\epsilon^{-3/2})$, which matches the best known iteration complexity of second-order methods for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of unconstrained nonconvex optimization. The operation complexity of $\widetilde{\cal O}(\epsilon^{-3/2}\min\{n,\epsilon^{-1/4}\})$, measured by the amount of fundamental operations, is also established for our method.
    The Neural-Prediction based Acceleration Algorithm of Column Generation for Graph-Based Set Covering Problems. (arXiv:2207.01411v2 [cs.LG] UPDATED)
    The set covering problem is an important class of combinatorial optimization problems, which has been widely applied and studied in many fields. In this paper, we propose an improved column generation algorithm with neural prediction (CG-P) for solving graph-based set covering problems. We leverage a graph neural network based prediction model to predict, for each edge, the probability of being included in the final solution. Our CG-P algorithm constructs a reduced graph that contains only the edges with higher predicted probability, and this graph reduction process significantly speeds up the solution process. We evaluate the CG-P algorithm on railway crew scheduling problems, where it outperforms the baseline column generation algorithm. We provide two solution modes for our CG-P algorithm. In the optimal mode, we can obtain a solution with an optimality guarantee while reducing the time cost to 63.12%. In the fast mode, we can obtain a sub-optimal solution with a 7.62% optimality gap in only 2.91% of the computation time.
    Joint NMF for Identification of Shared Features in Datasets and a Dataset Distance Measure. (arXiv:2207.05112v1 [cs.LG])
    In this paper, we derive a new method for determining shared features of datasets by employing joint non-negative matrix factorization and analyzing the resulting factorizations. Our approach uses the joint factorization of two dataset matrices $X_1,X_2$ into non-negative matrices $X_1 = AS_1, X_2 = AS_2$ to derive a similarity measure that determines how well a shared basis for $X_1, X_2$ approximates each dataset. We also propose a dataset distance measure built upon this method and the learned factorization. Our method is able to successfully identify differences in structure in both image and text datasets. Potential applications include classification, detecting plagiarism or other manipulation, and learning relationships between data sets.
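    A sketch of the joint factorization at the heart of the method: stacking the two datasets column-wise, [X1 X2] = A [S1 S2], reduces joint NMF with a shared basis A to standard multiplicative-update NMF. The rank and iteration count are illustrative; the paper's similarity and distance measures are built on top of the resulting factors.

```python
import numpy as np

def joint_nmf(X1, X2, rank, iters=500, rng=None):
    rng = rng or np.random.default_rng(0)
    X = np.hstack([X1, X2])                       # [X1 X2] = A [S1 S2]
    A = rng.random((X.shape[0], rank))
    S = rng.random((rank, X.shape[1]))
    for _ in range(iters):                        # Lee-Seung multiplicative updates
        S *= (A.T @ X) / (A.T @ A @ S + 1e-12)
        A *= (X @ S.T) / (A @ S @ S.T + 1e-12)
    return A, S[:, :X1.shape[1]], S[:, X1.shape[1]:]

rng = np.random.default_rng(1)
X1, X2 = rng.random((20, 30)), rng.random((20, 25))
A, S1, S2 = joint_nmf(X1, X2, rank=5)
print(np.linalg.norm(X1 - A @ S1), np.linalg.norm(X2 - A @ S2))
```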
    Revisiting Inlier and Outlier Specification for Improved Out-of-Distribution Detection. (arXiv:2207.05286v1 [cs.CV])
    Accurately detecting out-of-distribution (OOD) data with varying levels of semantic and covariate shifts with respect to the in-distribution (ID) data is critical for deployment of safe and reliable models. This is particularly the case when dealing with highly consequential applications (e.g. medical imaging, self-driving cars, etc). The goal is to design a detector that can accept meaningful variations of the ID data, while also rejecting examples from OOD regimes. In practice, this dual objective can be realized by enforcing consistency using an appropriate scoring function (e.g., energy) and calibrating the detector to reject a curated set of OOD data (referred to as outlier exposure or shortly OE). While OE methods are widely adopted, assembling representative OOD datasets is both costly and challenging due to the unpredictability of real-world scenarios, hence the recent trend of designing OE-free detectors. In this paper, we make a surprising finding that controlled generalization to ID variations and exposure to diverse (synthetic) outlier examples are essential to simultaneously improving semantic and modality shift detection. In contrast to existing methods, our approach samples inliers in the latent space, and constructs outlier examples via negative data augmentation. Through a rigorous empirical study on medical imaging benchmarks (MedMNIST, ISIC2019 and NCT), we demonstrate significant performance gains ($15\% - 35\%$ in AUROC) over existing OE-free, OOD detection approaches under both semantic and modality shifts.
    PAC-Bayesian Domain Adaptation Bounds for Multiclass Learners. (arXiv:2207.05685v1 [cs.LG])
    Multiclass neural networks are a common tool in modern unsupervised domain adaptation, yet an appropriate theoretical description for their non-uniform sample complexity is lacking in the adaptation literature. To fill this gap, we propose the first PAC-Bayesian adaptation bounds for multiclass learners. We facilitate practical use of our bounds by also proposing the first approximation techniques for the multiclass distribution divergences we consider. For divergences dependent on a Gibbs predictor, we propose additional PAC-Bayesian adaptation bounds which remove the need for inefficient Monte-Carlo estimation. Empirically, we test the efficacy of our proposed approximation techniques as well as some novel design-concepts which we include in our bounds. Finally, we apply our bounds to analyze a common adaptation algorithm that uses neural networks.
    Adversarial Robustness Assessment of NeuroEvolution Approaches. (arXiv:2207.05451v1 [cs.NE])
    NeuroEvolution automates the generation of Artificial Neural Networks through the application of techniques from Evolutionary Computation. The main goal of these approaches is to build models that maximize predictive performance, sometimes with an additional objective of minimizing computational complexity. Although the evolved models achieve competitive results performance-wise, their robustness to adversarial examples, which becomes a concern in security-critical scenarios, has received limited attention. In this paper, we evaluate the adversarial robustness of models found by two prominent NeuroEvolution approaches on the CIFAR-10 image classification task: DENSER and NSGA-Net. Since the models are publicly available, we consider white-box untargeted attacks, where the perturbations are bounded by either the L2 or the Linfinity-norm. Similarly to manually-designed networks, our results show that when the evolved models are attacked with iterative methods, their accuracy usually drops to, or close to, zero under both distance metrics. The DENSER model is an exception to this trend, showing some resistance under the L2 threat model, where its accuracy only drops from 93.70% to 18.10% even with iterative attacks. Additionally, we analyzed the impact of pre-processing applied to the data before the first layer of the network. Our observations suggest that some of these techniques can exacerbate the perturbations added to the original inputs, potentially harming robustness. Thus, this choice should not be neglected when automatically designing networks for applications where adversarial attacks are prone to occur.
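    For reference, a minimal sketch of the kind of iterative white-box attack used in such evaluations: L-infinity PGD against any differentiable classifier. The epsilon, step size, and step count are common illustrative values; the model here is a placeholder, not DENSER or NSGA-Net.

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()      # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)          # project to eps-ball
        x_adv = x_adv.clamp(0, 1)                         # keep a valid image
    return x_adv.detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = pgd_linf(model, x, y)
print((model(x).argmax(1) == y).float().mean().item(),
      (model(x_adv).argmax(1) == y).float().mean().item())
```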
    A Computational Model for Logical Analysis of Data. (arXiv:2207.05664v1 [cs.LG])
    Initially introduced by Peter Hammer, Logical Analysis of Data (LAD) is a methodology that aims at computing a logical justification for dividing a set of data into two groups of observations, usually called the positive and negative groups. Consider this partition into positive and negative groups as the description of a partially defined Boolean function; the data is then processed to identify a subset of attributes whose values may be used to characterize the observations of the positive group against those of the negative group. LAD constitutes an interesting rule-based learning alternative to classic statistical learning techniques and has many practical applications. Nevertheless, the computation of group characterizations may be costly, depending on the properties of the data instances. A major aim of our work is to provide effective tools for speeding up the computations, by computing some a priori probability that a given set of attributes does characterize the positive and negative groups. To this effect, we propose several models for representing the data set of observations, according to the information we have on it. These models, and the probabilities they allow us to compute, are also helpful for quickly assessing some properties of the real data at hand; furthermore, they may help us to better analyze and understand the computational difficulties encountered by solving methods. Once our models have been established, the mathematical tools for computing probabilities come from Analytic Combinatorics. They allow us to express the desired probabilities as ratios of generating function coefficients, which then provide a quick computation of their numerical values. A further, long-range goal of this paper is to show that the methods of Analytic Combinatorics can help in analyzing the performance of various algorithms in LAD and related fields.  ( 3 min )
    A developmental approach for training deep belief networks. (arXiv:2207.05473v1 [cs.LG])
    Deep belief networks (DBNs) are stochastic neural networks that can extract rich internal representations of the environment from sensory data. DBNs had a catalytic effect in triggering the deep learning revolution, demonstrating for the very first time the feasibility of unsupervised learning in networks with many layers of hidden neurons. Thanks to their biological and cognitive plausibility, these hierarchical architectures have also been successfully exploited to build computational models of human perception and cognition in a variety of domains. However, learning in DBNs is usually carried out in a greedy, layer-wise fashion, which does not allow simulating the holistic development of cortical circuits. Here we present iDBN, an iterative learning algorithm for DBNs that jointly updates the connection weights across all layers of the hierarchy. We test our algorithm on two different sets of visual stimuli, and we show that network development can also be tracked in terms of graph theoretical properties. DBNs trained using our iterative approach achieve a final performance comparable to that of their greedy counterparts, while at the same time allowing us to accurately analyze the gradual development of internal representations in the generative model. Our work paves the way to the use of iDBN for modeling neurocognitive development.  ( 2 min )
    Bootstrapping a User-Centered Task-Oriented Dialogue System. (arXiv:2207.05223v1 [cs.CL])
    We present TacoBot, a task-oriented dialogue system built for the inaugural Alexa Prize TaskBot Challenge, which assists users in completing multi-step cooking and home improvement tasks. TacoBot is designed with a user-centered principle and aspires to deliver a collaborative and accessible dialogue experience. Towards that end, it is equipped with accurate language understanding, flexible dialogue management, and engaging response generation. Furthermore, TacoBot is backed by a strong search engine and an automated end-to-end test suite. In bootstrapping the development of TacoBot, we explore a series of data augmentation strategies to train advanced neural language processing models and continuously improve the dialogue experience with collected real conversations. At the end of the semifinals, TacoBot achieved an average rating of 3.55/5.0.  ( 2 min )
    Language Models (Mostly) Know What They Know. (arXiv:2207.05221v1 [cs.CL])
    We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.  ( 3 min )
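    The P(True) procedure described above is easy to express in code. The sketch below is a hedged illustration: lm_generate and lm_logprob are hypothetical stand-ins for sampling an answer and scoring a continuation token with a language model, not any particular API:

        import math

        def p_true(question, lm_generate, lm_logprob, n_samples=5):
            # Sample several candidate answers, then ask the model to judge one of them.
            samples = [lm_generate(question) for _ in range(n_samples)]
            proposed = samples[0]
            prompt = (
                f"Question: {question}\n"
                f"Brainstormed answers: {'; '.join(samples)}\n"
                f"Proposed answer: {proposed}\n"
                "Is the proposed answer true? Answer True or False:"
            )
            # P(True) = the model's probability of " True" as the next token.
            return math.exp(lm_logprob(prompt, " True"))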
    Learning an evolved mixture model for task-free continual learning. (arXiv:2207.05080v1 [cs.LG])
    Recently, continual learning (CL) has gained significant interest because it enables deep learning models to acquire new knowledge without forgetting previously learnt information. However, most existing works require knowing the task identities and boundaries, which is not realistic in a real context. In this paper, we address a more challenging and realistic setting in CL, namely the Task-Free Continual Learning (TFCL) in which a model is trained on non-stationary data streams with no explicit task information. To address TFCL, we introduce an evolved mixture model whose network architecture is dynamically expanded to adapt to the data distribution shift. We implement this expansion mechanism by evaluating the probability distance between the knowledge stored in each mixture model component and the current memory buffer using the Hilbert Schmidt Independence Criterion (HSIC). We further introduce two simple dropout mechanisms to selectively remove stored examples in order to avoid memory overload while preserving memory diversity. Empirical results demonstrate that the proposed approach achieves excellent performance.  ( 2 min )
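    As a concrete reference for the probability distance mentioned above, here is one standard biased empirical HSIC estimator with RBF kernels (a minimal NumPy sketch; the paper's exact kernels and its use of HSIC inside the expansion criterion may differ):

        import numpy as np

        def rbf_gram(X, sigma=1.0):
            sq = np.sum(X ** 2, axis=1)
            d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
            return np.exp(-d2 / (2 * sigma ** 2))

        def hsic(X, Y, sigma=1.0):
            # Biased estimator: trace(K H L H) / (n - 1)^2, with H = I - (1/n) 11^T.
            n = X.shape[0]
            H = np.eye(n) - np.ones((n, n)) / n
            K, L = rbf_gram(X, sigma), rbf_gram(Y, sigma)
            return np.trace(K @ H @ L @ H) / (n - 1) ** 2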
    Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning. (arXiv:2207.05480v1 [cs.LG])
    In real-world robotics applications, Reinforcement Learning (RL) agents are often unable to generalise to environment variations that were not observed during training. This issue is intensified for image-based RL where a change in one variable, such as the background colour, can change many pixels in the image, and in turn can change all values in the agent's internal representation of the image. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled representations using the sequential nature of RL observations. We find empirically that RL algorithms with TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Due to the disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).  ( 2 min )
    Causal Conceptions of Fairness and their Consequences. (arXiv:2207.05302v1 [cs.LG])
    Recent work highlights the role of causality in designing equitable decision-making algorithms. It is not immediately clear, however, how existing causal conceptions of fairness relate to one another, or what the consequences are of using these definitions as design principles. Here, we first assemble and categorize popular causal definitions of algorithmic fairness into two broad families: (1) those that constrain the effects of decisions on counterfactual disparities; and (2) those that constrain the effects of legally protected characteristics, like race and gender, on decisions. We then show, analytically and empirically, that both families of definitions \emph{almost always} -- in a measure theoretic sense -- result in strongly Pareto dominated decision policies, meaning there is an alternative, unconstrained policy favored by every stakeholder with preferences drawn from a large, natural class. For example, in the case of college admissions decisions, policies constrained to satisfy causal fairness definitions would be disfavored by every stakeholder with neutral or positive preferences for both academic preparedness and diversity. Indeed, under a prominent definition of causal fairness, we prove the resulting policies require admitting all students with the same probability, regardless of academic qualifications or group membership. Our results highlight formal limitations and potential adverse consequences of common mathematical notions of causal fairness.  ( 3 min )
    Dev2vec: Representing Domain Expertise of Developers in an Embedding Space. (arXiv:2207.05132v1 [cs.SE])
    Accurate assessment of the domain expertise of developers is important for assigning the proper candidate to contribute to a project or to fill a job role. Since the potential candidates can come from a large pool, automated assessment of this domain expertise is a desirable goal. While previous methods have had some success within a single software project, assessing a developer's domain expertise from contributions across multiple projects is more challenging. In this paper, we employ doc2vec to represent the domain expertise of developers as embedding vectors. These vectors are derived from different sources that contain evidence of developers' expertise, such as the descriptions of repositories they contributed to, their issue-resolving history, and API calls in their commits. We name this approach dev2vec and demonstrate its effectiveness in representing the technical specialization of developers. Our results indicate that encoding the expertise of developers in an embedding vector outperforms state-of-the-art methods and improves the F1-score by up to 21%. Moreover, our findings suggest that the "issue resolving history" of developers is the most informative source of information for representing their domain expertise in embedding spaces.  ( 2 min )
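    To make the embedding step concrete, here is a minimal gensim-based sketch (the developer names and evidence strings are toy placeholders; the paper's actual preprocessing of repositories, issues, and commits is richer):

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # One "document" per developer, aggregating textual evidence of expertise.
        corpus = [
            TaggedDocument(words="fix memory leak in tensor allocator".split(), tags=["dev_alice"]),
            TaggedDocument(words="add oauth2 login endpoint to rest api".split(), tags=["dev_bob"]),
        ]
        model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=40)
        alice_vec = model.dv["dev_alice"]   # embedding representing this developer's expertise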
    A Single-Loop Gradient Descent and Perturbed Ascent Algorithm for Nonconvex Functional Constrained Optimization. (arXiv:2207.05650v1 [math.OC])
    Nonconvex constrained optimization problems can be used to model a number of machine learning problems, such as multi-class Neyman-Pearson classification and constrained Markov decision processes. However, such problems are challenging because both the objective and the constraints are possibly nonconvex, so it is difficult to balance reducing the loss value against reducing constraint violation. Although there are a few methods that solve this class of problems, all of them are double-loop or triple-loop algorithms, and they require oracles to solve some subproblems up to certain accuracy by tuning multiple hyperparameters at each iteration. In this paper, we propose a novel gradient descent and perturbed ascent (GDPA) algorithm to solve a class of smooth nonconvex inequality constrained problems. GDPA is a primal-dual algorithm which only exploits first-order information of both the objective and constraint functions to update the primal and dual variables in an alternating way. The key feature of the proposed algorithm is that it is a single-loop algorithm, where only two step-sizes need to be tuned. We show that under a mild regularity condition GDPA is able to find Karush-Kuhn-Tucker (KKT) points of nonconvex functional constrained problems with convergence rate guarantees. To the best of our knowledge, it is the first single-loop algorithm that can solve general nonconvex smooth problems with nonconvex inequality constraints. Numerical results also showcase the superiority of GDPA compared with the best-known algorithms (in terms of both the stationarity measure and the feasibility of the obtained solutions).  ( 3 min )
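    To fix ideas, a generic single-loop primal-dual update for min_x f(x) subject to g(x) <= 0 looks as follows (an illustrative sketch of the descent/ascent pattern only, not the exact GDPA update rule or its perturbation):

        import numpy as np

        def single_loop_primal_dual(f_grad, g, g_grad, x0, lr_x=0.01, lr_lam=0.05, iters=2000):
            x, lam = x0.astype(float).copy(), 0.0
            for _ in range(iters):
                # Primal: gradient descent on the Lagrangian L(x, lam) = f(x) + lam * g(x).
                x -= lr_x * (f_grad(x) + lam * g_grad(x))
                # Dual: projected gradient ascent on the multiplier (kept non-negative).
                lam = max(0.0, lam + lr_lam * g(x))
            return x, lam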
    FreeREA: Training-Free Evolution-based Architecture Search. (arXiv:2207.05135v1 [cs.NE])
    In the last decade, most research in Machine Learning has contributed to the improvement of existing models, with the aim of increasing the performance of neural networks for the solution of a variety of different tasks. However, such advancements often come at the cost of increased model memory and computational requirements. This represents a significant limitation for the deployability of research output in realistic settings, where the cost, the energy consumption, and the complexity of the framework play a crucial role. To solve this issue, the designer should search for models that maximise performance while limiting their footprint. Typical approaches to reach this goal rely either on manual procedures, which cannot guarantee the optimality of the final design, or upon Neural Architecture Search algorithms to automatise the process, at the expense of extremely high computational time. This paper provides a solution for the fast identification of a neural network that maximises model accuracy while preserving the size and computational constraints typical of tiny devices. Our approach, named FreeREA, is a custom cell-based evolution NAS algorithm that exploits an optimised combination of training-free metrics to rank architectures during the search, thus without the need for model training. Our experiments, carried out on the common benchmarks NAS-Bench-101 and NATS-Bench, demonstrate that i) FreeREA is the first method able to provide very accurate models in minutes of search time; ii) it outperforms State of the Art training-based and training-free techniques on all the datasets and benchmarks considered; and iii) it can easily generalise to constrained scenarios, representing a competitive solution for fast Neural Architecture Search in generic constrained applications.  ( 3 min )
    Split Time Series into Patches: Rethinking Long-term Series Forecasting with Dateformer. (arXiv:2207.05397v1 [cs.LG])
    Time is one of the most significant characteristics of time-series, yet it has received insufficient attention. Prior time-series forecasting research has mainly focused on mapping a past subseries (lookback window) to a future series (forecast window), while the timestamps of a series often play only an auxiliary role, and are even completely ignored in most cases. Due to the point-wise processing within these windows, extrapolating series to the longer-term future is difficult under this paradigm. To overcome this barrier, we propose a brand-new time-series forecasting framework named Dateformer, which turns its attention to modeling time instead of following the above practice. Specifically, time-series are first split into patches by day to supervise the learning of dynamic date-representations with Date Encoder Representations from Transformers (DERT). These representations are then fed into a simple decoder to produce a coarser (or global) prediction, and used to help the model seek valuable information from the lookback window to learn a refined (or local) prediction. Dateformer obtains the final result by summing the above two parts. Our empirical studies on seven benchmarks show that the time-modeling method is more efficient for long-term series forecasting compared with sequence modeling methods. Dateformer yields state-of-the-art accuracy with a remarkable 40% relative improvement, and broadens the maximum credible forecasting range to a half-yearly level.  ( 2 min )
    LightViT: Towards Light-Weight Convolution-Free Vision Transformers. (arXiv:2207.05557v1 [cs.CV])
    Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would actually be unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT as a new family of light-weight ViTs that achieve a better accuracy-efficiency balance upon pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both the self-attention and feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies, and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.  ( 2 min )
    Hybrid Physical-Neural ODEs for Fast N-body Simulations. (arXiv:2207.05509v1 [astro-ph.CO])
    We present a new scheme to compensate for the small-scale approximations resulting from Particle-Mesh (PM) schemes for cosmological N-body simulations. These simulations are fast, low-cost realizations of the large-scale structure, but lack resolution on small scales. To improve their accuracy, we introduce an additional effective force within the differential equations of the simulation, parameterized by a Fourier-space Neural Network acting on the PM-estimated gravitational potential. We compare the resulting matter power spectrum to the one obtained with the potential gradient descent (PGD) scheme. We notice a similar improvement in terms of power spectrum, but we find that our approach outperforms PGD on the cross-correlation coefficients, and is more robust to changes in simulation settings (different resolutions, different cosmologies).  ( 2 min )
    Size and depth of monotone neural networks: interpolation and approximation. (arXiv:2207.05275v1 [cs.LG])
    Monotone functions and data sets arise in a variety of applications. We study the interpolation problem for monotone data sets: The input is a monotone data set with $n$ points, and the goal is to find a size- and depth-efficient monotone neural network, with non-negative parameters and threshold units, that interpolates the data set. We show that there are monotone data sets that cannot be interpolated by a monotone network of depth $2$. On the other hand, we prove that for every monotone data set with $n$ points in $\mathbb{R}^d$, there exists an interpolating monotone network of depth $4$ and size $O(nd)$. Our interpolation result implies that every monotone function over $[0,1]^d$ can be approximated arbitrarily well by a depth-4 monotone network, improving the previous best-known construction of depth $d+1$. Finally, building on results from Boolean circuit complexity, we show that the inductive bias of having positive parameters can lead to a super-polynomial blow-up in the number of neurons when approximating monotone functions.  ( 2 min )
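    The structural constraint the paper works with is simple to state in code: with threshold units and non-negative weights, increasing any input coordinate can only increase pre-activations, so every layer (and hence the network) is monotone. A toy depth-2 forward pass illustrating the constraint (illustrative only; the paper's depth-4 interpolation construction is more involved):

        import numpy as np

        def monotone_forward(x, W1, b1, W2, b2):
            # np.abs(.) enforces the non-negative-parameter constraint on weights;
            # biases (thresholds) may take any sign without breaking monotonicity.
            h = (x @ np.abs(W1) + b1 > 0).astype(float)   # threshold units
            return h @ np.abs(W2) + b2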
    On the Representation of Causal Background Knowledge and its Applications in Causal Inference. (arXiv:2207.05067v1 [cs.AI])
    Causal background knowledge about the existence or the absence of causal edges and paths is frequently encountered in observational studies. The shared directed edges and links of a subclass of Markov equivalent DAGs refined due to background knowledge can be represented by a causal maximally partially directed acyclic graph (MPDAG). In this paper, we first provide a sound and complete graphical characterization of causal MPDAGs and give a minimal representation of a causal MPDAG. Then, we introduce a novel representation called direct causal clause (DCC) to represent all types of causal background knowledge in a unified form. Using DCCs, we study the consistency and equivalency of causal background knowledge and show that any causal background knowledge set can be equivalently decomposed into a causal MPDAG plus a minimal residual set of DCCs. Polynomial-time algorithms are also provided for checking the consistency, equivalency, and finding the decomposed MPDAG and residual DCCs. Finally, with causal background knowledge, we prove a sufficient and necessary condition to identify causal effects and surprisingly find that the identifiability of causal effects only depends on the decomposed MPDAG. We also develop a local IDA-type algorithm to estimate the possible values of an unidentifiable effect. Simulations suggest that causal background knowledge can significantly improve the identifiability of causal effects.  ( 3 min )
    "Why do so?" -- A Practical Perspective on Machine Learning Security. (arXiv:2207.05164v1 [cs.LG])
    Despite the large body of academic work on machine learning security, little is known about the occurrence of attacks on machine learning systems in the wild. In this paper, we report on a quantitative study with 139 industrial practitioners. We analyze attack occurrence and concern and evaluate statistical hypotheses on factors influencing threat perception and exposure. Our results shed light on real-world attacks on deployed machine learning. On the organizational level, while we find no predictors for threat exposure in our sample, the number of implemented defenses depends on exposure to threats or the expected likelihood of becoming a target. We also provide a detailed analysis of practitioners' replies on the relevance of individual machine learning attacks, unveiling complex concerns like unreliable decision making, business information leakage, and bias introduction into models. Finally, we find that on the individual level, prior knowledge about machine learning security influences threat perception. Our work paves the way for more research about adversarial machine learning in practice, but also yields insights for regulation and auditing.  ( 2 min )
    Learning to segment prostate cancer by aggressiveness from scribbles in bi-parametric MRI. (arXiv:2207.05056v1 [eess.IV])
    In this work, we propose a deep U-Net based model to tackle the challenging task of prostate cancer segmentation by aggressiveness in MRI based on weak scribble annotations. This model extends the size-constraint loss proposed by Kervadec et al. to the multiclass detection and segmentation setting. The model is of high clinical interest as it allows training on prostate biopsy samples and avoids the time-consuming full annotation process. Performance is assessed on a private dataset (219 patients) where the full ground truth is available, as well as on the ProstateX-2 challenge database, where only biopsy results at different localisations serve as reference. We show that we can approach the fully-supervised baseline in grading the lesions by using only 6.35% of voxels for training. We report a lesion-wise Cohen's kappa score of 0.29 $\pm$ 0.07 for the weak model versus 0.32 $\pm$ 0.05 for the baseline. We also report a kappa score (0.276 $\pm$ 0.037) on the ProstateX-2 challenge dataset with our weak U-Net trained on a combination of ProstateX-2 and our dataset, which, to our knowledge, is the highest reported value on this challenge dataset for a segmentation task.  ( 3 min )
    Accelerating Large-Scale Graph-based Nearest Neighbor Search on a Computational Storage Platform. (arXiv:2207.05241v1 [cs.AR])
    K-nearest neighbor search is one of the fundamental tasks in various applications, and the hierarchical navigable small world (HNSW) graph has recently drawn attention in large-scale cloud services, as it easily scales up the database while offering fast search. On the other hand, a computational storage device (CSD) that combines programmable logic and storage modules on a single board has become popular to address the data bandwidth bottleneck of modern computing systems. In this paper, we propose a computational storage platform that can accelerate a large-scale graph-based nearest neighbor search algorithm using SmartSSD CSDs. To this end, we modify the algorithm to be more amenable to hardware and implement two types of accelerators using HLS- and RTL-based methodologies with various optimization methods. In addition, we scale the proposed platform up to 4 SmartSSDs and apply graph parallelism to boost system performance further. As a result, the proposed computational storage platform achieves a throughput of 75.59 queries per second on the SIFT1B dataset at 258.66W power dissipation, which is 12.83x and 17.91x faster, and 10.43x and 24.33x more energy efficient, than conventional CPU-based and GPU-based server platforms, respectively. With multi-terabyte storage and custom acceleration capability, we believe the proposed computational storage platform is a promising solution for cost-sensitive cloud datacenters.  ( 3 min )
    Fourier Neural Operator with Learned Deformations for PDEs on General Geometries. (arXiv:2207.05209v1 [cs.LG])
    Deep learning surrogate models have shown promise in solving partial differential equations (PDEs). Among them, the Fourier neural operator (FNO) achieves good accuracy, and is significantly faster than numerical solvers, on a variety of PDEs, such as fluid flows. However, the FNO uses the Fast Fourier transform (FFT), which is limited to rectangular domains with uniform grids. In this work, we propose a new framework, viz., geo-FNO, to solve PDEs on arbitrary geometries. Geo-FNO learns to deform the input (physical) domain, which may be irregular, into a latent space with a uniform grid. The FNO model with the FFT is applied in the latent space. The resulting geo-FNO model has both the computational efficiency of the FFT and the flexibility of handling arbitrary geometries. Our geo-FNO is also flexible in terms of its input formats, viz., point clouds, meshes, and design parameters are all valid inputs. We consider a variety of PDEs such as the Elasticity, Plasticity, Euler's, and Navier-Stokes equations, and both forward modeling and inverse design problems. Geo-FNO is $10^5$ times faster than standard numerical solvers and twice as accurate as direct interpolation with existing ML-based PDE solvers such as the standard FNO.  ( 2 min )
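    The FFT-based operation that restricts the vanilla FNO to uniform grids, and that geo-FNO recovers after deforming the domain, is the spectral convolution. A minimal 1-D NumPy sketch of that core step (illustrative; real FNO layers are multi-channel, learned end-to-end, and implemented in a deep learning framework):

        import numpy as np

        def spectral_conv_1d(u, weights, modes):
            # u: real signal on a uniform grid; weights: complex, shape (modes,).
            # FFT -> scale the lowest `modes` Fourier coefficients -> inverse FFT.
            u_hat = np.fft.rfft(u)
            out_hat = np.zeros_like(u_hat)
            out_hat[:modes] = u_hat[:modes] * weights
            return np.fft.irfft(out_hat, n=u.shape[0])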
    DAUX: a Density-based Approach for Uncertainty eXplanations. (arXiv:2207.05161v1 [cs.LG])
    Uncertainty quantification (UQ) is essential for creating trustworthy machine learning models. Recent years have seen a steep rise in UQ methods that can flag suspicious examples; however, it is often unclear what exactly these methods identify. In this work, we propose an assumption-light method for interpreting UQ models themselves. We introduce the confusion density matrix -- a kernel-based approximation of the misclassification density -- and use this to categorize suspicious examples identified by a given UQ method into three classes: out-of-distribution (OOD) examples, boundary (Bnd) examples, and examples in regions of high in-distribution misclassification (IDM). Through extensive experiments, we shed light on existing UQ methods and show that the cause of the uncertainty differs across models. Additionally, we show how the proposed framework can make use of the categorized examples to improve predictive performance.  ( 2 min )
    Adaptive Graph Spatial-Temporal Transformer Network for Traffic Flow Forecasting. (arXiv:2207.05064v1 [cs.LG])
    Traffic flow forecasting on graphs has real-world applications in many fields, such as transportation systems and computer networks. Traffic forecasting can be highly challenging due to complex spatial-temporal correlations and non-linear traffic patterns. Existing works mostly model such spatial-temporal dependencies by considering spatial correlations and temporal correlations separately, and fail to model the direct spatial-temporal correlations. Inspired by the recent success of transformers in the graph domain, in this paper, we propose to directly model the cross-spatial-temporal correlations on the spatial-temporal graph using local multi-head self-attentions. To reduce the time complexity, we set the attention receptive field to the spatially neighboring nodes, and we also introduce an adaptive graph to capture the hidden spatial-temporal dependencies. Based on these attention mechanisms, we propose a novel Adaptive Graph Spatial-Temporal Transformer Network (ASTTN), which stacks multiple spatial-temporal attention layers to apply self-attention on the input graph, followed by linear layers for predictions. Experimental results on the public traffic network datasets METR-LA, PEMS-BAY, PeMSD4, and PeMSD7 demonstrate the superior performance of our model.  ( 2 min )
    Discovering Domain Disentanglement for Generalized Multi-source Domain Adaptation. (arXiv:2207.05070v1 [cs.LG])
    A typical multi-source domain adaptation (MSDA) approach aims to transfer knowledge learned from a set of labeled source domains to an unlabeled target domain. Nevertheless, prior works strictly assume that each source domain shares the identical group of classes with the target domain, which can hardly be guaranteed since the target label space is not observable. In this paper, we consider a more versatile setting of MSDA, namely Generalized Multi-source Domain Adaptation, wherein the source domains are partially overlapped, and the target domain is allowed to contain novel categories that are not present in any source domain. This new setting is more elusive than any existing domain adaptation protocols due to the coexistence of domain and category shifts across the source and target domains. To address this issue, we propose a variational domain disentanglement (VDD) framework, which decomposes the domain representations and semantic features for each instance by encouraging dimension-wise independence. To identify the target samples of unknown classes, we leverage online pseudo labeling, which assigns pseudo-labels to unlabeled target data based on confidence scores. Quantitative and qualitative experiments conducted on two benchmark datasets demonstrate the validity of the proposed framework.  ( 2 min )
    A Bipartite Graph Neural Network Approach for Scalable Beamforming Optimization. (arXiv:2207.05364v1 [eess.SP])
    Deep learning (DL) techniques have been intensively studied for the optimization of multi-user multiple-input single-output (MU-MISO) downlink systems owing to their capability of handling nonconvex formulations. However, the fixed computation structure of existing deep neural networks (DNNs) lacks flexibility with respect to the system size, i.e., the number of antennas or users. This paper develops a bipartite graph neural network (BGNN) framework, a scalable DL solution designed for multi-antenna beamforming optimization. The MU-MISO system is first characterized by a bipartite graph where two disjoint vertex sets, consisting of transmit antennas and users respectively, are connected via pairwise edges. These vertex interconnection states are modeled by channel fading coefficients. Thus, a generic beamforming optimization process is interpreted as a computation task over a weighted bipartite graph. This approach partitions the beamforming optimization procedure into multiple suboperations dedicated to individual antenna vertices and user vertices. Separated vertex operations lead to scalable beamforming calculations that are invariant to the system size. The vertex operations are realized by a group of DNN modules that collectively form the BGNN architecture. Identical DNNs are reused at all antennas and users so that the resulting learning structure becomes flexible to the network size. Component DNNs of the BGNN are trained jointly over numerous MU-MISO configurations with randomly varying network sizes. As a result, the trained BGNN can be universally applied to arbitrary MU-MISO systems. Numerical results validate the advantages of the BGNN framework over conventional methods.  ( 3 min )
    Photonic Reconfigurable Accelerators for Efficient Inference of CNNs with Mixed-Sized Tensors. (arXiv:2207.05278v1 [cs.AR])
    Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements for processing deep Convolutional Neural Networks (CNNs). However, previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors. One example of such CNNs is depthwise separable CNNs. Performing inferences of CNNs with mixed-sized tensors on such inflexible accelerators often leads to low hardware utilization, which diminishes the achievable performance and energy efficiency from the accelerators. In this paper, we present a novel way of introducing reconfigurability in the MRR-based CNN accelerators, to enable dynamic maximization of the size compatibility between the accelerator hardware components and the CNN tensors that are processed using the hardware components. We classify the state-of-the-art MRR-based CNN accelerators from prior works into two categories, based on the layout and relative placements of the utilized hardware components in the accelerators. We then use our method to introduce reconfigurability in accelerators from these two classes, to consequently improve their parallelism, the flexibility of efficiently mapping tensors of different sizes, speed, and overall energy efficiency. We evaluate our reconfigurable accelerators against three prior works for the area proportionate outlook (equal hardware area for all accelerators). Our evaluation for the inference of four modern CNNs indicates that our designed reconfigurable CNN accelerators provide improvements of up to 1.8x in Frames-Per-Second (FPS) and up to 1.5x in FPS/W, compared to an MRR-based accelerator from prior work.  ( 3 min )
    FedPseudo: Pseudo value-based Deep Learning Models for Federated Survival Analysis. (arXiv:2207.05247v1 [cs.LG])
    Survival analysis, also known as time-to-event analysis, is an important problem in healthcare since it has a wide-ranging impact on patients and palliative care. Many survival analysis methods have assumed that the survival data is centrally available, either from one medical center or through data sharing among multiple centers. However, the sensitivity of patient attributes and strict privacy laws have increasingly forbidden the sharing of healthcare data. To address this challenge, the research community has looked at the solution of decentralized training and sharing of model parameters using the Federated Learning (FL) paradigm. In this paper, we study the utilization of FL for performing survival analysis on distributed healthcare datasets. Recently, the popular Cox proportional hazard (CPH) models have been adapted for FL settings; however, due to their linearity and proportional hazards assumptions, CPH models result in suboptimal performance, especially for non-linear, non-iid, and heavily censored survival datasets. To overcome the challenges of existing federated survival analysis methods, we leverage the predictive accuracy of deep learning models and the power of pseudo values to propose a first-of-its-kind, pseudo value-based deep learning model for federated survival analysis (FSA) called FedPseudo. Furthermore, we introduce a novel approach for deriving pseudo values for survival probability in FL settings that speeds up the computation of pseudo values. Extensive experiments on synthetic and real-world datasets show that our pseudo value-based FL framework achieves performance similar to the best centrally trained deep survival analysis model. Moreover, our proposed FL approach obtains the best results for various censoring settings.  ( 3 min )
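    For readers unfamiliar with pseudo values: the classical (centralized) jackknife construction replaces each subject's partially observed survival status at time t with P_i = n * S(t) - (n - 1) * S_{-i}(t), where S is the Kaplan-Meier estimate and S_{-i} leaves subject i out. A minimal sketch using lifelines (assuming NumPy arrays of durations and event indicators; the paper's contribution is a faster way to obtain such values in the federated setting):

        import numpy as np
        from lifelines import KaplanMeierFitter

        def pseudo_values(durations, events, t):
            n = len(durations)
            s_full = KaplanMeierFitter().fit(durations, events).survival_function_at_times(t).values[0]
            out = np.empty(n)
            for i in range(n):
                mask = np.arange(n) != i   # leave-one-out Kaplan-Meier
                s_i = KaplanMeierFitter().fit(durations[mask], events[mask]).survival_function_at_times(t).values[0]
                out[i] = n * s_full - (n - 1) * s_i
            return out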
    Few-Shot Semantic Relation Prediction across Heterogeneous Graphs. (arXiv:2207.05068v1 [cs.LG])
    Semantic relation prediction aims to mine the implicit relationships between objects in heterogeneous graphs, which consist of different types of objects and different types of links. In real-world scenarios, new semantic relations constantly emerge and they typically appear with only a few labeled data. Since a variety of semantic relations exist in multiple heterogeneous graphs, the transferable knowledge can be mined from some existing semantic relations to help predict the new semantic relations with few labeled data. This inspires a novel problem of few-shot semantic relation prediction across heterogeneous graphs. However, the existing methods cannot solve this problem because they not only require a large number of labeled samples as input, but also focus on a single graph with a fixed heterogeneity. Targeting this novel and challenging problem, in this paper, we propose a Meta-learning based Graph neural network for Semantic relation prediction, named MetaGS. Firstly, MetaGS decomposes the graph structure between objects into multiple normalized subgraphs, then adopts a two-view graph neural network to capture local heterogeneous information and global structure information of these subgraphs. Secondly, MetaGS aggregates the information of these subgraphs with a hyper-prototypical network, which can learn from existing semantic relations and adapt to new semantic relations. Thirdly, using the well-initialized two-view graph neural network and hyper-prototypical network, MetaGS can effectively learn new semantic relations from different graphs while overcoming the limitation of few labeled data. Extensive experiments on three real-world datasets have demonstrated the superior performance of MetaGS over the state-of-the-art methods.  ( 3 min )
    A Macrocolumn Architecture Implemented with Temporal (Spiking) Neurons. (arXiv:2207.05081v1 [cs.NE])
    With the long-term goal of reverse-architecting the computational brain from the bottom up, the focus of this document is the macrocolumn abstraction layer. A basic macrocolumn architecture is developed by first describing its operation with a state machine model. Then state machine functions are implemented with spiking neurons that support temporal computation. The neuron model is based on active spiking dendrites and mirrors the Hawkins/Numenta neuron model. The architecture is demonstrated with a research benchmark in which an agent uses a macrocolumn to first learn and then navigate 2-d environments containing randomly placed features. Environments are represented in the macrocolumn as labeled directed graphs where edges connect features and labels indicate the relative displacements between them.  ( 2 min )
    Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation. (arXiv:2207.05250v1 [stat.ML])
    The real-world testing of decisions made using causal machine learning models is an essential prerequisite for their successful application. We focus on evaluating and improving contextual treatment assignment decisions: these are personalised treatments applied to e.g. customers, each with their own contextual information, with the aim of maximising a reward. In this paper we introduce a model-agnostic framework for gathering data to evaluate and improve contextual decision making through Bayesian Experimental Design. Specifically, our method is used for the data-efficient evaluation of the regret of past treatment assignments. Unlike approaches such as A/B testing, our method avoids assigning treatments that are known to be highly sub-optimal, whilst engaging in some exploration to gather pertinent information. We achieve this by introducing an information-based design objective, which we optimise end-to-end. Our method applies to discrete and continuous treatments. Comparing our information-theoretic approach to baselines in several simulation studies demonstrates the superior performance of our proposed approach.  ( 2 min )
    Scaling Novel Object Detection with Weakly Supervised Detection Transformers. (arXiv:2207.05205v1 [cs.CV])
    Weakly supervised object detection (WSOD) enables object detectors to be trained using image-level class labels. However, the practical application of current WSOD models is limited, as they operate at small scales and require extensive training and refinement. We propose the Weakly Supervised Detection Transformer, which enables efficient knowledge transfer from a large-scale pretraining dataset to WSOD finetuning on hundreds of novel objects. We leverage pretrained knowledge to improve the multiple instance learning framework used in WSOD, and experiments show our approach outperforms the state of the art on datasets with twice as many novel classes as previously shown.  ( 2 min )
    RUSH: Robust Contrastive Learning via Randomized Smoothing. (arXiv:2207.05127v1 [cs.LG])
    Recently, adversarial training has been incorporated into self-supervised contrastive pre-training to augment label efficiency with exciting adversarial robustness. However, the robustness comes at the cost of expensive adversarial training. In this paper, we show the surprising fact that contrastive pre-training has an interesting yet implicit connection with robustness, and that such natural robustness in the pre-trained representation enables us to design a powerful robust algorithm against adversarial attacks, RUSH, that combines standard contrastive pre-training and randomized smoothing. It boosts both standard accuracy and robust accuracy, and significantly reduces training costs compared with adversarial training. We use extensive empirical studies to show that the proposed RUSH outperforms robust classifiers from adversarial training, by a significant margin on common benchmarks (CIFAR-10, CIFAR-100, and STL-10) under first-order attacks. In particular, under $\ell_{\infty}$-norm perturbations of size 8/255 in a PGD attack on CIFAR-10, our model using ResNet-18 as the backbone reaches 77.8% robust accuracy and 87.9% standard accuracy. Compared to the state of the art, our work achieves an improvement of over 15% in robust accuracy and a slight improvement in standard accuracy.  ( 2 min )
    Can Language Models perform Abductive Commonsense Reasoning?. (arXiv:2207.05155v1 [cs.AI])
    Abductive reasoning is the task of inferring the most plausible hypothesis given a set of observations. In the literature, the community has approached this challenge by classifying/generating a likely hypothesis that does not contradict a past observation and a future observation. Some of the most well-known benchmarks that tackle this problem are aNLI and aNLG (pronounced alpha-NLI and alpha-NLG). In this report, I review some of the methodologies that have been attempted to solve this challenge, re-implement the baseline models, and analyze some of the weaknesses of current approaches. The code and the re-implemented results are available at this link.  ( 2 min )
    Online Continual Learning of End-to-End Speech Recognition Models. (arXiv:2207.05071v1 [cs.LG])
    Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available. While prior research on continual learning in automatic speech recognition has focused on the adaptation of models across multiple different speech recognition tasks, in this paper we propose an experimental setting for online continual learning for automatic speech recognition of a single task. Specifically focusing on the case where additional training data for the same task becomes available incrementally over time, we demonstrate the effectiveness of performing incremental model updates to end-to-end speech recognition models with an online Gradient Episodic Memory (GEM) method. Moreover, we show that with online continual learning and a selective sampling strategy, we can maintain an accuracy that is similar to retraining a model from scratch while requiring significantly lower computation costs. We have also verified our method with self-supervised learning (SSL) features.  ( 2 min )
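    The core mechanic behind GEM-style methods is a gradient correction against episodic memory. The sketch below shows the simpler A-GEM-style projection (a common variant of GEM; illustrative of the idea rather than the paper's exact online GEM procedure):

        import torch

        def agem_project(grad, grad_mem):
            # grad: flattened gradient on the current batch;
            # grad_mem: flattened gradient on the episodic-memory batch.
            dot = torch.dot(grad, grad_mem)
            if dot < 0:
                # Conflict with past data: project onto the non-conflicting half-space.
                grad = grad - (dot / torch.dot(grad_mem, grad_mem)) * grad_mem
            return grad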
    Keep your Distance: Determining Sampling and Distance Thresholds in Machine Learning Monitoring. (arXiv:2207.05078v1 [cs.LG])
    Machine Learning (ML) has provided promising results in recent years across different applications and domains. However, in many cases, qualities such as reliability or even safety need to be ensured. To this end, one important aspect is to determine whether or not ML components are deployed in situations that are appropriate for their application scope. For components whose environments are open and variable, for instance those found in autonomous vehicles, it is therefore important to monitor their operational situation to determine its distance from the ML components' trained scope. If that distance is deemed too great, the application may choose to consider the ML component outcome unreliable and switch to alternatives, e.g. using human operator input instead. SafeML is a model-agnostic approach for performing such monitoring, using distance measures based on statistical testing of the training and operational datasets. Limitations in setting SafeML up properly include the lack of a systematic approach for determining, for a given application, how many operational samples are needed to yield reliable distance information, as well as for setting an appropriate distance threshold. In this work, we address these limitations by providing a practical approach and demonstrate its use on a well-known traffic sign recognition problem, and on an example using the CARLA open-source automotive simulator.  ( 3 min )
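    The monitoring pattern itself is compact. A minimal sketch using a two-sample Kolmogorov-Smirnov test over one feature (the threshold and significance level below are placeholders; choosing them systematically, along with the sample size, is precisely the problem the paper tackles):

        from scipy.stats import ks_2samp

        def distance_monitor(train_feature, op_feature, dist_threshold=0.2, alpha=0.05):
            # Flag the ML component's output as unreliable when the operational
            # data is both far from the training data and significantly so.
            stat, p_value = ks_2samp(train_feature, op_feature)
            return stat > dist_threshold and p_value < alpha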
    Online Meta-Learning in Adversarial Multi-Armed Bandits. (arXiv:2205.15921v2 [cs.LG] UPDATED)
    We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arm chosen by the adversary. We present an algorithm that can leverage the non-uniformity in this empirical distribution, and derive problem-dependent regret bounds. This solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between the episodes. In the case where the best arm distribution is far from uniform, it improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.
    Scalable Bayesian Inference for Detection and Deblending in Astronomical Images. (arXiv:2207.05642v1 [astro-ph.IM])
    We present a new probabilistic method for detecting, deblending, and cataloging astronomical sources called the Bayesian Light Source Separator (BLISS). BLISS is based on deep generative models, which embed neural networks within a Bayesian model. For posterior inference, BLISS uses a new form of variational inference known as Forward Amortized Variational Inference. The BLISS inference routine is fast, requiring a single forward pass of the encoder networks on a GPU once the encoder networks are trained. BLISS can perform fully Bayesian inference on megapixel images in seconds, and produces highly accurate catalogs. BLISS is highly extensible, and has the potential to directly answer downstream scientific questions in addition to producing probabilistic catalogs.
    Improved Rates for Differentially Private Stochastic Convex Optimization with Heavy-Tailed Data. (arXiv:2106.01336v5 [cs.LG] UPDATED)
    We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy (DP). Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under concentrated DP for convex and strongly convex loss functions. Along the way, we derive new algorithms for private mean estimation of heavy-tailed distributions, under both pure and concentrated DP. Finally, we prove nearly-matching lower bounds for private stochastic convex optimization with strongly convex losses and mean estimation, showing new separations between pure and concentrated DP.
    Shapley Computations Using Surrogate Model-Based Trees. (arXiv:2207.05214v1 [stat.ML])
    Shapley-related techniques have gained attention as both global and local interpretation tools because of their desirable properties. However, their computation using conditional expectations is computationally expensive. Approximation methods suggested in the literature have limitations. This paper proposes the use of a surrogate model-based tree to compute Shapley and SHAP values based on conditional expectation. Simulation studies show that the proposed algorithm provides improvements in accuracy, unifies global Shapley and SHAP interpretation, and the thresholding method provides a way to trade-off running time and accuracy.
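    As context for why surrogate approximations are needed at all: the exact Shapley value averages a feature's marginal contribution over all subsets, which is exponential in the number of features. A brute-force reference implementation (fine for small n, and exactly the cost that surrogate model-based trees are designed to avoid):

        from itertools import combinations
        from math import factorial

        def shapley_values(value, n):
            # value: callable mapping a frozenset of feature indices to a payoff.
            phi = [0.0] * n
            for i in range(n):
                others = [j for j in range(n) if j != i]
                for size in range(n):
                    for S in combinations(others, size):
                        w = factorial(size) * factorial(n - size - 1) / factorial(n)
                        phi[i] += w * (value(frozenset(S) | {i}) - value(frozenset(S)))
            return phi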
    Uncertainty-Aware Learning Against Label Noise on Imbalanced Datasets. (arXiv:2207.05471v1 [stat.ML])
    Learning against label noise is a vital topic for guaranteeing reliable performance of deep neural networks. Recent research usually resorts to dynamic noise modeling with model output probabilities and loss values, and then separates clean and noisy samples. These methods have gained notable success. However, unlike on cherry-picked data, existing approaches often cannot perform well when facing imbalanced datasets, a common scenario in the real world. We thoroughly investigate this phenomenon and point out two major issues that hinder performance: inter-class loss distribution discrepancy, and misleading predictions due to uncertainty. The first issue is that existing methods often perform class-agnostic noise modeling. However, loss distributions show a significant discrepancy among classes under class imbalance, and class-agnostic noise modeling can easily confuse noisy samples with samples in minority classes. The second issue is that models may output misleading predictions due to epistemic and aleatoric uncertainty, so existing methods that rely solely on output probabilities may fail to distinguish confident samples. Inspired by our observations, we propose an Uncertainty-aware Label Correction framework (ULC) to handle label noise on imbalanced datasets. First, we perform epistemic-uncertainty-aware, class-specific noise modeling to identify trustworthy clean samples and refine/discard highly confident true/corrupted labels. Then, we introduce aleatoric uncertainty in the subsequent learning process to prevent noise accumulation in the label noise modeling process. We conduct experiments on several synthetic and real-world datasets. The results demonstrate the effectiveness of the proposed method, especially on imbalanced datasets.
    Collaborative Uncertainty Benefits Multi-Agent Multi-Modal Trajectory Forecasting. (arXiv:2207.05195v1 [cs.CV])
    In multi-modal multi-agent trajectory forecasting, two major challenges have not been fully tackled: 1) how to measure the uncertainty brought by the interaction module, which causes correlations among the predicted trajectories of multiple agents; and 2) how to rank the multiple predictions and select the optimal predicted trajectory. To handle these challenges, this work first proposes a novel concept, collaborative uncertainty (CU), which models the uncertainty resulting from interaction modules. We then build a general CU-aware regression framework with an original permutation-equivariant uncertainty estimator to perform both regression and uncertainty estimation. Further, we apply the proposed framework to current SOTA multi-agent multi-modal forecasting systems as a plugin module, which enables the SOTA systems to 1) estimate the uncertainty in the multi-agent multi-modal trajectory forecasting task; and 2) rank the multiple predictions and select the optimal one based on the estimated uncertainty. We conduct extensive experiments on a synthetic dataset and two public large-scale multi-agent trajectory forecasting benchmarks. Experiments show that: 1) on the synthetic dataset, the CU-aware regression framework allows the model to appropriately approximate the ground-truth Laplace distribution; 2) on the multi-agent trajectory forecasting benchmarks, the CU-aware regression framework steadily helps SOTA systems improve their performance; specifically, the proposed framework helps VectorNet improve the Final Displacement Error of the chosen optimal prediction by 262 cm on the nuScenes dataset; 3) for multi-agent multi-modal trajectory forecasting systems, prediction uncertainty is positively correlated with future stochasticity; and 4) the estimated CU values are highly related to the interactive information among agents.
    The d-separation criterion in Categorical Probability. (arXiv:2207.05740v1 [math.ST])
    The d-separation criterion detects the compatibility of a joint probability distribution with a directed acyclic graph through certain conditional independences. In this work, we study this problem in the context of categorical probability theory by introducing a categorical definition of causal models, a categorical notion of d-separation, and proving an abstract version of the d-separation criterion. This approach has two main benefits. First, categorical d-separation is a very intuitive criterion based on topological connectedness. Second, our results apply in measure-theoretic probability (with standard Borel spaces), and therefore provide a clean proof of the equivalence of local and global Markov properties with causal compatibility for continuous and mixed variables.
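    For readers who want to experiment with the classical (graphical) criterion that this work generalizes categorically, networkx ships a direct implementation (called d_separated in networkx 2.x; newer releases rename it is_d_separator). A small sketch:

        import networkx as nx

        # Z is a common cause of X and Y; W is a collider (X -> W <- Y).
        G = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "W"), ("Y", "W")])
        print(nx.d_separated(G, {"X"}, {"Y"}, {"Z"}))        # True: conditioning on Z blocks the back-door path
        print(nx.d_separated(G, {"X"}, {"Y"}, {"Z", "W"}))   # False: conditioning on the collider W opens a path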
    Unsupervised learning of observation functions in state-space models by nonparametric moment methods. (arXiv:2207.05242v1 [stat.ML])
We investigate the unsupervised learning of non-invertible observation functions in nonlinear state-space models. Assuming abundant data of the observation process along with the distribution of the state process, we introduce a nonparametric generalized moment method to estimate the observation function via constrained regression. The major challenge comes from the non-invertibility of the observation function and the lack of data pairs between the state and observation. We address the fundamental issue of identifiability from quadratic loss functionals and show that the function space of identifiability is the closure of an RKHS (reproducing kernel Hilbert space) that is intrinsic to the state process. Numerical results show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, leading to convergent estimators. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.
    Sliced-Wasserstein normalizing flows: beyond maximum likelihood training. (arXiv:2207.05468v1 [stat.ML])
Despite their advantages, normalizing flows generally suffer from several shortcomings, including their tendency to generate unrealistic data (e.g., images) and their failure to detect out-of-distribution data. One reason for these deficiencies lies in the training strategy, which traditionally relies on the maximum likelihood principle alone. This paper proposes a new training paradigm based on a hybrid objective function combining the maximum likelihood principle and a sliced-Wasserstein distance. Results obtained on synthetic toy examples and real image data sets show better generative abilities in terms of both likelihood and visual aspects of the generated samples. Reciprocally, the proposed approach leads to a lower likelihood of out-of-distribution data, demonstrating a greater data fidelity of the resulting flows.
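For readers unfamiliar with the second term of the hybrid objective, here is a minimal sketch of the sliced-Wasserstein distance between two samples: project onto random directions, sort, and compare the sorted projections (the 1D Wasserstein distance has this closed form). The flow and likelihood parts are omitted.

```python
# Sliced-Wasserstein (SW_2) between two empirical samples of equal size.
import numpy as np

def sliced_wasserstein(x, y, n_projections=64, rng=None):
    rng = rng or np.random.default_rng(0)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)            # random unit direction
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean((px - py) ** 2)          # 1D W2^2 of sorted projections
    return np.sqrt(total / n_projections)

# Toy check: identical samples give distance 0.
x = np.random.default_rng(1).normal(size=(256, 2))
print(sliced_wasserstein(x, x))  # 0.0
```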
    Grounding Aleatoric Uncertainty in Unsupervised Environment Design. (arXiv:2207.05219v1 [cs.LG])
    Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
    A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data. (arXiv:2201.12020v3 [stat.ML] UPDATED)
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A classical imputation method, the Expectation Maximization (EM) algorithm for Gaussian mixture models, has shown interesting properties when compared to other popular approaches such as those based on k-nearest neighbors or on multiple imputation by chained equations. However, Gaussian mixture models are known to be non-robust to heterogeneous data, which can lead to poor estimation performance when the data is contaminated by outliers or follows non-Gaussian distributions. To overcome this issue, a new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. This paper shows that this problem reduces to the estimation of a mixture of Angular Gaussian distributions under generic assumptions (i.e., each sample is drawn from a mixture of elliptical distributions, possibly different from one sample to another). In that case, the complete-data likelihood associated with mixtures of elliptical distributions is well adapted to the EM framework with missing data thanks to its conditional distribution, which is shown to be a multivariate $t$-distribution. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data. Furthermore, experiments conducted on real-world datasets show that this algorithm is very competitive when compared to other classical imputation methods.  ( 3 min )
    Capturing Evolution Genes for Time Series Data. (arXiv:1905.05004v2 [cs.LG] UPDATED)
The modeling of time series is becoming increasingly critical in a wide variety of applications. Overall, data evolves by following different patterns, which are generally caused by different user behaviors. Given a time series, we define the evolution gene to capture the latent user behaviors and to describe how the behaviors lead to the generation of the time series. In particular, we propose a unified framework that recognizes different evolution genes of segments by learning a classifier, and adopts an adversarial generator to implement the evolution gene by estimating the segments' distribution. Experimental results based on a synthetic dataset and five real-world datasets show that our approach not only achieves good prediction results (e.g., +10.56% in F1 on average), but is also able to provide explanations of the results.
    Size and depth of monotone neural networks: interpolation and approximation. (arXiv:2207.05275v1 [cs.LG])
Monotone functions and data sets arise in a variety of applications. We study the interpolation problem for monotone data sets: The input is a monotone data set with $n$ points, and the goal is to find a size- and depth-efficient monotone neural network, with non-negative parameters and threshold units, that interpolates the data set. We show that there are monotone data sets that cannot be interpolated by a monotone network of depth $2$. On the other hand, we prove that for every monotone data set with $n$ points in $\mathbb{R}^d$, there exists an interpolating monotone network of depth $4$ and size $O(nd)$. Our interpolation result implies that every monotone function over $[0,1]^d$ can be approximated arbitrarily well by a depth-4 monotone network, improving the previous best-known construction of depth $d+1$. Finally, building on results from Boolean circuit complexity, we show that the inductive bias of having positive parameters can lead to a super-polynomial blow-up in the number of neurons when approximating monotone functions.
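As a sketch of the inductive bias in question: constraining weights to be non-negative and composing with monotone activations yields a network that is monotone by construction. This illustrates only the bias; the paper's depth-4 threshold-unit interpolation construction is combinatorial and not reproduced here.

```python
# A monotone MLP: non-negative weights (via squaring a free parameter)
# composed with monotone activations give a coordinate-wise monotone map.
import torch

class MonotoneLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.b = torch.nn.Parameter(torch.zeros(d_out))
    def forward(self, x):
        return x @ (self.w ** 2).t() + self.b   # weights >= 0 by construction

net = torch.nn.Sequential(MonotoneLinear(3, 16), torch.nn.ReLU(),
                          MonotoneLinear(16, 1))
# Increasing any input coordinate can only increase the output of `net`.
```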
    High-dimensional Inference for Dynamic Treatment Effects. (arXiv:2110.04924v3 [stat.ME] UPDATED)
This paper proposes a confidence interval construction for heterogeneous treatment effects in the context of multi-stage experiments with $N$ samples and high-dimensional, $d$, confounders. Our focus is on the case of $d\gg N$, but the results obtained also apply to low-dimensional cases. We showcase that the bias of regularized estimation, unavoidable in high-dimensional covariate spaces, is mitigated with a simple double-robust score. In this way, no additional bias removal is necessary, and we obtain root-$N$ inference results while allowing multi-stage interdependency of the treatments and covariates. The memoryless property is also not assumed; treatment can depend on all previous treatment assignments and all previous multi-stage confounders. Our results rely on certain sparsity assumptions of the underlying dependencies. We discover new product rate conditions necessary for robust inference with dynamic treatments.
    Edge Augmentation on Disconnected Graphs via Eigenvalue Elevation. (arXiv:2207.05301v1 [cs.SI])
The graph-theoretical task of determining the most likely inter-community edges based on disconnected subgraphs' intra-community connectivity is proposed. An algorithm is developed for this edge augmentation task, based on elevating the zero eigenvalues of the graph's spectrum. Upper bounds for the eigenvalue elevation amplitude and for the corresponding augmented edge density are derived and validated by simulations on random graphs. The algorithm works consistently across synthetic and real networks, yielding desirable performance at connecting graph components. Edge augmentation reverse-engineers graph partitions under different community detection methods (Girvan-Newman method, greedy modularity maximization, label propagation, Louvain method, and fluid community), in most cases producing inter-community edges at >50% frequency.  ( 2 min )
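The spectral fact the method builds on is easy to verify numerically: the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components, so adding inter-component edges "elevates" zero eigenvalues. A small self-contained check:

```python
# Zero Laplacian eigenvalues count connected components.
import numpy as np

def laplacian(adj):
    return np.diag(adj.sum(axis=1)) - adj

# Two disconnected 2-node components -> two (near-)zero eigenvalues.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(np.sum(np.linalg.eigvalsh(laplacian(adj)) < 1e-9))  # 2

adj[1, 2] = adj[2, 1] = 1.0  # augment one inter-component edge
print(np.sum(np.linalg.eigvalsh(laplacian(adj)) < 1e-9))  # 1
```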
    Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders. (arXiv:2203.12742v2 [cs.LG] UPDATED)
    Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing \emph{in silico} and \emph{in vitro} properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.  ( 2 min )
    On robust risk-based active-learning algorithms for enhanced decision support. (arXiv:2201.02555v2 [cs.LG] UPDATED)
Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced risk-based active learning, an online approach for the development of statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to the expected value of perfect information (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning, and discriminative classification models. These approaches are first visualised using a synthetic dataset, then subsequently applied to an experimental case study, specifically, the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance, with robustness to sampling bias depending on the suitability of the generative distributions selected for the model with respect to each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign, and therefore resource expenditure, could be reduced with the careful selection of the statistical classifiers used within a decision-supporting monitoring system.
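For concreteness, the expected value of perfect information that drives the querying can be sketched in a few lines: it is the gap between acting after the true class is revealed and acting on the current belief. The utility matrix below is an illustrative assumption, not one from the paper.

```python
# EVPI = E_class[max_action utility] - max_action E_class[utility].
import numpy as np

def evpi(p, U):
    """p: class probabilities (K,); U: utility matrix, actions x classes (A, K)."""
    value_with_info = np.sum(p * U.max(axis=0))   # learn the class, then act
    value_without = np.max(U @ p)                  # act on the current belief
    return value_with_info - value_without

p = np.array([0.6, 0.4])
U = np.array([[1.0, -2.0],    # action 0: good for class 0, costly for class 1
              [-0.5, 0.8]])   # action 1: the reverse
print(evpi(p, U))  # 0.9 > 0: this label is worth querying
```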
    Conservative SPDEs as fluctuating mean field limits of stochastic gradient descent. (arXiv:2207.05705v1 [math.PR])
    The convergence of stochastic interacting particle systems in the mean-field limit to solutions to conservative stochastic partial differential equations is shown, with optimal rate of convergence. As a second main result, a quantitative central limit theorem for such SPDEs is derived, again with optimal rate of convergence. The results apply in particular to the convergence in the mean-field scaling of stochastic gradient descent dynamics in overparametrized, shallow neural networks to solutions to SPDEs. It is shown that the inclusion of fluctuations in the limiting SPDE improves the rate of convergence, and retains information about the fluctuations of stochastic gradient descent in the continuum limit.  ( 2 min )
    AGBoost: Attention-based Modification of Gradient Boosting Machine. (arXiv:2207.05724v1 [cs.LG])
A new attention-based model for the gradient boosting machine (GBM), called AGBoost (attention-based gradient boosting), is proposed for solving regression problems. The main idea behind the proposed AGBoost model is to assign attention weights with trainable parameters to iterations of GBM, under the condition that decision trees are the base learners in GBM. Attention weights are determined by applying properties of decision trees and by using Huber's contamination model, which provides an interesting linear dependence between the trainable parameters of the attention and the attention weights. This peculiarity allows us to train the attention weights by solving a standard quadratic optimization problem with linear constraints. The attention weights also depend on a discount factor as a tuning parameter, which determines how much the impact of a weight decreases with the number of iterations. Numerical experiments with two types of base learners, original decision trees and extremely randomized trees, on various regression datasets illustrate the proposed model.
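A hedged sketch of the weighting scheme described above: combine the per-iteration predictions of a GBM with contamination-model weights w_k = (1 - eps) * v_k + eps / K, discounted by iteration. Here v is a fixed illustrative softmax rather than the trained parameters, and the paper's quadratic-programming step is omitted.

```python
# Attention-weighted combination of per-iteration GBM predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1).fit(X, y)

# staged_predict yields the model's prediction after each boosting iteration.
stage_preds = np.stack(list(gbm.staged_predict(X)))        # (K, n)
K = stage_preds.shape[0]
eps, gamma = 0.2, 0.98                                      # contamination, discount
scores = gamma ** np.arange(K)[::-1]                        # favour late iterations
v = np.exp(scores) / np.exp(scores).sum()
w = (1 - eps) * v + eps / K                                 # Huber-style weights
attn_pred = w @ stage_preds                                 # weighted ensemble
```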
    Latent Variable Models for Bayesian Causal Discovery. (arXiv:2207.05723v1 [cs.LG])
    Learning predictors that do not rely on spurious correlations involves building causal representations. However, learning such a representation is very challenging. We, therefore, formulate the problem of learning a causal representation from high dimensional data and study causal recovery with synthetic data. This work introduces a latent variable decoder model, Decoder BCD, for Bayesian causal discovery and performs experiments in mildly supervised and unsupervised settings. We present a series of synthetic experiments to characterize important factors for causal discovery and show that using known intervention targets as labels helps in unsupervised Bayesian inference over structure and parameters of linear Gaussian additive noise latent structural causal models.
    Neural Posterior Estimation with Differentiable Simulators. (arXiv:2207.05636v1 [astro-ph.IM])
Simulation-Based Inference (SBI) is a promising Bayesian inference framework that alleviates the need for analytic likelihoods to estimate posterior distributions. Recent advances using neural density estimators in SBI algorithms have demonstrated the ability to achieve high-fidelity posteriors, at the expense of a large number of simulations, which makes their application potentially very time-consuming when using complex physical simulations. In this work we focus on boosting the sample-efficiency of posterior density estimation using the gradients of the simulator. We present a new method to perform Neural Posterior Estimation (NPE) with a differentiable simulator. We demonstrate how gradient information helps constrain the shape of the posterior and improves sample-efficiency.
    Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation. (arXiv:2207.05250v1 [stat.ML])
    The real-world testing of decisions made using causal machine learning models is an essential prerequisite for their successful application. We focus on evaluating and improving contextual treatment assignment decisions: these are personalised treatments applied to e.g. customers, each with their own contextual information, with the aim of maximising a reward. In this paper we introduce a model-agnostic framework for gathering data to evaluate and improve contextual decision making through Bayesian Experimental Design. Specifically, our method is used for the data-efficient evaluation of the regret of past treatment assignments. Unlike approaches such as A/B testing, our method avoids assigning treatments that are known to be highly sub-optimal, whilst engaging in some exploration to gather pertinent information. We achieve this by introducing an information-based design objective, which we optimise end-to-end. Our method applies to discrete and continuous treatments. Comparing our information-theoretic approach to baselines in several simulation studies demonstrates the superior performance of our proposed approach.
    Robustness and Personalization in Federated Learning: A Unified Approach via Regularization. (arXiv:2009.06303v3 [cs.LG] UPDATED)
    We present a class of methods for robust, personalized federated learning, called Fed+, that unifies many federated learning algorithms. The principal advantage of this class of methods is to better accommodate the real-world characteristics found in federated training, such as the lack of IID data across parties, the need for robustness to outliers or stragglers, and the requirement to perform well on party-specific datasets. We achieve this through a problem formulation that allows the central server to employ robust ways of aggregating the local models while keeping the structure of local computation intact. Without making any statistical assumption on the degree of heterogeneity of local data across parties, we provide convergence guarantees for Fed+ for convex and non-convex loss functions under different (robust) aggregation methods. The Fed+ theory is also equipped to handle heterogeneous computing environments including stragglers without additional assumptions; specifically, the convergence results cover the general setting where the number of local update steps across parties can vary. We demonstrate the benefits of Fed+ through extensive experiments across standard benchmark datasets.
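As a toy illustration of why the choice of aggregation matters in this setting, compare plain averaging of client parameter vectors with a robust coordinate-wise median when one client is an outlier. The median is one robust aggregation choice such a framework admits; this sketch is not Fed+ itself, and the client "models" are illustrative random vectors.

```python
# Mean vs. robust median aggregation over flattened client parameters.
import numpy as np

clients = np.stack([np.random.default_rng(i).normal(size=10) for i in range(9)])
clients[0] += 100.0                       # one outlier / corrupted client

print(np.mean(clients, axis=0)[:3])       # dragged far off by the outlier
print(np.median(clients, axis=0)[:3])     # robust to it
```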
    Log-Euclidean Signatures for Intrinsic Distances Between Unaligned Datasets. (arXiv:2202.01671v2 [stat.ML] UPDATED)
    The need for efficiently comparing and representing datasets with unknown alignment spans various fields, from model analysis and comparison in machine learning to trend discovery in collections of medical datasets. We use manifold learning to compare the intrinsic geometric structures of different datasets by comparing their diffusion operators, symmetric positive-definite (SPD) matrices that relate to approximations of the continuous Laplace-Beltrami operator from discrete samples. Existing methods typically assume known data alignment and compare such operators in a pointwise manner. Instead, we exploit the Riemannian geometry of SPD matrices to compare these operators and define a new theoretically-motivated distance based on a lower bound of the log-Euclidean metric. Our framework facilitates comparison of data manifolds expressed in datasets with different sizes, numbers of features, and measurement modalities. Our log-Euclidean signature (LES) distance recovers meaningful structural differences, outperforming competing methods in various application domains.
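The log-Euclidean metric that the LES distance lower-bounds has a short closed form for SPD matrices, d(A, B) = ||logm(A) - logm(B)||_F, sketched below with illustrative 2x2 inputs.

```python
# Log-Euclidean distance between symmetric positive-definite matrices.
import numpy as np
from scipy.linalg import logm

def log_euclidean(A, B):
    # logm of an SPD matrix is real; np.real drops numerical imaginary dust.
    return np.linalg.norm(np.real(logm(A)) - np.real(logm(B)), ord="fro")

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.5, 0.0], [0.0, 1.2]])
print(log_euclidean(A, B))
```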
    A Newton-CG based barrier method for finding a second-order stationary point of nonconvex conic optimization with complexity guarantees. (arXiv:2207.05697v1 [math.OC])
    In this paper we consider finding an approximate second-order stationary point (SOSP) of nonconvex conic optimization that minimizes a twice differentiable function over the intersection of an affine subspace and a convex cone. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier method for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of this problem. Our method is not only implementable, but also achieves an iteration complexity of ${\cal O}(\epsilon^{-3/2})$, which matches the best known iteration complexity of second-order methods for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of unconstrained nonconvex optimization. The operation complexity of $\widetilde{\cal O}(\epsilon^{-3/2}\min\{n,\epsilon^{-1/4}\})$, measured by the amount of fundamental operations, is also established for our method.
    Parallel APSM for Fast and Adaptive Digital SIC in Full-Duplex Transceivers with Nonlinearity. (arXiv:2207.05461v1 [eess.SP])
This paper presents a kernel-based adaptive filter that is applied to digital-domain self-interference cancellation (SIC) in a transceiver operating in full-duplex (FD) mode. In FD, the benefit of simultaneously transmitting and receiving signals comes at the price of strong self-interference (SI). In this work, we are primarily interested in suppressing the SI using an adaptive filter, namely the adaptive projected subgradient method (APSM), in a reproducing kernel Hilbert space (RKHS) of functions. Using the projection concept as a powerful tool, APSM is used to model and consequently remove the SI. A low-complexity and fast-tracking algorithm is provided, taking advantage of parallel projections as well as the kernel trick in RKHS. The performance of the proposed method is evaluated on real measurement data. The results illustrate the good performance of the proposed adaptive filter compared to popular benchmarks and demonstrate that the kernel-based algorithm achieves a favorable level of digital SIC while enabling parallel-computation-based implementation within a rich and nonlinear function space, thanks to the employed adaptive filtering method.
    The Cosmic Graph: Optimal Information Extraction from Large-Scale Structure using Catalogues. (arXiv:2207.05202v1 [astro-ph.CO])
We present an implicit likelihood approach to quantifying cosmological information over discrete catalogue data, assembled as graphs. To do so, we explore cosmological inference using mock dark matter halo catalogues. We employ Information Maximising Neural Networks (IMNNs) to quantify Fisher information extraction as a function of graph representation. We a) demonstrate the high sensitivity of modular graph structure to the underlying cosmology in the noise-free limit, b) show that networks automatically combine mass and clustering information through comparisons to traditional statistics, c) demonstrate that graph neural networks can still extract information when catalogues are subject to noisy survey cuts, and d) illustrate how nonlinear IMNN summaries can be used as asymptotically optimal compressed statistics for Bayesian implicit likelihood inference. We reduce the area of joint $\Omega_m, \sigma_8$ parameter constraints with small ($\sim$100 object) halo catalogues by a factor of 42 over the two-point correlation function. This work utilises a new IMNN implementation over graph data in Jax, which can take advantage of either numerical or automatic differentiability. We also show that graph IMNNs successfully compress simulations far from the fiducial model at which the network is fitted, indicating a promising alternative to $n$-point statistics in catalogue-based analyses.
    Wasserstein multivariate auto-regressive models for modeling distributional time series and its application in graph learning. (arXiv:2207.05442v1 [stat.ML])
We propose a new auto-regressive model for the statistical analysis of multivariate distributional time series. The data of interest consist of a collection of multiple series of probability measures supported over a bounded interval of the real line, indexed by distinct time instants. The probability measures are modelled as random objects in the Wasserstein space. We establish the auto-regressive model in the tangent space at the Lebesgue measure by first centering all the raw measures so that their Fr\'echet means become the Lebesgue measure. Using the theory of iterated random function systems, results on the existence, uniqueness and stationarity of the solution of such a model are provided. We also propose a consistent estimator for the model coefficient. In addition to the analysis of simulated data, the proposed model is illustrated with two real data sets consisting of age distributions in different countries and the bike-sharing network in Paris. Finally, due to the positivity and boundedness constraints that we impose on the model coefficients, the proposed estimator learned under these constraints naturally has a sparse structure. This sparsity furthermore allows the proposed model to be applied to learning a graph of temporal dependency from multivariate distributional time series.
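To make the tangent-space construction concrete, here is a hedged toy sketch for univariate measures: represent each measure on [0, 1] by its quantile function on a grid, subtract the identity (a tangent vector at the Lebesgue measure), and fit an order-1 autoregression by least squares. The paper's multivariate model, its positivity/boundedness constraints, and the Frechet-mean centering are richer than this; the Beta family below is purely illustrative.

```python
# Toy AR(1) on tangent vectors of a 1D distributional time series.
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
u = np.linspace(0.02, 0.98, 20)                    # quantile grid on (0, 1)

# Toy series: Beta(a_t, a_t) measures with a slowly drifting parameter a_t.
a = 2.0 + np.cumsum(0.05 * rng.normal(size=60))
Q = np.stack([beta.ppf(u, max(at, 0.2), max(at, 0.2)) for at in a])
V = Q - u                                          # tangent vectors at Lebesgue

# Least-squares AR(1) map: V_t ~ V_{t-1} @ A.
A_hat, *_ = np.linalg.lstsq(V[:-1], V[1:], rcond=None)
one_step_forecast = V[-1] @ A_hat + u              # back to a quantile function
```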
    Markovian Gaussian Process Variational Autoencoders. (arXiv:2207.05543v1 [cs.LG])
Deep generative models are widely used for modelling high-dimensional time series, such as video animations, audio and climate data. Sequential variational autoencoders have been successfully considered for many applications, with many variant models relying on discrete-time methods and recurrent neural networks (RNNs). On the other hand, continuous-time methods have recently gained traction, especially in the context of irregularly-sampled time series, where they can handle the data better than discrete-time methods. One such class are Gaussian process variational autoencoders (GPVAEs), where the VAE prior is set as a Gaussian process (GP), allowing inductive biases to be explicitly encoded via the kernel function, as well as interpretability of the latent space. However, a major limitation of GPVAEs is that they inherit the same cubic computational cost as GPs. In this work, we leverage the equivalent discrete state-space representation of Markovian GPs to enable a linear-time GP solver via Kalman filtering and smoothing. We show on corrupted- and missing-frame tasks that our method performs favourably, especially on the latter, where it outperforms RNN-based models.
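The linear-time claim rests on a standard equivalence: a Matern-1/2 GP is an Ornstein-Uhlenbeck process with an exact state-space form, so posterior marginals come from an O(T) Kalman filter rather than an O(T^3) GP solve. A minimal filtering sketch follows; the VAE around it is omitted and the hyperparameters are illustrative.

```python
# O(T) Kalman filter for a Matern-1/2 (OU) GP with Gaussian observations.
import numpy as np

def kalman_filter_ou(y, dt=1.0, ell=5.0, sig2=1.0, noise=0.1):
    phi = np.exp(-dt / ell)                  # OU transition coefficient
    q = sig2 * (1 - phi ** 2)                # process noise variance
    m, p = 0.0, sig2                         # prior mean / variance
    means = []
    for yt in y:
        m, p = phi * m, phi ** 2 * p + q      # predict
        k = p / (p + noise)                   # Kalman gain
        m, p = m + k * (yt - m), (1 - k) * p  # update
        means.append(m)
    return np.array(means)

y = np.sin(np.linspace(0, 6, 100)) + 0.3 * np.random.default_rng(0).normal(size=100)
post_mean = kalman_filter_ou(y)              # filtered posterior means, O(T)
```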
    Multi-Model Federated Learning with Provable Guarantees. (arXiv:2207.04330v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a variant of distributed learning where edge devices collaborate to learn a model without sharing their data with the central server or each other. We refer to the process of training multiple independent models simultaneously in a federated setting using a common pool of clients as multi-model FL. In this work, we propose two variants of the popular FedAvg algorithm for multi-model FL, with provable convergence guarantees. We further show that for the same amount of computation, multi-model FL can have better performance than training each model separately. We supplement our theoretical results with experiments in strongly convex, convex, and non-convex settings.

  • Open

    Is reinforcement learning the tool for this?
Help with creating a first reinforcement learning AI. I'm wondering if reinforcement learning is right for a game. In the game you need to pick which objects to move and move them an arbitrary distance to reach a desired configuration of objects and their connections. The point of the game is to move a minimal number of objects. I guess my question is: can I use Keras reinforcement learning to create an agent whose action is this: it picks an object, a direction, and a distance to move the object? Then it would take as many actions as it needs to solve the problem, and hopefully learn to solve it in fewer steps than before until it reaches an optimal number of steps. Any feedback would be well and truly appreciated. Thanks in advance! submitted by /u/RollingLSlowly [link] [comments]  ( 85 min )
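One hedged way to encode the composite action described in the post above as a single Gym action space; the names and bounds are illustrative assumptions, and note that many off-the-shelf agents require flattening a Dict space into a Box or MultiDiscrete first.

```python
# "Pick an object, a direction, a distance" as one Gym action space.
import numpy as np
from gym import spaces

n_objects = 10
action_space = spaces.Dict({
    "object":    spaces.Discrete(n_objects),                      # which object
    "direction": spaces.Box(low=-np.pi, high=np.pi, shape=(1,)),  # angle
    "distance":  spaces.Box(low=0.0, high=5.0, shape=(1,)),       # how far
})
print(action_space.sample())
```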
    CleanRL now has a DDPG + JAX implementation roughly 2.5-4x faster than DDPG + PyTorch
    submitted by /u/vwxyzjn [link] [comments]  ( 84 min )
    Oleh Rybkin, UPenn, on exploration and planning with world models
    Here is a podcast with Oleh Rybkin where we discuss agents that explore and plan (and do yoga), how to learn world models from video, what's missing from current RL research, and much more! submitted by /u/thejashGI [link] [comments]  ( 84 min )
Are ML conference challenges worth participating in?
    Do industry and academia really value these challenges? Or what are your thoughts about them? submitted by /u/Blasphemer666 [link] [comments]  ( 84 min )
Help: has anyone coded a hexagonal maze environment?
    I am looking for a maze-like environment where each cell is hexagonal (6 sides), with a few sides open for passage and the other sides acting as walls. submitted by /u/kachua26 [link] [comments]  ( 84 min )
Best libraries to code gym env simulations for GPU?
    I'm trying to compare the speed of executing RL on CPU vs GPU for a simple workstation (user-level high-end PC). My nets are simple (3 layers of 256 units) and the environment I'm trying to test is a drone-like environment (similar to 3D robots without world interactions, only aerial movement physics). I've already executed only the training on GPU (specifically with ray/rllib), but due to the small net and compute-heavy sim, the speed is almost the same; I think this is due to the latency of sending the data back and forth. So now I want to run both training and simulation on the GPU. Up until now I've come across Nvidia's Isaac Gym and Brax simulators, but both use libraries dedicated to the GPU (like PyTorch or JAX). Are there any other libraries? Which makes it easier to implement new custom gym envs? submitted by /u/NavirAur [link] [comments]  ( 85 min )
Is the entropy used in SAC and PPO different?
    Hi, I would like to know if the implementation of entropy in SAC and PPO is different. If yes, what is the difference? Thanks submitted by /u/4thfever [link] [comments]  ( 84 min )
  • Open

    [N] BigScience Releases their 176 Billion Parameter Open-access Multilingual Language Model
    BigScience recently released their new open-access (with weights) massive 176B language model that looks incredibly promising. The size is comparable to OpenAI's largest GPT-3 model. More info about the model can be found on BigScience's blog. You can play with the model interactively, for free(!) on Huggingface. submitted by /u/MonLiH [link] [comments]  ( 86 min )
    [R] Deep Hierarchical Planning from Pixels ( Director ) - Google 2022
    Paper: https://arxiv.org/pdf/2206.04114.pdf https://ai.googleblog.com/2022/07/deep-hierarchical-planning-from-pixels.html?m=1 Abstract: Intelligent agents need to select long sequences of actions to solve complex tasks. While humans easily break down tasks into subgoals and reach them through millions of muscle commands, current artificial intelligence is limited to tasks with horizons of a few hundred decisions, despite large compute budgets. Research on hierarchical reinforcement learning aims to overcome this limitation but has proven to be challenging: current methods rely on manually specified goal spaces or subtasks, and no general solution exists. We introduce Director, a practical method for learning hierarchical behaviors directly from pixels by planning inside the latent space of a learned world model. The high-level policy maximizes task and exploration rewards by selecting latent goals and the low-level policy learns to achieve the goals. Despite operating in latent space, the decisions are interpretable because the world model can decode goals into images for visualization. Director outperforms exploration methods on tasks with sparse rewards, including 3D maze traversal with a quadruped robot from an egocentric camera and proprioception, without access to the global position or top-down view that was used by prior work. Director also learns successful behaviors across a wide range of environments, including visual control, Atari games, and DMLab levels. submitted by /u/Singularian2501 [link] [comments]  ( 86 min )
    [D] Does vector prediction merit using a multivariate output model?
    I am building a framework that predicts a displacement vector for a series of points on a map, using features from those points. There’s evidence of a relationship between the correlation coefficient of vector values (i.e. x and y-displacement) and some of the features. Would this merit using a multivariate output model (likely gradient boosting tree regression) or should I use two univariate output models? If not, what should I be looking into? submitted by /u/Boring-Violinist8291 [link] [comments]  ( 85 min )
    [P] Ensembling with multiple independent time-series
    I'm working on a project in which I have N independent time-series datasets, which can be thought of as prices for different currencies/crypto-coins etc. I've structured my dataset such that for each training batch, the first dimension is the index of the time series. I have a prediction model based on a couple of papers, which takes in a sliding window and outputs a prediction of the time series. Question: What is the best way to build an ensemble of this model, such that predictions for each time series aren't affected by the others? When I say "aren't affected by other time series", I mean that the average of predictions of two different models trained on two different series might not be as accurate/precise as the predictions by themselves (without averaging)... Should I have N different models, one for each time series, and just average the predictions? Should I have some K number of models with different loss functions and then average those? What would be a good strategy? submitted by /u/takeafuckinsipp [link] [comments]  ( 86 min )
    [R] DiBB: Distributing Black-Box Optimization
    Author here. Just presented this work at GECCO 2022. Quick summary: https://twitter.com/giuse_tweets/status/1546920346015637505 Paper: https://exascale.info/assets/pdf/cuccu2022gecco.pdf Code + tutorials: https://github.com/giuse/dibb Experiments (COCO/BBOB-LS): https://github.com/eXascaleInfolab/dibb_coco Recorded rehearsal of the talk: https://tinyurl.com/dibb-video AMA! submitted by /u/giuse_tweets [link] [comments]  ( 85 min )
    [D]Oleh Rybkin, UPenn, on exploration and planning with world models
    Here is a podcast with Oleh Rybkin where we discuss agents that explore and plan (and do yoga), how to learn world models from video, what's missing from current RL research, and much more! submitted by /u/thejashGI [link] [comments]  ( 85 min )
[D] Does it make sense to generate text sequences with Transformer-based models and then have a classifier choose between multiple options?
    Hello, I have a topic for discussion: Are you aware of systems which have a sequence-to-sequence architecture such as a Transformer, generating multiple outputs for a given task, and then another model (an MLP, another Transformer, or something else) which learns to pick the best option? Is it possible for this extra step to extract more knowledge from the given data and increase the performance of the pipeline (even at the cost of more computing power)? In what contexts does (or doesn't) that make sense? submitted by /u/IllustriousCicada603 [link] [comments]  ( 87 min )
    [P] Token-to-Token ViT Implementation in Flax
    An open-source implementation of the Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet research paper in Google's JAX and Flax. "Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fail…  ( 87 min )
    [P] Run transformers model inference in C/C++ and Assembly with the Python C API
    This post presents a way to run transformers models via the Python C API. The referenced notebook loads two txtai workflows, one that translates English to French and another that summarizes a webpage. After loading the models through C code, another example runs the workflows through assembly to show this works with any native code. Full code links: Notebook | GitHub submitted by /u/davidmezzetti [link] [comments]  ( 86 min )
    [P] DALL·E Mini & Mega demo and production API
    Hi all - we've just put out the community DALL·E models on Playgrounds.ai: Mega - https://playgrounds.ai/models/dalle-mega Mini - https://playgrounds.ai/models/dalle-mini You can use these models via API on PipelineCloud here: https://dashboard.pipeline.ai The per-image cost for the models is approximately: Mega - $0.0014 (~10s of compute for 4 images) Mini - $0.00062 (~10s of compute for 9 images) This is for people who want to use these models in their apps/products or just play around with the demos and have fun! submitted by /u/paulcjh [link] [comments]  ( 87 min )
    [R] On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence
    submitted by /u/hardmaru [link] [comments]  ( 85 min )
    [D] How to choose best model during training if validation loss fluctuates a lot?
    I am training a deep neural network. Unfortunately, I have few samples in my validation set, so the validation loss fluctuates a lot. How can I choose the best model during training? Usually I choose the model associated with the lowest validation loss, but now random fluctuations can produce a spuriously low loss. I think the fluctuations are partly because I can't use the whole sample: I am using the free tier of Colab and don't have enough RAM. I tried modifying the Train/Test/Val split to increase the validation size, and the oscillations seem a bit lower, but I would like to keep the 60/20/20 ratio for a more meaningful evaluation. submitted by /u/imunabletocode [link] [comments]  ( 88 min )
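One hedged option for the question above: smooth the noisy validation loss with an exponential moving average and checkpoint on the smoothed value, so single lucky epochs don't win. The smoothing factor and toy loss values are illustrative assumptions.

```python
# EMA-smoothed checkpoint selection (toy values stand in for a real loop).
val_losses = [0.90, 0.70, 0.75, 0.60, 0.80, 0.55]   # per-epoch validation loss
best, ema, alpha = float("inf"), None, 0.3

for epoch, val_loss in enumerate(val_losses):
    ema = val_loss if ema is None else alpha * val_loss + (1 - alpha) * ema
    if ema < best:
        best = ema
        print(f"epoch {epoch}: new best smoothed loss {ema:.3f}")
        # save a checkpoint here, e.g. torch.save(model.state_dict(), "best.pt")
```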
    [P] Helping data scientists access large ML datasets
    I spent so much time building data pipelines, which feels like a huge constraint on my time and ability to focus on actual ML tasks. That's why I'm building subtask.net, which collects and builds large, constantly updated ML datasets from across the internet. The goal is to cut out the data collection part of any ML project and make more datasets available beyond the typical open-source datasets provided by the community. submitted by /u/subtask_net [link] [comments]  ( 86 min )
    [P] Building efficient ML applications with Taichi's automatic differentiation
    Hey guys, I am working on an open-source parallel programming language, Taichi Lang, which I find efficient for differentiable physical simulation and which can help speed up the convergence of ML training. The original post includes a GIF demo of Taichi's built-in autodiff (automatic differentiation) system: you can move the target as you wish, and the magic fountain always changes its trajectory accordingly to hit it. Basically, Taichi Lang's source code transformation system generates gradient kernels at compile time, and the lightweight tape in the Python scope records the launched Taichi kernels and replays the gradient kernels in reverse order during backpropagation. Model training is done within 10 optimization iterations. A step-by-step explanation: https://www.reddit.com/user/mingrui-zhang/comments/vx49mz/training_a_magic_fountain_using_taichis_autodiff/ Source code: https://github.com/taichi-dev/taichi/blob/master/python/taichi/examples/autodiff/diff_sph/diff_sph.py submitted by /u/mingrui-zhang [link] [comments]  ( 86 min )
    [D] Understanding how hardware plays a role in creating AI models
    I'm wondering if there are any articles/videos/reddit posts focused on explaining everything to know about hardware and its impact on AI (cores, tensor cores, threading, etc.). I have a lot of background on the software side, so code optimization isn't something that I've thought too much about, but I'm currently building my own PC, so I do need this information (I'm not looking for a guide, because that won't help me learn; I want to learn all this stuff from the ground up). Any recommendations on where I can learn more about this? Thanks! submitted by /u/anacondavibes [link] [comments]  ( 86 min )
    [D] How do you verify the novelty of your research?
    While working on my own research and struggling to find related work, it got me thinking. What process do you follow to discover preexisting research similar to your own? With the fast pace of research in the field, and so much overlapping terminology, do you use fancy tools, or do you go beyond just typing queries into Google Scholar until you get papers relevant to your own? How do you find what you don't know to look for? submitted by /u/ajt9000 [link] [comments]  ( 95 min )
    [D] Efficiently choose good papers in top-tier conferences
    Hey, as a senior PhD student, I still feel a bit tired of looking for and reading through the mass of newly accepted papers in top-tier conferences/journals like NeurIPS/ICML/ICLR/JMLR/CVPR... Any suggestions for efficiently selecting good papers? submitted by /u/Ok-Wind-1215 [link] [comments]  ( 88 min )
    [R] Machine Learning Operations (MLOps): Overview, Definition, and Architecture
    Paper: https://arxiv.org/ftp/arxiv/papers/2205/2205.02302.pdf Abstract: The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. However, it is highly challenging to automate and operationalize ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices, sets of concepts, and development culture. However, MLOps is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research, including a literature review, a tool review, and expert interviews. As a result of these investigations, we provide an aggregated overview of the necessary principles, components, and roles, as well as the associated architecture and workflows. Furthermore, we furnish a definition of MLOps and highlight open challenges in the field. Finally, this work provides guidance for ML researchers and practitioners who want to automate and operate their ML products with a designated set of technologies. submitted by /u/Singularian2501 [link] [comments]  ( 86 min )
    "[Project]" Brainchop: In Browser 3D Segmentation. Now 50 and 104 Brain Segmentations. (Follow up).
    A demo video is embedded in the original post. Live Demo: brainchop.org Brainchop is a client-side web application for automatic segmentation of MRI volumes. We make the implementation of Brainchop freely available, releasing its pure JavaScript code as open source. We appreciate your ideas/feedback/comments here or on the discussion board, and please star Brainchop if you like it, to keep it going. submitted by /u/Character-Rip-5824 [link] [comments]  ( 85 min )
  • Open

    “Paranoid Android” created on Pixelz.ai by user - Prompt in comments 👇🏽
    submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    Amazon Rekognition takes over the internet
    submitted by /u/NarcoticSlug [link] [comments]  ( 84 min )
    Alien Architecture Generated By AI
    submitted by /u/Electronic-Dealer-71 [link] [comments]  ( 83 min )
    bonsai-bt: A Behavior Tree library in Rust for creating complex AI logic https://github.com/Sollimann/bonsai
    submitted by /u/Sollimann [link] [comments]  ( 84 min )
The test that could change everything
    submitted by /u/kbf_ [link] [comments]  ( 84 min )
    Interview with AGI Journalist who covered DeepBlue/Kasparov & AlphaGo/Sedol in person. Interview on interesting insights - subscribe for similar AI content soon! :)
    submitted by /u/joemurray1994 [link] [comments]  ( 84 min )
    Oleh Rybkin, UPenn, on exploration and planning with world models
    Here is a podcast with Oleh Rybkin where we discuss agents that explore and plan (and do yoga), how to learn world models from video, what's missing from current RL research, and much more! submitted by /u/thejashGI [link] [comments]  ( 84 min )
    BigScience AI Researchers Open-Source ‘BLOOM’: An Autoregressive Multilingual Large Language Model Larger Than GPT-3 and OPT-175B
    BigScience Project introduces BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), the first multilingual Large Language Model (LLM) trained in complete transparency by the largest group of AI academics. Unlike the traditional secrecy of industrial AI research laboratories, the project demonstrates the possibility of training promising AI models published by the larger research community responsibly and openly. ✅ Transformers-based LLM ✅ 176B parameters (larger than GPT-3 and OPT-175B) ✅ Trained on 1.6TB text data, the equivalent of 320 times the complete works of Shakespeare Continue reading | Download submitted by /u/ai-lover [link] [comments]  ( 84 min )
    i experimented a bit with ai, that's what i get 😈
    submitted by /u/nalr00n [link] [comments]  ( 84 min )
    Top 10 AI Jobs and The Best Places to Find Them
    This infographic shows the top job roles requiring AI and ML skills as well as the most attractive cities for AI jobs and the best companies in the field to work for. submitted by /u/Emily-joe [link] [comments]  ( 84 min )
    Psalms 34 completely illustrated with MidjourneyAI art - none of these images were post edited in any way, more details about creation in the description of the video
    submitted by /u/Racer_x32 [link] [comments]  ( 86 min )
    Sclera, Iris and Pupil Detector
    submitted by /u/Gloomy_Recognition_4 [link] [comments]  ( 86 min )
    73% of people mistook AI-generated images for human-made artwork
    submitted by /u/KazRainer [link] [comments]  ( 84 min )
    Hard rules in a GAN Neural Network.
    I have a script that can accept/reject outputs of the generator based on a set of rules, and I want to integrate it into the GAN, however, I'm not sure how to do so without breaking the math of the backpropagation and other stuff. What is the correct approach to this problem? submitted by /u/iLoveNintend0 [link] [comments]  ( 84 min )
    Heuristics and Algorithms in AI
    So this might be a more theoretical question and I'm not sure it's related to this sub, but I'll shoot my shot anyway: BFS, DFS, iterative deepening, and uniform cost search are all algorithms that find a path in our domain of states and make no use of heuristics; they are what we call "blind search". Uniform cost search makes use of the weights on the edges between nodes, and the other three just blindly go through the nodes as if each edge had a weight of 1. Greedy best-first search and A* are both algorithms that make use of a heuristic, which is basically a function that gives an estimate, for a node n, of the cost to reach the target node. I keep getting confused about each of them, so I would like to know if what I wrote above is correct. Thank you for your time. EDIT: I haven't talked about completeness and optimal heuristics because I think I have those down just fine. submitted by /u/Alternative_Shoe2623 [link] [comments]  ( 84 min )
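A small sketch that ties the two families in the post above together: A* expands nodes by f(n) = g(n) + h(n), where g is the cost so far and h is the heuristic; setting h = 0 makes f = g and recovers uniform cost search, i.e. blind search. The toy graph is illustrative.

```python
# A* search; pass h=lambda n: 0 to get uniform cost search (Dijkstra).
import heapq

def a_star(graph, start, goal, h):
    """graph: {node: [(neighbor, edge_cost), ...]}; h: heuristic function."""
    frontier = [(h(start), 0, start, [start])]     # (f, g, node, path)
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(frontier,
                               (g + cost + h(nxt), g + cost, nxt, path + [nxt]))
    return None

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
print(a_star(graph, "A", "C", h=lambda n: 0))  # (2, ['A', 'B', 'C'])
```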
    I programmed Minecraft to control real LEDs when I look at the corresponding color in Minecraft (using computer vision as a real time data collection system)
    submitted by /u/MrDemonFrog [link] [comments]  ( 84 min )
    Wondrous Fairy Escapade | Cinematic 4K 24 FPS (FILM)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
  • Open

    Grand Entrance: Human Horizons Unveils Smart GT Built on NVIDIA DRIVE Orin
    Touring vehicles just became a little more grand. Electric vehicle maker Human Horizons provided a detailed glimpse earlier this month of its latest production model, the GT HiPhi Z. The intelligent EV is poised to redefine the grand tourer category with innovative, software-defined capabilities that bring luxurious cruising to the next level. The vehicle’s marquee Read article > The post Grand Entrance: Human Horizons Unveils Smart GT Built on NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 5 min )
    Merge Ahead: Researcher Takes Software Bridge to Quantum Computing
    Kristel Michielsen was into quantum computing before quantum computing was cool. The computational physicist simulated quantum computers as part of her Ph.D. work in the Netherlands in the early 1990s. Today, she manages one of Europe’s largest facilities for quantum computing, the Jülich Unified Infrastructure for Quantum Computing (JUNIQ). Her mission is to help Read article > The post Merge Ahead: Researcher Takes Software Bridge to Quantum Computing appeared first on NVIDIA Blog.  ( 6 min )
    Sequences That Stun: Visual Effects Artist Surfaced Studio Arrives ‘In the NVIDIA Studio’
    Visual effects savant Surfaced Studio steps In the NVIDIA Studio this week to share his clever film sequences, Fluid Simulation and Destruction, as well as his creative workflows. These sequences feature quirky visual effects that Surfaced Studio is renowned for demonstrating on his YouTube channel. The post Sequences That Stun: Visual Effects Artist Surfaced Studio Arrives ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Artificial intelligence model finds potential drug molecules a thousand times faster
    A geometric deep-learning model is faster and more accurate than state-of-the-art computational models, reducing the chances and costs of drug trial failures.  ( 6 min )
  • Open

    Revisiting Mask Transformer from a Clustering Perspective
    Posted by Qihang Yu, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Due to its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as “person” and “sky”, to every pixel in an image) and instance segmentation (identifying and segmenting only countable objects, such as “pedestrians” and “cars”, in an image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results from each sub-task stage. This process is not only complex, but it also introduces many hand-designed priors when processing sub-tasks and …  ( 23 min )
  • Open

    Real-Time Apps: Why Node.js is the Ideal Choice
    In a world where technology is evolving at a tremendous pace, it comes as no surprise that there’s an increase in demand for apps that interact with users in real time. And, it is no secret that the development of real-time apps is an extremely popular concept in the global market, thanks to rapid digitalization… Read More »Real-Time Apps: Why Node.js is the Ideal Choice The post Real-Time Apps: Why Node.js is the Ideal Choice appeared first on Data Science Central.  ( 18 min )
    Web Analytics Dashboards Carry a World of Data for Various Purposes
    Web analytics tools offer vital insights into your website’s visitors’ behavior by tracking their real-time activities on the platform from behind. These tools study almost everything – the number of daily and regular visitors, sessions and duration, conversions, and beyond. You can access a comprehensive report covering every aspect and personalize it to focus on… Read More »Web Analytics Dashboards Carry a World of Data for Various Purposes The post Web Analytics Dashboards Carry a World of Data for Various Purposes appeared first on Data Science Central.  ( 18 min )
    Top Picks for Blockchain Certifications
    Blockchain Certifications and cryptocurrency have become popular among many new internet businesses. The security and transparency this technology offers are some of the reasons why cryptocurrency has gained popularity over the past years. Blockchain technology has remained to be the backbone of cryptocurrency. It is particularly helpful in maintaining data related to public transactions. The best… Read More »Top Picks for Blockchain Certifications The post Top Picks for Blockchain Certifications appeared first on Data Science Central.  ( 19 min )
  • Open

    Understanding the Design of a Convolutional Neural Network
    Convolutional neural networks have been found successful in computer vision applications. Various network architectures are proposed and they are neither magical nor hard to understand. In this tutorial, we will make sense of the operation of convolutional layers and their role in a larger convolutional neural network. After finishing this tutorial, you will learn: How […] The post Understanding the Design of a Convolutional Neural Network appeared first on Machine Learning Mastery.  ( 14 min )
  • Open

    How to make awesome datasets fast with Scrapy in Python
    Scrapy is a highly customizable and developer-friendly crawling framework in Python. It can help you build a wonderful crawler in a few lines to…  ( 11 min )
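For a taste of how little code a basic Scrapy crawler needs, here is a minimal spider modeled on Scrapy's own tutorial; the target site (a public scraping sandbox) and the CSS selectors are illustrative. Run it with `scrapy runspider quotes_spider.py -o quotes.json`.

```python
# quotes_spider.py: a minimal Scrapy spider that follows pagination.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next" link to build the full dataset.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```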
  • Open

    Conway’s factoring trick
    The numbers 152 through 156 have a lot of small prime factors. I’ll be more explicit about that shortly, but take my word for it for now. John Conway [1] took this simple observation and turned it into a technique for mentally factoring integers. Conway’s factoring method To look for factors of a number n, […] Conway’s factoring trick first appeared on John D. Cook.  ( 7 min )
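To spell out the observation the post opens with: 152 = 2^3*19, 153 = 3^2*17, 154 = 2*7*11, 155 = 5*31, and 156 = 2^2*3*13, and remainders modulo numbers near 150 are easy to compute mentally. A sketch of how these facts detect small prime factors (if p divides m and p divides n mod m, then p divides n):

```python
# Conway-style divisibility checks via the factored numbers 152..156.
def small_prime_hits(n):
    table = {152: [2, 19], 153: [3, 17], 154: [2, 7, 11], 155: [5, 31],
             156: [2, 3, 13]}
    hits = set()
    for m, primes in table.items():
        r = n % m                      # easy mentally, since m is near 150
        hits.update(p for p in primes if r % p == 0)
    return sorted(hits)

print(small_prime_hits(323))  # 323 = 17 * 19 -> [17, 19]
```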
  • Open

    Sentiment Analysis of Stocktwits Messages using LSTM in PyTorch
    submitted by /u/Vasilkosturski [link] [comments]  ( 84 min )

  • Open

    [D] How to work with audio data?
    I have to work on an ML model which listens to sounds and classifies them as rat squeaks or not for my college project. I have already created a model using MFCCs to convert the audios into float arrays (which are called feature vectors, I think, though I'm not 100% sure). I changed the sampling frequency every time I took a different audio as input, in order to get the same number of arrays as output of the MFCC; I noticed changing the sampling rate changed the number of arrays outputted (I think the relevant term is hop_length). I couldn't use librosa because I couldn't install llvmlite after spending about half a day on it. Then I took each float in the arrays (61 arrays per sound, each containing 13 values) as a feature and ran an RFC (793 columns in total). My dataset is also just 159 sounds, most of which come from a machine-squeak dataset which my teammate manually labelled as yes if they sounded like rat squeaks and no otherwise, plus about 15 actual rat sounds mixed in (for which I had to change the hop_length again; I don't really know what it is, but I had to get the array lengths to match. I looked up a lot on the internet but couldn't find a rat sound dataset, nor anyone who could explain MFCCs properly). Needless to say, my ML model is quite inaccurate. Anyway, I think there has to be a better method than this to deal with audio data and classify it. Can anyone with experience in this help me out? Thanks. submitted by /u/Spinner4177 [link] [comments]  ( 87 min )
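A hedged sketch addressing the variable-length issue in the post above: keep the sample rate fixed and pad or truncate the MFCC frame axis to a fixed number of frames, instead of retuning the sampling frequency per file. It assumes python_speech_features (which needs no llvmlite) and mono WAV input; 61 frames x 13 coefficients matches the 793 features mentioned.

```python
# Fixed-length MFCC features by padding/truncating frames, not resampling.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

def fixed_length_mfcc(path, n_frames=61, numcep=13):
    rate, signal = wavfile.read(path)          # assumes a mono WAV file
    feats = mfcc(signal, samplerate=rate, numcep=numcep)  # (frames, numcep)
    if len(feats) >= n_frames:
        return feats[:n_frames]                 # truncate long clips
    pad = np.zeros((n_frames - len(feats), numcep))
    return np.vstack([feats, pad])              # zero-pad short clips

# X = np.stack([fixed_length_mfcc(p).ravel() for p in wav_paths])  # 61*13 = 793
```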
    [P] Paper Implementation - Extracting Training Data from Large Language Models
    A re-implementation of the famous 2020 paper - "Extracting Training Data from Large Language Models" by Nicholas Carlini, Florian Tramer et al. Code - https://github.com/shreyansh26/Extracting-Training-Data-from-Large-Langauge-Models The official implementation is great and I definitely learned a few things from it. In the re-implementation, I have also included the temperature-decay sampling and the sliding-window-based minimum perplexity metric, which were not present in the official implementation. I checked the extracted samples (refer to the GitHub repo) and they surely contained some memorized information. submitted by /u/shreyansh26 [link] [comments]  ( 85 min )
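    For readers unfamiliar with the sliding-window minimum-perplexity idea mentioned above, here is a small sketch of the metric (not the repo's code): score a generated sample by the lowest perplexity over any fixed-size window of its per-token negative log-likelihoods, so a memorized fragment inside a longer sample still stands out. The window size is illustrative.

```python
# Sliding-window minimum perplexity over per-token NLLs (illustrative).
import numpy as np

def min_window_perplexity(token_nlls, window=50):
    nlls = np.asarray(token_nlls, dtype=float)
    if len(nlls) <= window:
        return float(np.exp(nlls.mean()))
    # mean NLL over every window via a cumulative sum
    csum = np.concatenate([[0.0], np.cumsum(nlls)])
    means = (csum[window:] - csum[:-window]) / window
    return float(np.exp(means.min()))  # lowest perplexity = most suspect
```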
    [P] ScalableViT Implementation in Flax
    An open-source implementation of the ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer research paper in Google's JAX and Flax. "The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrix while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance in general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification." - Rui Yang, Hailong Ma, Jie Wu, Yansong Tang, Xuefeng Xiao, Min Zheng, Xiu Li Github repository for the Flax / JAX model: https://github.com/conceptofmind/Scalable-ViT-flax ScalableViT Research Paper: https://arxiv.org/abs/2203.10790 In collaboration with Lucid: https://github.com/lucidrains submitted by /u/EnricoShippole [link] [comments]  ( 86 min )
    [D] Instance segmentation using transformers
    Hi folks! I am looking for beginner-friendly and easy to implement papers on instance segmentation using transformers. Any help will be appreciated!! submitted by /u/cheemsdoge69 [link] [comments]  ( 85 min )
    [D] Speech Enhancement SOTA
    Audio denoising (removing background noise from audio), often referred to as Speech Enhancement, has been a mildly popular research field up to 2020. This was due to COVID and the need to filter unwanted noise from calls. However, I'm not sure where we're at today: Music Source Separation has been improved by TikTok's and Deezer's research; Meta's denoiser looks like the most standard, production-ready model, and it implements a 2020 paper. I'd like to search for more alternatives, but I struggle to find any: Googling "Denoising" leads to image noise removal, and Papers with Code's "Speech denoising" and "Audio denoising" categories are pretty empty. The "Speech Enhancement" category seems to be the real deal, but the top models don't have any pretrained version available. Is there a model that outperforms Meta's denoiser while remaining open-source with an available pretrained model? submitted by /u/chaude_patate [link] [comments]  ( 86 min )
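    For anyone wanting to try the Meta denoiser mentioned above, usage is roughly as follows per the project's README (hedged: the exact API may differ between versions, and "noisy.wav" is a placeholder input file):

```python
# Rough usage sketch for facebookresearch/denoiser (per its README).
import torch
import torchaudio
from denoiser import pretrained
from denoiser.dsp import convert_audio

model = pretrained.dns64()                    # pretrained DNS model
wav, sr = torchaudio.load("noisy.wav")        # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.chin)
with torch.no_grad():
    denoised = model(wav[None])[0]            # enhanced waveform
```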
    [R] DA-Faster RCNN
    Hello, I have reimplemented DA-Faster RCNN using Detectron2, one of the most important architectures for domain adaptation in object detection. This implementation is easy to use and can also be used with Google Colab :) Here is the link: https://github.com/GiovanniPasq/DA-Faster-RCNN submitted by /u/CapitalShake3085 [link] [comments]  ( 85 min )
    [D] What is your go-to algorithm for Multiple Object Tracking with possible long time occlusions?
    I'm interested in tracking cars and people, with the ability to handle occlusion of objects that might not be moving. The things I've tried (DeepSORT, ByteTrack) are decent but not amazing. There are a few recent studies on using transformers for tracking, but those models are heavy and not really production material, having deformable convolutions in them (hard or impossible to convert to TorchScript and TensorRT) and all. What's your go-to algorithm for this kind of problem? submitted by /u/InfiniteLife2 [link] [comments]  ( 87 min )
    [D] Modeling Adjacency Matrix
    Let's assume I have some directed adjacency matrix A at time t and another adjacency matrix B at time t+1. I want to learn a mapping from A to B through some model f (suppose f is a neural network). Now, how should I create this model? Should I use just Dense layers, or GNNs, or something else? submitted by /u/Labib666Camp [link] [comments]  ( 86 min )
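    A natural first baseline, sketched below under the assumption that the graph is small and of fixed size: flatten A, use an MLP, and treat every entry of B as a binary prediction. A GNN becomes the better choice when permutation structure or varying graph sizes matter.

```python
# Baseline sketch: learn f(A) -> B with an MLP over flattened adjacency
# matrices, treating each possible edge in B as a binary label.
import torch
import torch.nn as nn

n = 32  # illustrative number of nodes
f = nn.Sequential(
    nn.Linear(n * n, 256), nn.ReLU(),
    nn.Linear(256, n * n),           # one logit per possible edge in B
)
loss_fn = nn.BCEWithLogitsLoss()

A = torch.randint(0, 2, (1, n * n)).float()  # placeholder A at time t
B = torch.randint(0, 2, (1, n * n)).float()  # placeholder B at time t+1
loss = loss_fn(f(A), B)
loss.backward()                              # ready for an optimizer step
```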
    [P] Semi-supervised learning for tabular data: VIME
    A lot of recent DL models for tabular data have used some sort of pre-training to increase robustness and performance metrics on smaller/noisy datasets. That's why I've decided to write a deep-dive blog on the VIME paper, which was one of the first to suggest pre-training tasks specific to tabular data. It comes with an accompanying repo that contains all the code and notebooks. From some personal testing that I've done, pre-training does improve performance when we're dealing with very few labels (1-5% of the dataset). Of course, the best solution is to always get more labels lol, but when that's not possible, pre-training schemes like VIME can give you a small boost in performance. Give it a read and let me know what you think! I'll keep covering some interesting deep tabular architectures, so maybe also let me know which one you would want me to cover next! submitted by /u/blessedorcursed [link] [comments]  ( 86 min )
    [R] Closed-Form Diffeomorphic Transformations for Time Series Alignment
    Paper: https://arxiv.org/pdf/2206.08107.pdf Code: https://github.com/imartinezl/difw Abstract: Time series alignment methods call for highly expressive, differentiable and invertible warping functions which preserve temporal topology, i.e. diffeomorphisms. Diffeomorphic warping functions can be generated from the integration of velocity fields governed by an ordinary differential equation (ODE). Gradient-based optimization frameworks containing diffeomorphic transformations require calculating derivatives of the differential equation's solution with respect to the model parameters, i.e. sensitivity analysis. Unfortunately, deep learning frameworks typically lack automatic-differentiation-compatible sensitivity analysis methods; and implicit functions, such as the solution of an ODE, require particular care. Current solutions appeal to adjoint sensitivity methods, ad-hoc numerical solvers or ResNet's Eulerian discretization. In this work, we present a closed-form expression for the ODE solution and its gradient under continuous piecewise-affine (CPA) velocity functions. We present a highly optimized implementation of the results on CPU and GPU. Furthermore, we conduct extensive experiments on several datasets to validate the generalization ability of our model to unseen data for time-series joint alignment. Results show significant improvements both in terms of efficiency and accuracy. submitted by /u/inigomlap [link] [comments]  ( 87 min )
    [R] An awesome collection of Federated learning & Blockchain research papers in the Healthcare domain
    An awesome collection of Federated Learning & Blockchain research papers in the Healthcare domain. Federated learning, a mechanism for training a shared global model with a central server while keeping all the sensitive data in the local institutions where the data belong, provides great promise to connect fragmented healthcare data sources with privacy preservation. This repo contains a curated list of Federated Learning papers/resources and recent advancements in Healthcare. As of now, ~330 papers. PRs welcome. https://github.com/monk1337/Aweome-Heathcare-Federated-Learning submitted by /u/aadityaura [link] [comments]  ( 85 min )
    [P] I used Note System on MNIST; training speed was increased by more than two times! You can view this project on my GitHub.
    submitted by /u/7NoteDancing [link] [comments]  ( 85 min )
    [D] Next big thing in the field
    Do you guys have any forecasts for the next big model/algorithm/concept in DL? We had CNNs disrupting the field in ~2015, then GANs became a big deal, RL has grown quite a lot, Transformers trended recently, and now Diffusion models are moving probabilistic ML forward (sorry if I missed something). What other not fully investigated or underestimated concepts with high potential are there? submitted by /u/AdelSexy [link] [comments]  ( 93 min )
    [D] Why are Corgi dogs so popular in machine learning (especially in the image generation community)?
    For example, here's part of OpenAI's GLIDE paper: https://preview.redd.it/b6vkxyb3xua91.png?width=1225&format=png&auto=webp&s=15d56f256e323bb54d22eb9fdc0538644060c4a7 submitted by /u/Azuresonance [link] [comments]  ( 90 min )
  • Open

    I made cursed cartoon characters using Dream by Wombo
    submitted by /u/GetFlappy [link] [comments]  ( 84 min )
    Endless fun with HP creations 🧙🏼Voldemort Sketch on Pixelz.ai
    submitted by /u/mdfnb [link] [comments]  ( 84 min )
    The old and the new
    submitted by /u/deephugs [link] [comments]  ( 83 min )
    Is the brain an AI made of other AIs?
    If the brain can be broken down into multiple single-function specializing parts, what's stopping engineers from designing an AI for each of those parts and having all of them feed the resulting data into one overarching AI that, in turn, eats up that data and outputs... magic? Just a thought. I'm bored and I have 0 competence in AI, just a curious layman. Hope you may indulge my ignorance. Cheers! submitted by /u/TWHreddit [link] [comments]  ( 85 min )
    Weekly China AI News: Meet World's 1st Redstonic Neural Network in Minecraft; Shenzhen Holds Self-Driving Car Drivers Responsible for Crashes; AI Brings Back Decades-Old Concert
    submitted by /u/trcytony [link] [comments]  ( 84 min )
    Paper Implementation - Extracting Training Data from Large Language Models
    A re-implementation of the famous 2020 paper - "Extracting Training Data from Large Language Models" by Nicholas Carlini, Florian Tramer et al. Code - https://github.com/shreyansh26/Extracting-Training-Data-from-Large-Langauge-Models The official implementation is great and I definitely learned a few things from it. In the re-implementation, I have also included the temperature-decay sampling and the sliding-window-based minimum perplexity metric, which were not present in the official implementation. I checked the extracted samples (refer to the GitHub repo) and they surely contained some memorized information. submitted by /u/shreyansh26 [link] [comments]  ( 84 min )
    Anyone Doing Andrew Ng's Machine Learning Specialization?
    submitted by /u/biggbrother23 [link] [comments]  ( 83 min )
    New Open-Source Large Language Model 'Bloom' Does 40+ Languages And Has 176 Billion Parameters
    submitted by /u/getrich_or_diemining [link] [comments]  ( 84 min )
    Large language models might reason—if you know how to speak to them
    submitted by /u/bendee983 [link] [comments]  ( 84 min )
    I have used the amazing innovation of frame interpolation to make 60fps memes
    submitted by /u/oliviagolds [link] [comments]  ( 85 min )
    It's time for AI generators incorporated!!!
    submitted by /u/GroundbreakingLaw878 [link] [comments]  ( 83 min )
    Ray Kurzweil Wants to Upload Your Brain to the Cloud
    submitted by /u/jormungandrsjig [link] [comments]  ( 85 min )
    Have you ever used an AI-powered photo editor? Could someone give me some advice on using it?
    submitted by /u/Lower_Peanut_9665 [link] [comments]  ( 84 min )
    Is there an AI that can make analog horror from text?
    I'm looking for a bit of a scare, so if there is one, link it to me :D submitted by /u/GroundbreakingLaw878 [link] [comments]  ( 85 min )
    My generated art in NightCafe
    submitted by /u/GroundbreakingLaw878 [link] [comments]  ( 84 min )
  • Open

    AI on the Sky: Stunning New Images From the James Webb Space Telescope To Be Analyzed by, Train AI
    The unveiling by U.S. President Joe Biden Monday of the first full-color image from the James Webb Space Telescope is already astounding — and delighting — humans around the globe. “We can see possibilities nobody has ever seen before, we can go places nobody has ever gone before,” Biden said during a White House press Read article > The post AI on the Sky: Stunning New Images From the James Webb Space Telescope To Be Analyzed by, Train AI appeared first on NVIDIA Blog.  ( 5 min )
    Windfall: Omniverse Accelerates Turning Wind Power Into Clean Hydrogen Fuel
    Engineers are using the NVIDIA Omniverse 3D simulation platform as part of a proof of concept that promises to become a model for putting green energy to work around the world. Dubbed Gigastack, the pilot project — led by a consortium that includes Phillips 66 and Denmark-based renewable energy company Ørsted — will create low-emission Read article > The post Windfall: Omniverse Accelerates Turning Wind Power Into Clean Hydrogen Fuel appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    "CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships", Roelofs et al 2022 {Waymo}
    submitted by /u/gwern [link] [comments]  ( 84 min )
    "Director: Deep Hierarchical Planning from Pixels", Hafner et al 2022 {G} (hierarchical RL over world models)
    submitted by /u/gwern [link] [comments]  ( 84 min )
    "Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning", Fu et al 2022 (effectiveness of policy gradient MARL)
    submitted by /u/gwern [link] [comments]  ( 84 min )
    Visual (pixel based) RL, CNNs & Autoencoders
    There's been a lot of hype around visual RL (using pixel input for the agent's network) ever since DeepMind's DQN back in 2015. However, to the best of my knowledge, there hasn't been a lot of published work since then that uses images as observations. Therefore, I have a few questions/discussion points for the community: Have there been many/any notable image-based RL agents since DQN? If so, could you point me towards some? Are CNNs a good way to approach this type of RL? Could the CNN be trained independently of the agent, so that once the CNN is trained and can extract features and provide them as input to the agent, we can focus on training the agent and tuning its specific hyperparameters? How would one train a CNN independently from the agent, and what would the CNN be trying to do? This leads me to think that autoencoders may be a good solution, since one can train them to reconstruct the original image and then use the trained encoder to build a latent space/compact feature representation of the original image as the input to the agent during training. Is this a good/bad idea? Has work been done in this area? If so, could you point me towards it? This may seem like a lot, but hopefully the evolution of my thoughts makes sense and can start a discussion here :) Looking forward to hearing back from the community! submitted by /u/leozinho2r [link] [comments]  ( 86 min )
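    To make the autoencoder idea from the post concrete, here is a sketch (assuming PyTorch and 84x84 grayscale frames, as in the Atari literature): train the encoder-decoder to reconstruct frames, then freeze the encoder and feed its latent vector to the agent as the observation.

```python
# Sketch: conv autoencoder on 84x84 grayscale frames; after training,
# the frozen encoder's latent vector becomes the agent's observation.
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    def __init__(self, latent=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),    # 84 -> 20
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 20 -> 9
            nn.Flatten(), nn.Linear(64 * 9 * 9, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 64 * 9 * 9), nn.ReLU(),
            nn.Unflatten(1, (64, 9, 9)),
            nn.ConvTranspose2d(64, 32, 4, stride=2), nn.ReLU(),  # 9 -> 20
            nn.ConvTranspose2d(32, 1, 8, stride=4),              # 20 -> 84
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

ae = FrameAutoencoder()
frames = torch.rand(16, 1, 84, 84)             # placeholder frame batch
loss = nn.functional.mse_loss(ae(frames), frames)
loss.backward()
# after training: state = ae.encoder(frame) is the agent's input
```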
    PrefixRL: Optimization Of Parallel Prefix Circuits Using DRL {NVIDIA}
    submitted by /u/yazriel0 [link] [comments]  ( 84 min )
  • Open

    LiDAR 3D point cloud labeling with Velodyne LiDAR sensor in Amazon SageMaker Ground Truth
    LiDAR is a key enabling technology in growing autonomous markets, such as robotics, industrial, infrastructure, and automotive. LiDAR delivers precise 3D data about its environment in real time to provide “vision” for autonomous solutions. For autonomous vehicles (AVs), nearly every carmaker uses LiDAR to augment camera and radar systems for a comprehensive perception stack capable […]  ( 13 min )
  • Open

    A Gentle Introduction to tensorflow.data API
    When we build and train a Keras deep learning model, the training data can be provided in several different ways. Presenting the data as a NumPy array or a TensorFlow tensor is a common one. Making a Python generator function and letting the training loop read data from it is another way. Yet another […] The post A Gentle Introduction to tensorflow.data API appeared first on Machine Learning Mastery.  ( 17 min )
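    A short sketch of the input styles the excerpt mentions, assuming TensorFlow 2 with toy random data:

```python
# tf.data sketch: build a Dataset from in-memory arrays or a generator.
import numpy as np
import tensorflow as tf

x = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=(100,)).astype("int64")

# 1) from NumPy arrays already in memory
ds_arrays = tf.data.Dataset.from_tensor_slices((x, y))

# 2) from a Python generator function
def gen():
    for xi, yi in zip(x, y):
        yield xi, yi

ds_gen = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(8,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

# either pipeline can then be shuffled, batched, and prefetched
ds = ds_arrays.shuffle(100).batch(32).prefetch(tf.data.AUTOTUNE)
```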
  • Open

    Why We Need to Move From Data-First to a Knowledge-First World
    We live in a data-rich world. Very data rich. Indeed, it’s estimated that roughly 2.5 quintillion bytes of data are created every day. Perhaps because of its ubiquity, there are those who believe the sheer volume of available data means we have all we need to easily and accurately answer any question without delay. If… Read More »Why We Need to Move From Data-First to a Knowledge-First World The post Why We Need to Move From Data-First to a Knowledge-First World appeared first on Data Science Central.  ( 19 min )
    Critical Role of Analytic Profiles in Developing Data Products
    The tech industry is abuzz with hyped up pontifications and bold predictions of the business-changing potential of Data Products.  I could not be happier as it’s a topic I have explored in several blogs (see the end of this blog for a list of my blogs on Data Products…yea, I know, get a life). A… Read More »Critical Role of Analytic Profiles in Developing Data Products The post Critical Role of Analytic Profiles in Developing Data Products appeared first on Data Science Central.  ( 20 min )
    Metaverse use cases – Which industries could the metaverse impact?
    According to the McKinsey Report called Value Creation in the Metaverse: $120b+ in investment has flowed into the metaverse so far in 2022 79% of consumers active on the metaverse have made a purchase >15% of corporate revenue is expected to come from the metaverse in the next 5 years according to 25% of senior… Read More »Metaverse use cases – Which industries could the metaverse impact? The post Metaverse use cases – Which industries could the metaverse impact? appeared first on Data Science Central.  ( 19 min )
    Features of IIoT (Industrial Internet of Things) Seamless Connectivity and Data Acquisition
    Executing Industrial Internet of Things (IIoT) solutions is vital as the most competitive global manufacturing companies are becoming digital enterprises. Industrial Internet of Things (IIoT) solutions and platforms are leading the reshaping and transformation of landscapes. A pre-built Industrial Internet of Things (IIoT) solution offers the benefit of a ready-made “IoT development kit” with the… Read More »Features of IIoT (Industrial Internet of Things) Seamless Connectivity and Data Acquisition The post Features of IIoT (Industrial Internet of Things) Seamless Connectivity and Data Acquisition appeared first on Data Science Central.  ( 18 min )
    Navigating the Costs of Cloud Networks
    Cloud networks have grown from what was seen as a passing trend by some experts, into full-fledged solutions that power some of the most important parts of various industries at this point. Large companies like Google and Microsoft have been investing steadily in the growth of their own solutions. At the same time, more tightly… Read More »Navigating the Costs of Cloud Networks The post Navigating the Costs of Cloud Networks appeared first on Data Science Central.  ( 21 min )
    Data Observability Vs Data Quality: What makes them different?
    Defining Data Observability and Data Quality As companies gather seemingly endless data streams from an increasing number of sources, they start to amass an ecosystem of data storage, would-be end-users, and pipelines. With each additional layer of complexity, opportunities for data downtime (moments when data is partial, erroneous, missing, or otherwise inaccurate) multiply. As a result,… Read More »Data Observability Vs Data Quality: What makes them different? The post Data Observability Vs Data Quality: What makes them different? appeared first on Data Science Central.  ( 19 min )
    Talent Management and Technology-A Perfect Blend
    Most small to medium businesses this year exhibit a desire to expand their scope of operations and increase their employee count over the course of the next year, a sign that the job market is on an upwards rise. That being said, the best and brightest are likely to be picked relatively quickly so competition… Read More »Talent Management and Technology-A Perfect Blend The post Talent Management and Technology-A Perfect Blend appeared first on Data Science Central.  ( 18 min )
    The Two Types of Agility You Need
    If your business is going to survive, you must be able to read and react to changes in your markets and continuously improve your competitive position.  It’s more important now than it’s ever been. SWOT is a model often employed to characterize a company’s competitive position in terms of Strengths, Weaknesses, Opportunities, and Threats.  If… Read More »The Two Types of Agility You Need The post The Two Types of Agility You Need appeared first on Data Science Central.  ( 21 min )
    Webinar Series - The rise of the Modern Data Stack and the Modern Data Quality Platform
    On Wednesday, July 13th at 11 am EST, please join DQLabs for an exclusive virtual event, “Defining Data Relevance: The rise of the Modern Data Stack and the Modern Data Quality Platform”. The data producers, consumers, and leaders deserve an ecosystem that delivers the data that is relevant to them – one size fits all approaches… Read More »Webinar Series - The rise of the Modern Data Stack and the Modern Data Quality Platform The post Webinar Series - The rise of the Modern Data Stack and the Modern Data Quality Platform appeared first on Data Science Central.  ( 17 min )
  • Open

    Your RPA Implementation Must be at Risk! [Here Are 7 Reasons Why]
    IT leaders are running into several RPA failures. Here, we have covered the top 7 reasons why RPA implementations fail and how you can…  ( 12 min )
    Which tool is the best to make a complete dataset?
    Crawling a website is today an essential skill for anyone working in or with the digital industry. Firstly, I will start by clarifying…  ( 10 min )
    Google Imagen: text-to-image AI
    Google Imagen: a machine learning system that can generate graphics from text input.  ( 6 min )
  • Open

    New Open-Source Bloom AI To Challenge OpenAI & Google Deepmind | Breakthrough Chemical AI System
    submitted by /u/tohelpyou88 [link] [comments]  ( 84 min )
    “The use of Neural Networks in predicting share prices”
    So how accurately can a neural network predict the future prices of stocks in the share market? If there are any good resources that could help me learn more about this, could you please share them? submitted by /u/Suspicious_Speed24 [link] [comments]  ( 88 min )
    What math operations are the bottleneck for running inference on the edge? I’m trying to select an edge accelerator for a product and the industry seems to be fairly immature and it’s difficult to compare units.
    To expand, it seems like there are countless accelerators on the market intended to speed up NN inference. Way more than I have the bandwidth to individually set up environments for testing and evaluation. The other factor is that since the academic world is changing fast, it seems challenging to predict which math operations will end up being the standard, kind of like how the CPU world landed on x86 and RISC. I get the feeling that since everything is in flux, the long hardware development cycles mean we are taking a long time to stabilise and agree on the best ASICs to design. So in the meantime, I am seeking some recommendations for what instructions the current state-of-the-art neural networks (let's say, for computer vision) need. Matrix multiplication? Linear algebra? Fused multiply-add? Is the expectation that this will change over time? Any info would be much appreciated. submitted by /u/meregizzardavowal [link] [comments]  ( 89 min )
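    One way to see why matrix multiplication and fused multiply-add dominate the answers to this question: convolution, the workhorse of vision networks, reduces to a single large matrix multiply via im2col. A small numpy demonstration (illustrative, not tied to any particular accelerator):

```python
# Quick demonstration that a convolution is "just" a matrix multiply
# (im2col + GEMM), which is why fused multiply-add / matmul throughput
# dominates edge inference workloads.
import numpy as np

def conv2d_im2col(image, kernel):
    kh, kw = kernel.shape
    h, w = image.shape
    oh, ow = h - kh + 1, w - kw + 1
    # gather every kh x kw patch as a row -> (oh*ow, kh*kw) matrix
    cols = np.array([image[i:i+kh, j:j+kw].ravel()
                     for i in range(oh) for j in range(ow)])
    return (cols @ kernel.ravel()).reshape(oh, ow)  # one big matmul

img = np.random.rand(6, 6)
k = np.random.rand(3, 3)
ref = np.array([[(img[i:i+3, j:j+3] * k).sum() for j in range(4)]
                for i in range(4)])
assert np.allclose(conv2d_im2col(img, k), ref)
```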
  • Open

    NeuralGrasps: Learning Implicit Representations for Grasps of Multiple Robotic Hands. (arXiv:2207.02959v1 [cs.RO] CROSS LISTED)
    We introduce a neural implicit representation for grasps of objects from multiple robotic hands. Different grasps across multiple robotic hands are encoded into a shared latent space. Each latent vector is learned to decode to the 3D shape of an object and the 3D shape of a robotic hand in a grasping pose in terms of the signed distance functions of the two 3D shapes. In addition, the distance metric in the latent space is learned to preserve the similarity between grasps across different robotic hands, where the similarity of grasps is defined according to contact regions of the robotic hands. This property enables our method to transfer grasps between different grippers including a human hand, and grasp transfer has the potential to share grasping skills between robots and enable robots to learn grasping skills from humans. Furthermore, the encoded signed distance functions of objects and grasps in our implicit representation can be used for 6D object pose estimation with grasping contact optimization from partial point clouds, which enables robotic grasping in the real world.  ( 2 min )
    Investigating Generalization by Controlling Normalized Margin. (arXiv:2205.03940v2 [cs.LG] UPDATED)
    Weight norm $\|w\|$ and margin $\gamma$ participate in learning theory via the normalized margin $\gamma/\|w\|$. Since standard neural net optimizers do not control normalized margin, it is hard to test whether this quantity causally relates to generalization. This paper designs a series of experimental studies that explicitly control normalized margin and thereby tackle two central questions. First: does normalized margin always have a causal effect on generalization? The paper finds that no -- networks can be produced where normalized margin has seemingly no relationship with generalization, counter to the theory of Bartlett et al. (2017). Second: does normalized margin ever have a causal effect on generalization? The paper finds that yes -- in a standard training setup, test performance closely tracks normalized margin. The paper suggests a Gaussian process model as a promising explanation for this behavior.  ( 2 min )
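    For readers new to the quantity under study, a small sketch of normalized margin in the simplest case, a linear binary classifier on points it classifies correctly (illustrative data, not from the paper):

```python
# Normalized margin gamma / ||w|| for a linear classifier (sketch).
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)                  # weights of a linear model
X = rng.normal(size=(100, 5))
y = np.sign(X @ w)                      # labels the model gets right
gamma = np.min(y * (X @ w))             # margin: smallest signed score
print(gamma / np.linalg.norm(w))        # the normalized margin
```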
    Spatiotemporal Feature Learning Based on Two-Step LSTM and Transformer for CT Scans. (arXiv:2207.01579v2 [eess.IV] UPDATED)
    Computed tomography (CT) imaging can be very practical for diagnosing various diseases. However, the nature of CT images is quite diverse, since the resolution and number of slices of a CT scan are determined by the machine and its settings. Conventional deep learning models struggle with such diverse data, since an essential requirement of a deep neural network is a consistent shape for the input data. In this paper, we propose a novel, effective, two-step approach to tackle this issue for COVID-19 symptom classification. First, the semantic feature embedding of each slice of a CT scan is extracted by conventional backbone networks. Then, we propose a long short-term memory (LSTM)- and Transformer-based sub-network to handle temporal feature learning, leading to spatiotemporal feature representation learning. In this fashion, the proposed two-step LSTM model can prevent overfitting as well as increase performance. Comprehensive experiments reveal that the proposed two-step method not only shows excellent performance but that the two models also compensate for each other. More specifically, the two-step LSTM model has a lower false-negative rate, while the two-step Swin model has a lower false-positive rate. In summary, it is suggested that a model ensemble be adopted for more stable and promising performance in real-world applications.  ( 3 min )
    A Structured Span Selector. (arXiv:2205.03977v2 [cs.CL] UPDATED)
    Many natural language processing tasks, e.g., coreference resolution and semantic role labeling, require selecting text spans and making decisions about them. A typical approach to such tasks is to score all possible spans and greedily select spans for task-specific downstream processing. This approach, however, does not incorporate any inductive bias about what sort of spans ought to be selected, e.g., that selected spans tend to be syntactic constituents. In this paper, we propose a novel grammar-based structured span selection model which learns to make use of the partial span-level annotation provided for such problems. Compared to previous approaches, our approach gets rid of the heuristic greedy span selection scheme, allowing us to model the downstream task on an optimal set of spans. We evaluate our model on two popular span prediction tasks: coreference resolution and semantic role labeling. We show empirical improvements on both.  ( 2 min )
    Approximately Solving Mean Field Games via Entropy-Regularized Deep Reinforcement Learning. (arXiv:2102.01585v2 [cs.MA] UPDATED)
    The recent mean field game (MFG) formalism facilitates otherwise intractable computation of approximate Nash equilibria in many-agent settings. In this paper, we consider discrete-time finite MFGs subject to finite-horizon objectives. We show that all discrete-time finite MFGs with non-constant fixed point operators fail to be contractive as typically assumed in existing MFG literature, barring convergence via fixed point iteration. Instead, we incorporate entropy-regularization and Boltzmann policies into the fixed point iteration. As a result, we obtain provable convergence to approximate fixed points where existing methods fail, and reach the original goal of approximate Nash equilibria. All proposed methods are evaluated with respect to their exploitability, on both instructive examples with tractable exact solutions and high-dimensional problems where exact methods become intractable. In high-dimensional scenarios, we apply established deep reinforcement learning methods and empirically combine fictitious play with our approximations.  ( 2 min )
    Online Learning in Budget-Constrained Dynamic Colonel Blotto Games. (arXiv:2103.12833v3 [cs.LG] UPDATED)
    In this paper, we study the strategic allocation of limited resources using a Colonel Blotto game (CBG) under a dynamic setting and analyze the problem using an online learning approach. In this model, one of the players is the learner who has limited troops to allocate over a finite time horizon, and the other player is an adversary. At each stage, the learner plays a Colonel Blotto game with the adversary and strategically determines the distribution of troops among battlefields based on past observations. The adversary chooses its allocation strategy randomly from some fixed distribution that is unknown to the learner. The learner's objective is to minimize its regret, which is the difference between the payoff of the best mixed strategy and the realized payoff by following a learning algorithm while not violating the budget constraint. The learning in dynamic CBG is analyzed under the framework of combinatorial bandit and bandit with knapsacks. We first convert the budget-constrained dynamic CBG to a path planning problem on a directed graph. We then devise an efficient algorithm that combines a special combinatorial bandit algorithm Edge for the path planning problem and a bandit with knapsack algorithm LagrangeBwK to cope with the budget constraint. The theoretical analysis shows that the learner's regret is bounded by a term sublinear in time horizon and polynomial in other parameters. Finally, we justify our theoretical results by performing simulations for various scenarios.  ( 3 min )
    Distributed Saddle-Point Problems: Lower Bounds, Near-Optimal and Robust Algorithms. (arXiv:2010.13112v8 [cs.LG] UPDATED)
    This paper focuses on the distributed optimization of stochastic saddle point problems. The first part of the paper is devoted to lower bounds for centralized and decentralized distributed methods for smooth (strongly) convex-(strongly) concave saddle-point problems, as well as the near-optimal algorithms by which these bounds are achieved. Next, we present a new federated algorithm for centralized distributed saddle point problems - Extra Step Local SGD. Theoretical analysis of the new method is carried out for strongly convex-strongly concave and non-convex-non-concave problems. In the experimental part of the paper, we show the effectiveness of our method in practice. In particular, we train GANs in a distributed manner.  ( 2 min )
    Risk aversion in learning algorithms and recommendation systems. (arXiv:2205.04619v2 [cs.LG] UPDATED)
    Consider online learning algorithms that simultaneously make decisions and learn from feedback. Such algorithms are widely deployed in recommendation systems for products and digital content. This article exhibits a bias of online learning algorithms towards less risky alternatives, and how it shapes demand on recommendation systems. First, we consider $k$-armed bandits. We prove that $\varepsilon$-Greedy chooses a riskless arm over a risky arm of equal expected reward with probability arbitrarily close to one. This is a consequence of undersampling of arms with bad reward estimates. Through experiments, we show that other online learning algorithms exhibit risk aversion as well. In a recommendation system environment we show that content that yields less noisy reward from users is favored by the algorithm. Combined with equilibrium forces driving strategic content creators towards content of similar expected quality, the advantage for content that is not necessarily better, just less volatile, is exaggerated.  ( 2 min )
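    A small simulation sketch of the headline claim for $\varepsilon$-Greedy: two arms with equal expected reward, one riskless and one noisy, where undersampling after a bad estimate keeps the risky arm out of favor. Parameters are illustrative, not from the paper.

```python
# Epsilon-greedy with two arms of equal mean (0.5): arm 0 is riskless,
# arm 1 is noisy. Risk aversion shows up as the riskless arm being
# pulled more often in most runs.
import numpy as np

rng = np.random.default_rng(0)
eps, T, runs = 0.1, 5000, 200
riskless_preferred = 0

for _ in range(runs):
    counts = np.ones(2)                        # one initial pull per arm
    sums = np.array([0.5, rng.normal(0.5, 1.0)])
    for _ in range(T):
        if rng.random() < eps:
            a = rng.integers(2)                # explore
        else:
            a = int(np.argmax(sums / counts))  # exploit sample means
        r = 0.5 if a == 0 else rng.normal(0.5, 1.0)
        counts[a] += 1
        sums[a] += r
    riskless_preferred += counts[0] > counts[1]

print(f"riskless arm pulled more in {riskless_preferred}/{runs} runs")
```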
    Sparsity and Heterogeneous Dropout for Continual Learning in the Null Space of Neural Activations. (arXiv:2203.06514v2 [cs.LG] UPDATED)
    Continual/lifelong learning from a non-stationary input data stream is a cornerstone of intelligence. Despite their phenomenal performance in a wide variety of applications, deep neural networks are prone to forgetting their previously learned information upon learning new ones. This phenomenon is called "catastrophic forgetting" and is deeply rooted in the stability-plasticity dilemma. Overcoming catastrophic forgetting in deep neural networks has become an active field of research in recent years. In particular, gradient projection-based methods have recently shown exceptional performance at overcoming catastrophic forgetting. This paper proposes two biologically-inspired mechanisms based on sparsity and heterogeneous dropout that significantly increase a continual learner's performance over a long sequence of tasks. Our proposed approach builds on the Gradient Projection Memory (GPM) framework. We leverage k-winner activations in each layer of a neural network to enforce layer-wise sparse activations for each task, together with a between-task heterogeneous dropout that encourages the network to use non-overlapping activation patterns between different tasks. In addition, we introduce two new benchmarks for continual learning under distributional shift, namely Continual Swiss Roll and ImageNet SuperDog-40. Lastly, we provide an in-depth analysis of our proposed method and demonstrate a significant performance boost on various benchmark continual learning problems.  ( 3 min )
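    A sketch of the k-winner activation this line of work builds on: keep each sample's k strongest activations in a layer and zero the rest (assuming PyTorch; not the paper's exact implementation):

```python
# k-winner-take-all activation: keep the k largest values per row.
import torch

def k_winners(x, k):
    # threshold each row at its k-th largest activation
    kth = torch.topk(x, k, dim=1).values[:, -1].unsqueeze(1)
    return x * (x >= kth).float()

h = torch.randn(4, 10)
print(k_winners(h, k=3))  # at most 3 nonzeros per row (ties aside)
```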
    Bayesian Optimization Over Iterative Learners with Structured Responses: A Budget-aware Planning Approach. (arXiv:2206.12708v2 [cs.LG] UPDATED)
    The rising growth of deep neural networks (DNNs) and datasets in size motivates the need for efficient solutions for simultaneous model selection and training. Many methods for hyperparameter optimization (HPO) of iterative learners including DNNs attempt to solve this problem by querying and learning a response surface while searching for the optimum of that surface. However, many of these methods make myopic queries, do not consider prior knowledge about the response structure, and/or perform biased cost-aware search, all of which exacerbate identifying the best-performing model when a total cost budget is specified. This paper proposes a novel approach referred to as Budget-Aware Planning for Iterative Learners (BAPI) to solve HPO problems under a constrained cost budget. BAPI is an efficient non-myopic Bayesian optimization solution that accounts for the budget and leverages the prior knowledge about the objective function and cost function to select better configurations and to take more informed decisions during the evaluation (training). Experiments on diverse HPO benchmarks for iterative learners show that BAPI performs better than state-of-the-art baselines in most of the cases.  ( 2 min )
    Flexible Group Fairness Metrics for Survival Analysis. (arXiv:2206.03256v2 [cs.CY] UPDATED)
    Algorithmic fairness is an increasingly important field concerned with detecting and mitigating biases in machine learning models. There has been a wealth of literature on algorithmic fairness in regression and classification; however, there has been little exploration of the field for survival analysis. Survival analysis is the prediction task in which one attempts to predict the probability of an event occurring over time. Survival predictions are particularly important in sensitive settings, such as when utilising machine learning for diagnosis and prognosis of patients. In this paper we explore how to utilise existing survival metrics to measure bias with group fairness metrics. We explore this in an empirical experiment with 29 survival datasets and 8 measures. We find that measures of discrimination are able to capture bias well, whereas there is less clarity with measures of calibration and scoring rules. We suggest further areas for research, including prediction-based fairness metrics for distribution predictions.  ( 2 min )
    Data-driven Numerical Invariant Synthesis with Automatic Generation of Attributes. (arXiv:2205.14943v3 [cs.PL] UPDATED)
    We propose a data-driven algorithm for numerical invariant synthesis and verification. The algorithm is based on the ICE-DT schema for learning decision trees from samples of positive and negative states and implications corresponding to program transitions. The main issue we address is the discovery of relevant attributes to be used in the learning process of numerical invariants. We define a method for solving this problem guided by the data sample. It is based on the construction of a separator that covers positive states and excludes negative ones, consistent with the implications. The separator is constructed using an abstract domain representation of convex sets. The generalization mechanism of the decision tree learning from the constraints of the separator allows the inference of general invariants, accurate enough for proving the targeted property. We implemented our algorithm and showed its efficiency.  ( 2 min )
    Rich Feature Construction for the Optimization-Generalization Dilemma. (arXiv:2203.15516v2 [cs.LG] UPDATED)
    There often is a dilemma between ease of optimization and robust out-of-distribution (OoD) generalization. For instance, many OoD methods rely on penalty terms whose optimization is challenging. They are either too strong to optimize reliably or too weak to achieve their goals. We propose to initialize the networks with a rich representation containing a palette of potentially useful features, ready to be used by even simple models. On the one hand, a rich representation provides a good initialization for the optimizer. On the other hand, it also provides an inductive bias that helps OoD generalization. Such a representation is constructed with the Rich Feature Construction (RFC) algorithm, also called the Bonsai algorithm, which consists of a succession of training episodes. During discovery episodes, we craft a multi-objective optimization criterion and its associated datasets in a manner that prevents the network from using the features constructed in the previous iterations. During synthesis episodes, we use knowledge distillation to force the network to simultaneously represent all the previously discovered features. Initializing the networks with Bonsai representations consistently helps six OoD methods achieve top performance on ColoredMNIST benchmark. The same technique substantially outperforms comparable results on the Wilds Camelyon17 task, eliminates the high result variance that plagues other methods, and makes hyperparameter tuning and model selection more reliable.  ( 3 min )
    Your Policy Regularizer is Secretly an Adversary. (arXiv:2203.12592v4 [cs.LG] UPDATED)
    Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.  ( 2 min )
    Neuro-Inspired Deep Neural Networks with Sparse, Strong Activations. (arXiv:2202.13074v3 [cs.NE] UPDATED)
    While end-to-end training of Deep Neural Networks (DNNs) yields state of the art performance in an increasing array of applications, it does not provide insight into, or control over, the features being extracted. We report here on a promising neuro-inspired approach to DNNs with sparser and stronger activations. We use standard stochastic gradient training, supplementing the end-to-end discriminative cost function with layer-wise costs promoting Hebbian ("fire together," "wire together") updates for highly active neurons, and anti-Hebbian updates for the remaining neurons. Instead of batch norm, we use divisive normalization of activations (suppressing weak outputs using strong outputs), along with implicit $\ell_2$ normalization of neuronal weights. Experiments with standard image classification tasks on CIFAR-10 demonstrate that, relative to baseline end-to-end trained architectures, our proposed architecture (a) leads to sparser activations (with only a slight compromise on accuracy), (b) exhibits more robustness to noise (without being trained on noisy data), (c) exhibits more robustness to adversarial perturbations (without adversarial training).  ( 2 min )
    Standard Vs Uniform Binary Search and Their Variants in Learned Static Indexing: The Case of the Searching on Sorted Data Benchmarking Software Platform. (arXiv:2201.01554v2 [cs.DS] UPDATED)
    Learned Indexes are a novel approach to search in a sorted table. A model is used to predict an interval in which to search into and a Binary Search routine is used to finalize the search. They are quite effective. For the final stage, usually, the lower_bound routine of the Standard C++ library is used, although this is more of a natural choice rather than a requirement. However, recent studies, that do not use Machine Learning predictions, indicate that other implementations of Binary Search or variants, namely k-ary Search, are better suited to take advantage of the features offered by modern computer architectures. With the use of the Searching on Sorted Sets SOSD Learned Indexing benchmarking software, we investigate how to choose a Search routine for the final stage of searching in a Learned Index. Our results provide indications that better choices than the lower_bound routine can be made. We also highlight how such a choice may be dependent on the computer architecture that is to be used. Overall, our findings provide new and much-needed guidelines for the selection of the Search routine within the Learned Indexing framework.  ( 3 min )
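    To illustrate the routines being compared (a sketch, not the SOSD benchmark code): the standard lower_bound (bisect_left in Python) versus a simple k-ary search, both restricted to an interval [lo, hi) that a learned model might predict.

```python
# lower_bound vs k-ary search over a predicted interval (sketch).
import bisect

def lower_bound(a, lo, hi, x):
    # the usual final stage: C++ std::lower_bound, i.e. bisect_left
    return bisect.bisect_left(a, x, lo, hi)

def kary_lower_bound(a, lo, hi, x, k=4):
    # probe k-1 evenly spaced pivots per round instead of one midpoint
    while hi - lo > k:
        step = (hi - lo) // k
        prev, found = lo, False
        for p in range(lo + step, hi, step):
            if a[p] >= x:
                lo, hi, found = prev, p, True
                break
            prev = p
        if not found:
            lo = prev
    # finish with a short linear scan over the remaining interval
    while lo < hi and a[lo] < x:
        lo += 1
    return lo

a = sorted(range(0, 1000, 3))
for x in (0, 7, 299, 999, 2000):
    assert kary_lower_bound(a, 0, len(a), x) == lower_bound(a, 0, len(a), x)
```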
    Optimal sizing of a holdout set for safe predictive model updating. (arXiv:2202.06374v3 [stat.ML] UPDATED)
    Predictive risk scores are increasingly used to guide clinical or other interventions in complex settings, particularly healthcare. Directly updating a risk score used to guide interventions leads to biased risk estimates. We propose updating using a `holdout set' -- a subset of the population that does not receive risk-score-guided interventions -- to prevent this. Since samples in the holdout set do not benefit from risk predictions, its size must trade off performance of the updated risk score whilst minimising the number of held out samples. We prove that this approach outperforms simple alternatives, and by defining a general loss function describe conditions under which an optimal holdout size (OHS) can be readily identified. We introduce parametric and semi-parametric algorithms for OHS estimation and demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented means to safely update predictive risk scores.  ( 2 min )
    High Throughput Multi-Channel Parallelized Diffraction Convolutional Neural Network Accelerator. (arXiv:2112.12297v2 [cs.LG] UPDATED)
    Convolutional neural networks are paramount in image and signal processing, including the relevant classification and training tasks, and constitute the majority of machine learning compute demand today. With convolution operations being computationally intensive, next-generation hardware accelerators need to offer parallelization and algorithmic-hardware homomorphism. Fortunately, diffractive display optics is capable of million-channel parallel data processing at low latency; however, it has thus far only shown slow, tens-of-hertz single-image and single-kernel capability, thereby significantly underdelivering on its performance potential. Here, we demonstrate an operation-parallelized high-throughput Fourier optic convolutional neural network accelerator. For the first time, simultaneous processing of multiple kernels in the Fourier domain, enabled by optical diffraction, has been achieved alongside the input parallelism already conventional in the field. Additionally, we show an approximately one-hundred-fold system speedup over existing optical diffraction-based processors, and this demonstration rivals the performance of modern electronic solutions. Therefore, this system is capable of processing large-scale matrices about ten times faster than state-of-the-art electronic systems.  ( 2 min )
    Understanding Gradual Domain Adaptation: Improved Analysis, Optimal Path and Beyond. (arXiv:2204.08200v2 [cs.LG] UPDATED)
    The vast majority of existing algorithms for unsupervised domain adaptation (UDA) focus on adapting from a labeled source domain to an unlabeled target domain directly in a one-off way. Gradual domain adaptation (GDA), on the other hand, assumes a path of $(T-1)$ unlabeled intermediate domains bridging the source and target, and aims to provide better generalization in the target domain by leveraging the intermediate ones. Under certain assumptions, Kumar et al. (2020) proposed a simple algorithm, Gradual Self-Training, along with a generalization bound in the order of $e^{O(T)} \left(\varepsilon_0+O\left(\sqrt{log(T)/n}\right)\right)$ for the target domain error, where $\varepsilon_0$ is the source domain error and $n$ is the data size of each domain. Due to the exponential factor, this upper bound becomes vacuous when $T$ is only moderately large. In this work, we analyze gradual self-training under more general and relaxed assumptions, and prove a significantly improved generalization bound as $\varepsilon_0+ O \left(T\Delta + T/\sqrt{n}\right) + \widetilde{O}\left(1/\sqrt{nT}\right)$, where $\Delta$ is the average distributional distance between consecutive domains. Compared with the existing bound with an exponential dependency on $T$ as a multiplicative factor, our bound only depends on $T$ linearly and additively. Perhaps more interestingly, our result implies the existence of an optimal choice of $T$ that minimizes the generalization error, and it also naturally suggests an optimal way to construct the path of intermediate domains so as to minimize the accumulative path length $T\Delta$ between the source and target. To corroborate the implications of our theory, we examine gradual self-training on multiple semi-synthetic and real datasets, which confirms our findings. We believe our insights provide a path forward toward the design of future GDA algorithms.  ( 3 min )
    The Importance of Non-Markovianity in Maximum State Entropy Exploration. (arXiv:2202.03060v2 [cs.LG] UPDATED)
    In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. Especially, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way and how non-Markovian exploration could benefit the sample efficiency of online reinforcement learning in future works.  ( 2 min )
    Towards Effective and Robust Neural Trojan Defenses via Input Filtering. (arXiv:2202.12154v4 [cs.CR] UPDATED)
    Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a single input-agnostic trigger and targeting only one class to using multiple, input-specific triggers and targeting multiple classes. However, Trojan defenses have not caught up with this development. Most defense methods still make inadequate assumptions about Trojan triggers and target classes, thus, can be easily circumvented by modern Trojan attacks. To deal with this problem, we propose two novel "filtering" defenses called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF) which leverage lossy data compression and adversarial learning respectively to effectively purify potential Trojan triggers in the input at run time without making assumptions about the number of triggers/target classes or the input dependence property of triggers. In addition, we introduce a new defense mechanism called "Filtering-then-Contrasting" (FtC) which helps avoid the drop in classification accuracy on clean data caused by "filtering", and combine it with VIF/AIF to derive new defenses of this kind. Extensive experimental results and ablation studies show that our proposed defenses significantly outperform well-known baseline defenses in mitigating five advanced Trojan attacks including two recent state-of-the-art while being quite robust to small amounts of training data and large-norm triggers.  ( 3 min )
    Evaluating Causal Inference Methods. (arXiv:2202.04208v3 [stat.ME] UPDATED)
    The fundamental challenge of drawing causal inference is that counterfactual outcomes are not fully observed for any unit. Furthermore, in observational studies, treatment assignment is likely to be confounded. Many statistical methods have emerged for causal inference under unconfoundedness conditions given pre-treatment covariates, including propensity score-based methods, prognostic score-based methods, and doubly robust methods. Unfortunately for applied researchers, there is no `one-size-fits-all' causal method that can perform optimally universally. In practice, causal methods are primarily evaluated quantitatively on handcrafted simulated data. Such data-generative procedures can be of limited value because they are typically stylized models of reality. They are simplified for tractability and lack the complexities of real-world data. For applied researchers, it is critical to understand how well a method performs for the data at hand. Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods. The framework's novelty stems from its ability to generate synthetic data anchored at the empirical distribution for the observed sample, and therefore virtually indistinguishable from the latter. The approach allows the user to specify ground truth for the form and magnitude of causal effects and confounding bias as functions of covariates. Thus simulated data sets are used to evaluate the potential performance of various causal estimation methods when applied to data similar to the observed sample. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and two real-world data applications from Lalonde and Project STAR studies.  ( 3 min )
    Invariant Ancestry Search. (arXiv:2202.00913v2 [stat.ME] UPDATED)
    Recently, methods have been proposed that exploit the invariance of prediction models with respect to changing environments to infer subsets of the causal parents of a response variable. If the environments influence only few of the underlying mechanisms, the subset identified by invariant causal prediction (ICP), for example, may be small, or even empty. We introduce the concept of minimal invariance and propose invariant ancestry search (IAS). In its population version, IAS outputs a set which contains only ancestors of the response and is a superset of the output of ICP. When applied to data, corresponding guarantees hold asymptotically if the underlying test for invariance has asymptotic level and power. We develop scalable algorithms and perform experiments on simulated and real data.  ( 2 min )
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v2 [stat.ML] UPDATED)
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the used data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.  ( 2 min )
    GraphWorld: Fake Graphs Bring Real Insights for GNNs. (arXiv:2203.00112v2 [cs.LG] UPDATED)
    Despite advances in the field of Graph Neural Networks (GNNs), only a small number (~5) of datasets are currently used to evaluate new models. This continued reliance on a handful of datasets provides minimal insight into the performance differences between models, and is especially challenging for industrial practitioners who are likely to have datasets which look very different from those used as academic benchmarks. In the course of our work on GNN infrastructure and open-source software at Google, we have sought to develop improved benchmarks that are robust, tunable, scalable, and generalizable. In this work we introduce GraphWorld, a novel methodology and system for benchmarking GNN models on an arbitrarily-large population of synthetic graphs for any conceivable GNN task. GraphWorld allows a user to efficiently generate a world with millions of statistically diverse datasets. It is accessible, scalable, and easy to use. GraphWorld can be run on a single machine without specialized hardware, or it can be easily scaled up to run on arbitrary clusters or cloud frameworks. Using GraphWorld, a user has fine-grained control over graph generator parameters, and can benchmark arbitrary GNN models with built-in hyperparameter tuning. We present insights from GraphWorld experiments regarding the performance characteristics of tens of thousands of GNN models over millions of benchmark datasets. We further show that GraphWorld efficiently explores regions of benchmark dataset space uncovered by standard benchmarks, revealing comparisons between models that have not been historically obtainable. Using GraphWorld, we also are able to study in detail the relationship between graph properties and task performance metrics, which is nearly impossible with the classic collection of real-world benchmarks.  ( 3 min )
    Architecture Agnostic Federated Learning for Neural Networks. (arXiv:2202.07757v3 [cs.LG] UPDATED)
    With growing concerns regarding data privacy and a rapid increase in data volume, Federated Learning (FL) has become an important learning paradigm. However, jointly learning a deep neural network model in an FL setting proves to be a non-trivial task because of the complexities associated with neural networks, such as varied architectures across clients, permutation invariance of the neurons, and the presence of non-linear transformations in each layer. This work introduces a novel Federated Heterogeneous Neural Networks (FedHeNN) framework that allows each client to build a personalised model without enforcing a common architecture across clients. This allows each client to optimize with respect to local data and compute constraints, while still benefiting from the learnings of other (potentially more powerful) clients. The key idea of FedHeNN is to use the instance-level representations obtained from peer clients to guide the simultaneous training on each client. Extensive experimental results demonstrate that the FedHeNN framework is capable of learning better-performing models on clients in both the homogeneous and heterogeneous architecture settings.  ( 2 min )
    Multi-Task Learning as a Bargaining Game. (arXiv:2202.01017v2 [cs.LG] UPDATED)
    In multi-task learning (MTL), a joint model is trained to simultaneously make predictions for several tasks. Joint training reduces computation costs and improves data efficiency; however, since the gradients of these different tasks may conflict, training a joint model for MTL often yields lower performance than its corresponding single-task counterparts. A common method for alleviating this issue is to combine per-task gradients into a joint update direction using a particular heuristic. In this paper, we propose viewing the gradient combination step as a bargaining game, where tasks negotiate to reach an agreement on a joint direction of parameter update. Under certain assumptions, the bargaining problem has a unique solution, known as the Nash Bargaining Solution, which we propose to use as a principled approach to multi-task learning. We describe a new MTL optimization procedure, Nash-MTL, and derive theoretical guarantees for its convergence. Empirically, we show that Nash-MTL achieves state-of-the-art results on multiple MTL benchmarks in various domains.  ( 2 min )
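    As a rough illustration of the bargaining view: per the paper, the Nash solution for the task weights alpha is characterized by the condition that the Gram matrix of task gradients times alpha equals the elementwise reciprocal of alpha. The toy sketch below solves that condition with a damped fixed-point iteration on random data; the actual Nash-MTL solver uses a sequence of convex subproblems, so this iteration is only a plausible stand-in.

        import numpy as np

        def nash_weights(G, iters=500, eps=1e-8):
            # G: (num_tasks, num_params); rows are per-task gradients.
            K = G @ G.T                                # Gram matrix of task gradients
            alpha = np.ones(G.shape[0])
            for _ in range(iters):
                new = 1.0 / np.maximum(K @ alpha, eps) # optimality: K alpha = 1 / alpha
                alpha = 0.5 * alpha + 0.5 * new        # damped update for stability
            return alpha

        rng = np.random.default_rng(0)
        G = rng.normal(size=(3, 10))                   # 3 tasks, 10 shared parameters
        joint_update = nash_weights(G) @ G             # bargained joint update direction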
    TEA: A Sequential Recommendation Framework via Temporally Evolving Aggregations. (arXiv:2111.07378v2 [cs.IR] UPDATED)
    Sequential recommendation aims to choose the most suitable items for a user at a specific timestamp given historical behaviors. Existing methods usually model the user behavior sequence with transition-based methods such as Markov chains. However, these methods implicitly assume that users are independent of each other, without considering the influence between users. In fact, this influence plays an important role in sequential recommendation, since the behavior of a user is easily affected by others. Therefore, it is desirable to aggregate both user behaviors and the influence between users, both of which evolve temporally and are embedded in the heterogeneous graph of users and items. In this paper, we incorporate dynamic user-item heterogeneous graphs to propose a novel sequential recommendation framework. As a result, the historical behaviors as well as the influence between users can be taken into consideration. To achieve this, we first formalize sequential recommendation as the problem of estimating a conditional probability given temporal dynamic heterogeneous graphs and user behavior sequences. After that, we exploit a conditional random field to aggregate the heterogeneous graphs and user behaviors for probability estimation, and employ the pseudo-likelihood approach to derive a tractable objective function. Finally, we provide scalable and flexible implementations of the proposed framework. Experimental results on three real-world datasets not only demonstrate the effectiveness of our proposed method but also provide insightful discoveries on sequential recommendation.  ( 3 min )
    Test Sample Accuracy Scales with Training Sample Density in Neural Networks. (arXiv:2106.08365v6 [cs.LG] UPDATED)
    Intuitively, one would expect accuracy of a trained neural network's prediction on test samples to correlate with how densely the samples are surrounded by seen training samples in representation space. We find that a bound on empirical training error smoothed across linear activation regions scales inversely with training sample density in representation space. Empirically, we verify this bound is a strong predictor of the inaccuracy of the network's prediction on test samples. For unseen test sets, including those with out-of-distribution samples, ranking test samples by their local region's error bound and discarding samples with the highest bounds raises prediction accuracy by up to 20% in absolute terms for image classification datasets, on average over thresholds.  ( 2 min )
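    To make the filtering experiment concrete, here is a simplified sketch: instead of the paper's smoothed error bound over linear activation regions, it uses mean k-nearest-neighbor distance in representation space as a crude inverse-density proxy, then discards the test points sitting in the sparsest regions. All names and thresholds are illustrative.

        import numpy as np

        def keep_dense_test_points(train_reps, test_reps, discard_frac=0.2, k=5):
            # Pairwise distances from each test representation to all training ones.
            d = np.linalg.norm(test_reps[:, None, :] - train_reps[None, :, :], axis=-1)
            knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)  # mean distance to k nearest
            cutoff = np.quantile(knn_dist, 1.0 - discard_frac)
            return knn_dist <= cutoff                          # mask of retained test samples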
    A methodology for training homomorphic-encryption-friendly neural networks. (arXiv:2111.03362v3 [cs.CR] UPDATED)
    Privacy-preserving deep neural network (DNN) inference is a necessity in different regulated industries such as healthcare, finance and retail. Recently, homomorphic encryption (HE) has been used as a method to enable analytics while addressing privacy concerns. HE enables secure predictions over encrypted data. However, there are several challenges related to the use of HE, including DNN size limitations and the lack of support for some operation types. Most notably, the commonly used ReLU activation is not supported under some HE schemes. We propose a structured methodology to replace ReLU with a quadratic polynomial activation. To address the accuracy degradation issue, we use a pre-trained model to guide the training of an HE-friendly model, using techniques such as trainable activation functions and knowledge distillation. We demonstrate our methodology on the AlexNet architecture, using the chest X-Ray and CT datasets for COVID-19 detection. Experiments using our approach narrowed the gap in F1 score and accuracy between models trained with ReLU and the HE-friendly model to a mere 0.32-5.3 percent degradation. We also demonstrate our methodology using the SqueezeNet architecture, for which we observed 7 percent accuracy and F1 improvements over training similar networks with other HE-friendly training methods.  ( 3 min )
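    The central substitution is easy to picture: HE schemes evaluate additions and multiplications on ciphertexts but not ReLU's comparison, so the activation becomes a low-degree polynomial with trainable coefficients. A minimal PyTorch sketch follows; the coefficient initialisation and the module name QuadAct are illustrative, not the paper's.

        import torch
        import torch.nn as nn

        class QuadAct(nn.Module):
            # Trainable quadratic activation a*x^2 + b*x + c: only additions and
            # multiplications, hence evaluable under homomorphic encryption.
            def __init__(self):
                super().__init__()
                self.a = nn.Parameter(torch.tensor(0.1))
                self.b = nn.Parameter(torch.tensor(0.5))
                self.c = nn.Parameter(torch.tensor(0.0))

            def forward(self, x):
                return self.a * x * x + self.b * x + self.c

        # Usage: swap every nn.ReLU() in a pretrained network for QuadAct(), then
        # fine-tune, optionally distilling from the original ReLU teacher as the
        # methodology above suggests.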
    DeepSplit: Scalable Verification of Deep Neural Networks via Operator Splitting. (arXiv:2106.09117v3 [cs.LG] UPDATED)
    Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem, for which several past works have proposed convex relaxations as a promising alternative. However, even for reasonably-sized neural networks, these relaxations are not tractable, and so must be replaced by even weaker relaxations in practice. In this work, we propose a novel operator splitting method that can directly solve a convex relaxation of the problem to high accuracy, by splitting it into smaller sub-problems that often have analytical solutions. The method is modular, scales to very large problem instances, and comprises operations that are amenable to fast parallelization with GPU acceleration. We demonstrate our method in bounding the worst-case performance of large convolutional networks in image classification and reinforcement learning settings, and in reachability analysis of neural network dynamical systems.  ( 2 min )
    Learning from Guided Play: A Scheduled Hierarchical Approach for Improving Exploration in Adversarial Imitation Learning. (arXiv:2112.08932v2 [cs.LG] UPDATED)
    Effective exploration continues to be a significant challenge that prevents the deployment of reinforcement learning for many physical systems. This is particularly true for systems with continuous and high-dimensional state and action spaces, such as robotic manipulators. The challenge is accentuated in the sparse rewards setting, where the low-level state information required for the design of dense rewards is unavailable. Adversarial imitation learning (AIL) can partially overcome this barrier by leveraging expert-generated demonstrations of optimal behaviour and providing, essentially, a replacement for dense reward information. Unfortunately, the availability of expert demonstrations does not necessarily improve an agent's capability to explore effectively and, as we empirically show, can lead to inefficient or stagnated learning. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of, in addition to a main task, multiple auxiliary tasks. Subsequently, a hierarchical model is used to learn each task reward and policy through a modified AIL procedure, in which exploration of all tasks is enforced via a scheduler composing different tasks together. This affords many benefits: learning efficiency is improved for main tasks with challenging bottleneck transitions, expert data becomes reusable between tasks, and transfer learning through the reuse of learned auxiliary task models becomes possible. Our experimental results in a challenging multitask robotic manipulation domain indicate that our method compares favourably to supervised imitation learning and to a state-of-the-art AIL method. Code is available at https://github.com/utiasSTARS/lfgp.  ( 3 min )
    Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities. (arXiv:2111.08851v3 [cs.LG] UPDATED)
    In recent times, deep neural networks have achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this. One such method is the CORAL method, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while earlier experiments showed that CORAL's rank consistency is beneficial for performance, it is limited by a weight-sharing constraint in a neural network's fully connected output layer. We propose a new method for rank-consistent ordinal regression without this limitation. Our rank-consistent ordinal regression framework (CORN) achieves rank consistency by a novel training scheme. This training scheme uses conditional training sets to obtain the unconditional rank probabilities through applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method to utilize the ordinal target information, and the absence of the weight-sharing restriction improves the performance substantially compared to the CORAL reference approach.  ( 3 min )
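    The chain-rule step lends itself to a short sketch. Under the CORN scheme as described above, the network emits K-1 conditional logits; sigmoids turn them into conditional probabilities P(y > r_k | y > r_{k-1}), and a cumulative product recovers the unconditional rank probabilities. The inference helper below assumes that output convention; the training-side conditional subsets are omitted.

        import torch

        def corn_predict(logits):
            # logits: (batch, K-1) conditional logits -> integer rank labels.
            cond = torch.sigmoid(logits)          # P(y > r_k | y > r_{k-1})
            uncond = torch.cumprod(cond, dim=1)   # chain rule gives P(y > r_k)
            return (uncond > 0.5).sum(dim=1)      # predicted rank in {0, ..., K-1}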
    Supervising the Decoder of Variational Autoencoders to Improve Scientific Utility. (arXiv:2109.04561v3 [stat.ML] UPDATED)
    Probabilistic generative models are attractive for scientific modeling because their inferred parameters can be used to generate hypotheses and design experiments. This requires that the learned model provide an accurate representation of the input data and yield a latent space that effectively predicts outcomes relevant to the scientific question. Supervised Variational Autoencoders (SVAEs) have previously been used for this purpose, where a carefully designed decoder can be used as an interpretable generative model while the supervised objective ensures a predictive latent representation. Unfortunately, the supervised objective forces the encoder to learn a biased approximation to the generative posterior distribution, which renders the generative parameters unreliable when used in scientific models. This issue has remained undetected as reconstruction losses commonly used to evaluate model performance do not detect bias in the encoder. We address this previously-unreported issue by developing a second order supervision framework (SOS-VAE) that influences the decoder to induce a predictive latent representation. This ensures that the associated encoder maintains a reliable generative interpretation. We extend this technique to allow the user to trade-off some bias in the generative parameters for improved predictive performance, acting as an intermediate option between SVAEs and our new SOS-VAE. We also use this methodology to address missing data issues that often arise when combining recordings from multiple scientific experiments. We demonstrate the effectiveness of these developments using synthetic data and electrophysiological recordings with an emphasis on how our learned representations can be used to design scientific experiments.  ( 3 min )
    On Improving the Performance of Glitch Classification for Gravitational Wave Detection by using Generative Adversarial Networks. (arXiv:2207.04001v1 [astro-ph.HE])
    Spectrogram classification plays an important role in analyzing gravitational wave data. In this paper, we propose a framework to improve the classification performance by using Generative Adversarial Networks (GANs). As substantial efforts and expertise are required to annotate spectrograms, the number of training examples is very limited. However, it is well known that deep networks can perform well only when the sample size of the training set is sufficiently large. Furthermore, the imbalanced sample sizes in different classes can also hamper the performance. In order to tackle these problems, we propose a GAN-based data augmentation framework. While standard data augmentation methods for conventional images cannot be applied on spectrograms, we found that a variant of GANs, ProGAN, is capable of generating high-resolution spectrograms which are consistent with the quality of the high-resolution original images and provide a desirable diversity. We have validated our framework by classifying glitches in the Gravity Spy dataset with the GAN-generated spectrograms for training. We show that the proposed method can provide an alternative to transfer learning for the classification of spectrograms using deep networks, i.e. using a high-resolution GAN for data augmentation instead. Furthermore, fluctuations in classification performance with small sample sizes for training and evaluation can be greatly reduced. Using the trained network in our framework, we have also examined the spectrograms with label anomalies in Gravity Spy.  ( 3 min )
    BF++: a language for general-purpose program synthesis. (arXiv:2101.09571v6 [cs.AI] UPDATED)
    Most state-of-the-art decision systems based on Reinforcement Learning (RL) are data-driven black-box neural models, where it is often difficult to incorporate expert knowledge into the models or let experts review and validate the learned decision mechanisms. Knowledge insertion and model review are important requirements in many applications involving human health and safety. One way to bridge the gap between data- and knowledge-driven systems is program synthesis: replacing a neural network that outputs decisions with a symbolic program generated by a neural network or by means of genetic programming. We propose a new programming language, BF++, designed specifically for automatic programming of agents in a Partially Observable Markov Decision Process (POMDP) setting, and apply neural program synthesis to solve standard OpenAI Gym benchmarks.  ( 2 min )
    Fair Exploration via Axiomatic Bargaining. (arXiv:2106.02553v2 [cs.LG] UPDATED)
    Exploration is often necessary in online learning to maximize long-term reward, but it comes at the cost of short-term 'regret'. We study how this cost of exploration is shared across multiple groups. For example, in a clinical trial setting, patients who are assigned a sub-optimal treatment effectively incur the cost of exploration. When patients are associated with natural groups on the basis of, say, race or age, it is natural to ask whether the cost of exploration borne by any single group is 'fair'. So motivated, we introduce the 'grouped' bandit model. We leverage the theory of axiomatic bargaining, and the Nash bargaining solution in particular, to formalize what might constitute a fair division of the cost of exploration across groups. On the one hand, we show that any regret-optimal policy strikingly results in the least fair outcome: such policies will perversely leverage the most 'disadvantaged' groups when they can. More constructively, we derive policies that are optimally fair and simultaneously enjoy a small 'price of fairness'. We illustrate the relative merits of our algorithmic framework with a case study on contextual bandits for warfarin dosing where we are concerned with the cost of exploration across multiple races and age groups.  ( 3 min )
    Seeing All the Angles: Learning Multiview Manipulation Policies for Contact-Rich Tasks from Demonstrations. (arXiv:2104.13907v3 [cs.RO] UPDATED)
    Learned visuomotor policies have shown considerable success as an alternative to traditional, hand-crafted frameworks for robotic manipulation. Surprisingly, an extension of these methods to the multiview domain is relatively unexplored. A successful multiview policy could be deployed on a mobile manipulation platform, allowing the robot to complete a task regardless of its view of the scene. In this work, we demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints. We illustrate the general applicability of the method by learning to complete several challenging multi-stage and contact-rich tasks, from numerous viewpoints, both in a simulated environment and on a real mobile manipulation platform. Furthermore, we analyze our policies to determine the benefits of learning from multiview data compared to learning with data collected from a fixed perspective. We show that learning from multiview data results in little, if any, penalty to performance for a fixed-view task compared to learning with an equivalent amount of fixed-view data. Finally, we examine the visual features learned by the multiview and fixed-view policies. Our results indicate that multiview policies implicitly learn to identify spatially correlated features.  ( 3 min )
    Greedy Bayesian Posterior Approximation with Deep Ensembles. (arXiv:2105.14275v4 [cs.LG] UPDATED)
    Ensembles of independently trained neural networks are a state-of-the-art approach to estimate predictive uncertainty in Deep Learning, and can be interpreted as an approximation of the posterior distribution via a mixture of delta functions. The training of ensembles relies on non-convexity of the loss landscape and random initialization of their individual members, making the resulting posterior approximation uncontrolled. This paper proposes a novel and principled method to tackle this limitation, minimizing an $f$-divergence between the true posterior and a kernel density estimator (KDE) in a function space. We analyze this objective from a combinatorial point of view, and show that it is submodular with respect to mixture components for any $f$. Subsequently, we consider the problem of greedy ensemble construction. From the marginal gain on the negative $f$-divergence, which quantifies an improvement in posterior approximation yielded by adding a new component into the KDE, we derive a novel diversity term for ensemble methods. The performance of our approach is demonstrated on computer vision out-of-distribution detection benchmarks in a range of architectures trained on multiple datasets. The source code of our method is made publicly available at https://github.com/Oulu-IMEDS/greedy_ensembles_training.  ( 3 min )
    Feature Selection Methods for Uplift Modeling and Heterogeneous Treatment Effect. (arXiv:2005.03447v2 [cs.LG] UPDATED)
    Uplift modeling is a causal learning technique that estimates subgroup-level treatment effects. It is commonly used in industry and elsewhere for tasks such as targeting ads. In a typical setting, uplift models can take thousands of features as inputs, which is costly and results in problems such as overfitting and poor model interpretability. Consequently, there is a need to select a subset of the most important features for modeling. However, traditional methods for feature selection are not fit for the task because they are designed for standard machine learning models whose prediction targets differ in important ways from those of uplift models. To address this, we introduce a set of feature selection methods explicitly designed for uplift modeling, drawing inspiration from statistics and information theory. We conduct empirical evaluations of the proposed methods on publicly available datasets, demonstrating their advantages compared to traditional feature selection. We make the proposed methods publicly available as a part of the CausalML open-source package.  ( 2 min )
    Unpaired Single-Image Depth Synthesis with cycle-consistent Wasserstein GANs. (arXiv:2103.16938v3 [cs.CV] UPDATED)
    Real-time estimation of actual environment depth is an essential module for various autonomous system tasks such as localization, obstacle detection and pose estimation. During the last decade of machine learning, extensive deployment of deep learning methods to computer vision tasks has yielded successful approaches for realistic depth synthesis out of a simple RGB modality. While most of these models rest on paired depth data or the availability of video sequences and stereo images, there is a lack of methods tackling single-image depth synthesis in an unsupervised manner. Therefore, in this study, the latest advancements in the field of generative neural networks are leveraged for fully unsupervised single-image depth synthesis. To be more exact, two cycle-consistent generators for RGB-to-depth and depth-to-RGB transfer are implemented and simultaneously optimized using the Wasserstein-1 distance. To ensure the plausibility of the proposed method, we apply the models to a self-acquired industrial dataset as well as to the renowned NYU Depth v2 dataset, which allows comparison with existing approaches. The observed success in this study suggests high potential for unpaired single-image depth estimation in real-world applications.  ( 3 min )
    Combining Machine Learning and Effective Feature Selection for Real-time Stock Trading in Variable Time-frames. (arXiv:2107.13148v2 [q-fin.TR] UPDATED)
    The unpredictability and volatility of the stock market render it challenging to make a substantial profit using any generalised scheme. Many previous studies tried different techniques to build a machine learning model which can make a significant profit in the US stock market by performing live trading. However, very few studies have focused on the importance of finding the best features for a particular trading period. Our top approach used performance-based filtering to narrow down the features from a total of 148 to about 30. Furthermore, the top 25 features were dynamically selected before each training of our machine learning model. It uses ensemble learning with four classifiers: Gaussian Naive Bayes, Decision Tree, Logistic Regression with L1 regularization, and Stochastic Gradient Descent, to decide whether to go long or short on a particular stock. Our best model performed daily trades between July 2011 and January 2019, generating 54.35% profit. Finally, our work showcases that mixtures of weighted classifiers perform better than any individual predictor when making trading decisions in the stock market.  ( 3 min )
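    A skeleton of the ensemble described above is straightforward to assemble with scikit-learn; this sketch omits the dynamic feature selection and trading logic, and the hyperparameters are placeholders. Soft voting requires probability estimates, hence the log-loss variant of SGD.

        from sklearn.ensemble import VotingClassifier
        from sklearn.naive_bayes import GaussianNB
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.linear_model import LogisticRegression, SGDClassifier

        ensemble = VotingClassifier(
            estimators=[
                ("nb", GaussianNB()),
                ("dt", DecisionTreeClassifier(max_depth=5)),
                ("lr", LogisticRegression(penalty="l1", solver="liblinear")),
                ("sgd", SGDClassifier(loss="log_loss")),  # log loss so predict_proba exists
            ],
            voting="soft",
            # weights=[...] would give the weighted mixture the abstract mentions
        )
        # ensemble.fit(X_train, y_long_or_short); ensemble.predict(X_today)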
    Layer Adaptive Node Selection in Bayesian Neural Networks: Statistical Guarantees and Implementation Details. (arXiv:2108.11000v2 [stat.ML] UPDATED)
    Sparse deep neural networks have proven to be efficient for predictive model building in large-scale studies. Although several works have studied theoretical and numerical properties of sparse neural architectures, they have primarily focused on edge selection. Sparsity through edge selection might be intuitively appealing; however, it does not necessarily reduce the structural complexity of a network. Instead, pruning excess nodes leads to a structurally sparse network with significant computational speedup during inference. To this end, we propose a Bayesian sparse solution using spike-and-slab Gaussian priors to allow for automatic node selection during training. The use of a spike-and-slab prior alleviates the need for an ad-hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of traditional Markov Chain Monte Carlo (MCMC) implementations. In the context of node selection, we establish the fundamental result of variational posterior consistency together with the characterization of prior parameters. In contrast to previous works, our theoretical development relaxes the assumptions of an equal number of nodes and uniform bounds on all network weights, thereby accommodating sparse networks with layer-dependent node structures or coefficient bounds. With a layer-wise characterization of prior inclusion probabilities, we discuss the optimal contraction rates of the variational posterior. We empirically demonstrate that our proposed approach outperforms the edge selection method in computational complexity with similar or better predictive performance. Our experimental evidence further substantiates that our theoretical work facilitates layer-wise optimal node recovery.  ( 3 min )
    Neighbors From Hell: Voltage Attacks Against Deep Learning Accelerators on Multi-Tenant FPGAs. (arXiv:2012.07242v2 [cs.CR] UPDATED)
    Field-programmable gate arrays (FPGAs) are becoming widely used accelerators for a myriad of datacenter applications due to their flexibility and energy efficiency. Among these applications, FPGAs have shown promising results in accelerating low-latency real-time deep learning (DL) inference, which is becoming an indispensable component of many end-user applications. With the emerging research direction towards virtualized cloud FPGAs that can be shared by multiple users, the security aspect of FPGA-based DL accelerators requires careful consideration. In this work, we evaluate the security of DL accelerators against voltage-based integrity attacks in a multitenant FPGA scenario. We first demonstrate the feasibility of such attacks on a state-of-the-art Stratix 10 card using different attacker circuits that are logically and physically isolated in a separate attacker role, and cannot be flagged as malicious circuits by conventional bitstream checkers. We show that aggressive clock gating, an effective power-saving technique, can also be a potential security threat in modern FPGAs. Then, we carry out the attack on a DL accelerator running ImageNet classification in the victim role to evaluate the inherent resilience of DL models against timing faults induced by the adversary. We find that even when using the strongest attacker circuit, the prediction accuracy of the DL accelerator is not compromised when running at its safe operating frequency. Furthermore, we can achieve 1.18-1.31x higher inference performance by over-clocking the DL accelerator without affecting its prediction accuracy.  ( 3 min )
    On the representation and learning of monotone triangular transport maps. (arXiv:2009.10303v2 [stat.ML] UPDATED)
    Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps, approximations of the Knothe-Rosenblatt (KR) rearrangement, are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.  ( 3 min )
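    For readers unfamiliar with the KR rearrangement, the triangular structure the abstract refers to is worth writing out; the display below is a standard textbook form, not notation from the paper itself.

        % A monotone triangular map S pushes the target distribution forward to a
        % reference (e.g. a standard Gaussian); component k depends only on the
        % first k coordinates and is increasing in the last one.
        \[
        S(x) =
        \begin{pmatrix}
        S_1(x_1) \\
        S_2(x_1, x_2) \\
        \vdots \\
        S_d(x_1, \dots, x_d)
        \end{pmatrix},
        \qquad
        \frac{\partial S_k}{\partial x_k} > 0, \quad k = 1, \dots, d.
        \]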
    Interlocking Backpropagation: Improving depthwise model-parallelism. (arXiv:2010.04116v3 [cs.LG] UPDATED)
    The number of parameters in state-of-the-art neural networks has drastically increased in recent years. This surge of interest in large-scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism can suffer from poor resource utilisation, which leads to wasted resources. In this work, we improve upon recent developments in an idealised model-parallel optimisation setting: local learning. Motivated by poor resource utilisation in the global setting and poor task performance in the local setting, we introduce a class of intermediary strategies between local and global learning referred to as interlocking backpropagation. These strategies preserve many of the compute-efficiency advantages of local optimisation, while recovering much of the task performance achieved by global optimisation. We assess our strategies on both image classification ResNets and Transformer language models, finding that our strategy consistently outperforms local learning in terms of task performance, and outperforms global learning in training efficiency.  ( 2 min )
    Bayesian Quantile and Expectile Optimisation. (arXiv:2001.04833v2 [stat.ML] UPDATED)
    Bayesian optimisation (BO) is widely used to optimise stochastic black box functions. While most BO approaches focus on optimising conditional expectations, many applications require risk-averse strategies, and alternative criteria accounting for the distribution tails need to be considered. In this paper, we propose new variational models for Bayesian quantile and expectile regression that are well-suited for heteroscedastic noise settings. Our models consist of two latent Gaussian processes accounting respectively for the conditional quantile (or expectile) and the scale parameter of an asymmetric likelihood function. Furthermore, we propose two BO strategies based on max-value entropy search and Thompson sampling, that are tailored to such models and that can accommodate large batches of points. Contrary to existing BO approaches for risk-averse optimisation, our strategies can directly optimise for the quantile and expectile, without requiring replicated observations or assuming a parametric form for the noise. As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art in the heteroscedastic, non-Gaussian case.  ( 2 min )
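    The asymmetric criteria at the heart of quantile and expectile regression are compact enough to state directly. The sketch below shows the standard pinball and asymmetric-squared losses; the paper embeds these in latent-GP likelihoods rather than using raw losses, so this is background, not the paper's model.

        import numpy as np

        def pinball_loss(y, pred, tau):
            # Quantile (pinball) loss: linear, asymmetric around the residual sign.
            r = y - pred
            return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

        def expectile_loss(y, pred, tau):
            # Expectile loss: asymmetrically weighted squared error.
            r = y - pred
            w = np.where(r >= 0, tau, 1.0 - tau)
            return np.mean(w * r ** 2)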
    ElectroLens: Understanding Atomistic Simulations Through Spatially-resolved Visualization of High-dimensional Features. (arXiv:1908.08381v3 [cs.HC] UPDATED)
    In recent years, machine learning (ML) has gained significant popularity in the field of chemical informatics and electronic structure theory. These techniques often require researchers to engineer abstract "features" that encode chemical concepts into a mathematical form compatible with the input to machine-learning models. However, there is no existing tool to connect these abstract features back to the actual chemical system, making it difficult to diagnose failures and to build intuition about the meaning of the features. We present ElectroLens, a new visualization tool for high-dimensional spatially-resolved features to tackle this problem. The tool visualizes high-dimensional data sets for atomistic and electron environment features by a series of linked 3D views and 2D plots. The tool is able to connect different derived features and their corresponding regions in 3D via interactive selection. It is built to be scalable and to integrate with existing infrastructure.  ( 2 min )
    The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications. (arXiv:2207.04043v1 [cs.CL])
    Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.  ( 3 min )
    MACFE: A Meta-learning and Causality Based Feature Engineering Framework. (arXiv:2207.04010v1 [cs.LG])
    Feature engineering has become one of the most important steps to improve model prediction performance and to produce quality datasets. However, this process requires non-trivial domain knowledge and is time-consuming. Thereby, automating this process has become an active area of research and of interest in industrial applications. In this paper, a novel method, called Meta-learning and Causality Based Feature Engineering (MACFE), is proposed; our method is based on the use of meta-learning, feature distribution encoding, and causality feature selection. In MACFE, meta-learning is used to find the best transformations, and the search is then accelerated by pre-selecting "original" features given their causal relevance. Experimental evaluations on popular classification datasets show that MACFE can improve the prediction performance across eight classifiers, outperforms the current state-of-the-art methods on average by at least 6.54%, and obtains an improvement of 2.71% over the best previous works.  ( 2 min )
    Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent. (arXiv:2207.04036v1 [cs.LG])
    As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related Legendre function. Conversely, continuous mirror descent with any Legendre function can be viewed as gradient flow with a related commuting parametrization. The latter result relies upon Nash's embedding theorem.  ( 2 min )
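    The equivalence the abstract states can be summarized informally in one display; the form below is a standard way of writing it and suppresses the technical conditions (commuting parametrization, Legendre-function regularity) that the paper makes precise.

        % Gradient flow on the reparametrized model x = g(w) versus mirror descent
        % on x with mirror map (Legendre function) R:
        \[
        \dot{w}(t) = -\nabla_w L\big(g(w(t))\big)
        \quad\Longleftrightarrow\quad
        \frac{d}{dt}\,\nabla R\big(x(t)\big) = -\nabla_x L\big(x(t)\big),
        \qquad x(t) = g(w(t)).
        \]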
    Communication Acceleration of Local Gradient Methods via an Accelerated Primal-Dual Algorithm with Inexact Prox. (arXiv:2207.03957v1 [cs.LG])
    Inspired by a recent breakthrough of Mishchenko et al. (2022), who for the first time showed that local gradient steps can lead to provable communication acceleration, we propose an alternative algorithm which obtains the same communication acceleration as their method (ProxSkip). Our approach is very different, however: it is based on the celebrated method of Chambolle and Pock (2011), with several nontrivial modifications: i) we allow for an inexact computation of the prox operator of a certain smooth strongly convex function via a suitable gradient-based method (e.g., GD, Fast GD or FSFOM), ii) we perform a careful modification of the dual update step in order to retain linear convergence. Our general results offer the new state-of-the-art rates for the class of strongly convex-concave saddle-point problems with bilinear coupling characterized by the absence of smoothness in the dual function. When applied to federated learning, we obtain a theoretically better alternative to ProxSkip: our method requires fewer local steps ($O(\kappa^{1/3})$ or $O(\kappa^{1/4})$, compared to $O(\kappa^{1/2})$ of ProxSkip), and performs a deterministic number of local steps instead. Like ProxSkip, our method can be applied to optimization over a connected network, and we obtain theoretical improvements here as well.  ( 3 min )
    Predicting Opinion Dynamics via Sociologically-Informed Neural Networks. (arXiv:2207.03990v1 [cs.SI])
    Opinion formation and propagation are crucial phenomena in social networks and have been extensively studied across several disciplines. Traditionally, theoretical models of opinion dynamics have been proposed to describe the interactions between individuals (i.e., social interaction) and their impact on the evolution of collective opinions. Although these models can incorporate sociological and psychological knowledge on the mechanisms of social interaction, they demand extensive calibration with real data to make reliable predictions, requiring much time and effort. Recently, the widespread use of social media platforms provides new paradigms to learn deep learning models from a large volume of social media data. However, these methods ignore any scientific knowledge about the mechanism of social interaction. In this work, we present the first hybrid method called Sociologically-Informed Neural Network (SINN), which integrates theoretical models and social media data by transporting the concepts of physics-informed neural networks (PINNs) from natural science (i.e., physics) into social science (i.e., sociology and social psychology). In particular, we recast theoretical models as ordinary differential equations (ODEs). Then we train a neural network that simultaneously approximates the data and conforms to the ODEs that represent the social scientific knowledge. In addition, we extend PINNs by integrating matrix factorization and a language model to incorporate rich side information (e.g., user profiles) and structural knowledge (e.g., cluster structure of the social interaction network). Moreover, we develop an end-to-end training procedure for SINN, which involves Gumbel-Softmax approximation to include stochastic mechanisms of social interaction. Extensive experiments on real-world and synthetic datasets show SINN outperforms six baseline methods in predicting opinion dynamics.  ( 3 min )
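    The physics-informed core of SINN, fitting observed opinions while penalising deviation from a theory ODE, can be sketched compactly. Everything below is a toy: the network, the stand-in dynamics f(x) = -x, and the single scalar opinion are placeholders for the paper's richer architecture (matrix factorization, language model, Gumbel-Softmax).

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

        def f(x):
            # Toy opinion dynamics: exponential decay toward consensus at 0.
            return -x

        def sinn_loss(t_data, x_data, t_coll):
            # t_data, x_data: (N, 1) observed times and opinions; t_coll: (M, 1).
            data_loss = ((net(t_data) - x_data) ** 2).mean()
            t = t_coll.requires_grad_(True)
            x = net(t)
            dxdt = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
            ode_loss = ((dxdt - f(x)) ** 2).mean()   # residual of the theory ODE
            return data_loss + ode_loss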
    A law of adversarial risk, interpolation, and label noise. (arXiv:2207.03933v1 [stat.ML])
    In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy under many circumstances. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the dependence between label noise and adversarial risk in terms of the data distribution. Our results are almost sharp without accounting for the inductive bias of the learning algorithm. We also show that inductive bias makes the effect of label noise much stronger.  ( 2 min )
    Active Learning-based Isolation Forest (ALIF): Enhancing Anomaly Detection in Decision Support Systems. (arXiv:2207.03934v1 [cs.LG])
    The detection of anomalous behaviours is an emerging need in many applications, particularly in contexts where security and reliability are critical aspects. While the definition of anomaly strictly depends on the domain framework, it is often impractical or too time-consuming to obtain a fully labelled dataset. The use of unsupervised models to overcome the lack of labels often fails to catch domain-specific anomalies, as such models rely on general definitions of outlier. This paper suggests a new active learning based approach, ALIF, to solve this problem by reducing the number of required labels and tuning the detector towards the definition of anomaly provided by the user. The proposed approach is particularly appealing in the presence of a Decision Support System (DSS), a case that is increasingly popular in real-world scenarios. While it is common for DSSs embedded with anomaly detection capabilities to rely on unsupervised models, they have no way to improve their performance: ALIF is able to enhance the capabilities of a DSS by exploiting the user feedback during common operations. ALIF is a lightweight modification of the popular Isolation Forest that showed superior performance with respect to other state-of-the-art algorithms on a multitude of real anomaly detection datasets.  ( 2 min )
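    The query-feedback cycle ALIF relies on can be pictured with standard tools, though the sketch below is not the paper's algorithm: it pairs scikit-learn's stock IsolationForest with a crude sample-reweighting rule, and ask_user is a hypothetical oracle standing in for the DSS operator's feedback.

        import numpy as np
        from sklearn.ensemble import IsolationForest

        def active_anomaly_loop(X, ask_user, rounds=5, batch=10):
            weights = np.ones(len(X))
            forest = None
            for _ in range(rounds):
                forest = IsolationForest(random_state=0).fit(X, sample_weight=weights)
                scores = -forest.score_samples(X)        # higher = more anomalous
                queries = np.argsort(scores)[-batch:]    # most anomalous candidates
                for i in queries:
                    if not ask_user(X[i]):               # user says "not an anomaly"
                        weights[i] *= 2.0                # nudge detector to treat it as normal
            return forest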
    Generalization-Memorization Machines. (arXiv:2207.03976v1 [cs.LG])
    Classifying the training data correctly without over-fitting is one of the goals in machine learning. In this paper, we propose a generalization-memorization mechanism, including a generalization-memorization decision and a memory modeling principle. Under this mechanism, error-based learning machines improve their memorization abilities on training data without over-fitting. Specifically, generalization-memorization machines (GMM) are proposed by applying this mechanism. The optimization problems in GMM are quadratic programming problems and can be solved efficiently. It should be noted that the recently proposed generalization-memorization kernel and the corresponding support vector machines are special cases of our GMM. Experimental results show the effectiveness of the proposed GMM both on memorization and generalization.  ( 2 min )
    Black and Gray Box Learning of Amplitude Equations: Application to Phase Field Systems. (arXiv:2207.03954v1 [stat.ML])
    We present a data-driven approach to learning surrogate models for amplitude equations, and illustrate its application to interfacial dynamics of phase field systems. In particular, we demonstrate learning effective partial differential equations describing the evolution of phase field interfaces from full phase field data. We illustrate this on a model phase field system, where analytical approximate equations for the dynamics of the phase field interface (a higher order eikonal equation and its approximation, the Kardar-Parisi-Zhang (KPZ) equation) are known. For this system, we discuss data-driven approaches for the identification of equations that accurately describe the front interface dynamics. When the analytical approximate models mentioned above become inaccurate, as we move beyond the region of validity of the underlying assumptions, the data-driven equations outperform them. In these regimes, going beyond black-box identification, we explore different approaches to learn data-driven corrections to the analytically approximate models, leading to effective gray box partial differential equations.  ( 2 min )
    Learning with Muscles: Benefits for Data-Efficiency and Robustness in Anthropomorphic Tasks. (arXiv:2207.03952v1 [cs.RO])
    Humans are able to outperform robots in terms of robustness, versatility, and learning of new tasks in a wide variety of movements. We hypothesize that highly nonlinear muscle dynamics play a large role in providing inherent stability, which is favorable to learning. While recent advances have been made in applying modern learning techniques to muscle-actuated systems both in simulation as well as in robotics, so far, no detailed analysis has been performed to show the benefits of muscles in this setting. Our study closes this gap by investigating core robotics challenges and comparing the performance of different actuator morphologies in terms of data-efficiency, hyperparameter sensitivity, and robustness.  ( 2 min )
    High Performance Simulation for Scalable Multi-Agent Reinforcement Learning. (arXiv:2207.03945v1 [cs.MA])
    Multi-agent reinforcement learning experiments and open-source training environments are typically limited in scale, supporting tens or sometimes up to hundreds of interacting agents. In this paper we demonstrate the use of Vogue, a high performance agent based model (ABM) framework. Vogue serves as a multi-agent training environment, supporting thousands to tens of thousands of interacting agents while maintaining high training throughput by running both the environment and reinforcement learning (RL) agents on the GPU. High performance multi-agent environments at this scale have the potential to enable the learning of robust and flexible policies for use in ABMs and simulations of complex systems. We demonstrate training performance with two newly developed, large scale multi-agent training environments. Moreover, we show that these environments can train shared RL policies on time-scales of minutes and hours.  ( 2 min )
    ControlBurn: Nonlinear Feature Selection with Sparse Tree Ensembles. (arXiv:2207.03935v1 [stat.ML])
    ControlBurn is a Python package to construct feature-sparse tree ensembles that support nonlinear feature selection and interpretable machine learning. The algorithms in this package first build large tree ensembles that prioritize basis functions with few features and then select a feature-sparse subset of these basis functions using a weighted lasso optimization criterion. The package includes visualizations to analyze the features selected by the ensemble and their impact on predictions. Hence ControlBurn offers the accuracy and flexibility of tree-ensemble models and the interpretability of sparse generalized additive models. ControlBurn is scalable and flexible: for example, it can use warm-start continuation to compute the regularization path (prediction error for any number of selected features) for a dataset with tens of thousands of samples and hundreds of features in seconds. For larger datasets, the runtime scales linearly in the number of samples and features (up to a log factor), and the package supports acceleration using sketching. Moreover, the ControlBurn framework accommodates feature costs, feature groupings, and $\ell_0$-based regularizers. The package is user-friendly and open-source: its documentation and source code appear on https://pypi.org/project/ControlBurn/ and https://github.com/udellgroup/controlburn/.  ( 2 min )
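    To make the two-stage recipe concrete without reproducing the package's API (which the links above document), here is a bare-bones sketch of the underlying idea under stated simplifications: grow a shallow forest, then run a lasso whose penalty on each tree scales with the number of distinct features that tree uses, and keep the features of the surviving trees. Column scaling is a standard trick for implementing a weighted lasso penalty.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import Lasso

        def sparse_tree_subset(X, y, alpha=0.01):
            # Stage 1: shallow trees, so each basis function touches few features.
            forest = RandomForestRegressor(n_estimators=200, max_depth=3,
                                           random_state=0).fit(X, y)
            preds = np.stack([t.predict(X) for t in forest.estimators_], axis=1)
            # Per-tree cost: number of distinct features it splits on (-2 marks leaves).
            n_feats = np.array([np.unique(t.tree_.feature[t.tree_.feature >= 0]).size
                                for t in forest.estimators_])
            n_feats = np.maximum(n_feats, 1)
            # Stage 2: weighted lasso via column scaling; costlier trees pay more penalty.
            lasso = Lasso(alpha=alpha).fit(preds / n_feats, y)
            kept = np.flatnonzero(lasso.coef_)
            features = sorted({f for i in kept
                               for f in forest.estimators_[i].tree_.feature if f >= 0})
            return kept, features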
    Memory-free Online Change-point Detection: A Novel Neural Network Approach. (arXiv:2207.03932v1 [cs.LG])
    Change-point detection (CPD), which detects abrupt changes in the data distribution, is recognized as one of the most significant tasks in time series analysis. Despite the extensive literature on offline CPD, unsupervised online CPD still suffers from major challenges, including scalability, hyperparameter tuning, and learning constraints. To mitigate some of these challenges, in this paper, we propose a novel deep learning approach for unsupervised online CPD from multi-dimensional time series, named Adaptive LSTM-Autoencoder Change-Point Detection (ALACPD). ALACPD exploits an LSTM-autoencoder-based neural network to perform unsupervised online CPD. It continuously adapts to the incoming samples without keeping the previously received input, thus being memory-free. We perform an extensive evaluation on several real-world time series CPD benchmarks. We show that ALACPD, on average, ranks first among state-of-the-art CPD algorithms in terms of quality of the time series segmentation, and it is on par with the best performer in terms of the accuracy of the estimated change-points. The implementation of ALACPD is available online at https://github.com/zahraatashgahi/ALACPD.  ( 2 min )
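    The reconstruction-error principle behind LSTM-autoencoder CPD fits in a few lines; the stripped-down sketch below is not ALACPD itself (it omits the adaptive, memory-free update scheme), and the layer sizes and threshold rule are placeholders.

        import torch
        import torch.nn as nn

        class LSTMAE(nn.Module):
            # Encode a window to a latent sequence, then decode it back;
            # large reconstruction error suggests the current regime has changed.
            def __init__(self, dim, hidden=32):
                super().__init__()
                self.enc = nn.LSTM(dim, hidden, batch_first=True)
                self.dec = nn.LSTM(hidden, dim, batch_first=True)

            def forward(self, x):                  # x: (batch, time, dim)
                h, _ = self.enc(x)
                out, _ = self.dec(h)
                return out

        def is_change_point(model, window, threshold):
            err = ((model(window) - window) ** 2).mean().item()
            return err > threshold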
    Generative Adversarial Networks and Other Generative Models. (arXiv:2207.03887v1 [cs.CV])
    Generative networks are fundamentally different in their aims and methods compared to CNNs for classification, segmentation, or object detection. They were initially not meant to be an image analysis tool, but to produce natural-looking images. The adversarial training paradigm was proposed to stabilize generative methods, and has proven to be highly successful -- though by no means from the first attempt. This chapter gives a basic introduction into the motivation for Generative Adversarial Networks (GANs) and traces the path of their success by abstracting the basic task and working mechanism, and deriving the difficulty of early practical approaches. Methods for more stable training will be shown, as well as typical signs of poor convergence and their reasons. Though this chapter focuses on GANs that are meant for image generation and image analysis, the adversarial training paradigm itself is not specific to images, and also generalizes to tasks in image analysis. Examples of architectures for image semantic segmentation and abnormality detection will be presented, before contrasting GANs with further generative modeling approaches that have lately entered the scene. This will allow a contextualized view of the limits but also the benefits of GANs.  ( 2 min )
    Storehouse: a Reinforcement Learning Environment for Optimizing Warehouse Management. (arXiv:2207.03851v1 [cs.LG])
    Warehouse Management Systems have been evolving and improving thanks to new Data Intelligence techniques. However, many current optimizations have been applied to specific cases or are in great need of manual interaction. This is where Reinforcement Learning techniques come into play, providing automatization and adaptability to current optimization policies. In this paper, we present Storehouse, a customizable environment that generalizes the definition of warehouse simulations for Reinforcement Learning. We also validate this environment against state-of-the-art reinforcement learning algorithms and compare these results to human and random policies.  ( 2 min )
    Towards Semantic Communication Protocols: A Probabilistic Logic Perspective. (arXiv:2207.03920v1 [cs.IT])
    Classical medium access control (MAC) protocols are interpretable, yet their task-agnostic control signaling messages (CMs) are ill-suited for emerging mission-critical applications. By contrast, neural network (NN) based protocol models (NPMs) learn to generate task-specific CMs, but their rationale and impact lack interpretability. To fill this void, in this article we propose, for the first time, a semantic protocol model (SPM) constructed by transforming an NPM into an interpretable symbolic graph written in the probabilistic logic programming language (ProbLog). This transformation is viable by extracting and merging common CMs and their connections while treating the NPM as a CM generator. By extensive simulations, we corroborate that the SPM tightly approximates its original NPM while occupying only 0.02% memory. By leveraging its interpretability and memory-efficiency, we demonstrate several SPM-enabled applications such as SPM reconfiguration for collision-avoidance, as well as comparing different SPMs via semantic entropy calculation and storing multiple SPMs to cope with non-stationary environments.  ( 2 min )
    Constrained Training of Neural Networks via Theorem Proving. (arXiv:2207.03880v1 [cs.AI])
    We introduce a theorem proving approach to the specification and generation of temporal logical constraints for training neural networks. We formalise a deep embedding of linear temporal logic over finite traces (LTL$_f$) and an associated evaluation function characterising its semantics within the higher-order logic of the Isabelle theorem prover. We then proceed to formalise a loss function $\mathcal{L}$ that we formally prove to be sound, and differentiable to a function $d\mathcal{L}$. We subsequently use Isabelle's automatic code generation mechanism to produce OCaml versions of LTL$_f$, $\mathcal{L}$ and $d\mathcal{L}$ that we integrate with PyTorch via OCaml bindings for Python. We show that, when used for training in an existing deep learning framework for dynamic movement, our approach produces expected results for common movement specification patterns such as obstacle avoidance and patrolling. The distinctive benefit of our approach is the fully rigorous method for constrained training, eliminating many of the risks inherent to ad-hoc implementations of logical aspects directly in an "unsafe" programming language such as Python.  ( 2 min )
    Ensemble random forest filter: An alternative to the ensemble Kalman filter for inverse modeling. (arXiv:2207.03909v1 [cs.LG])
    The ensemble random forest filter (ERFF) is presented as an alternative to the ensemble Kalman filter (EnKF) for the purpose of inverse modeling. The EnKF is a data assimilation approach that forecasts and updates parameter estimates sequentially in time as observations are being collected. The updating step is based on the experimental covariances computed from an ensemble of realizations, and the updates are given as linear combinations of the differences between observations and forecasted system state values. The ERFF replaces the linear combination in the update step with a non-linear function represented by a random forest. In this way, the non-linear relationships between the parameters to be updated and the observations can be captured and a better update produced. The ERFF is demonstrated for the purpose of log-conductivity identification from piezometric head observations in a number of scenarios with varying degrees of heterogeneity (log-conductivity variances going from 1 up to 6.25 (ln m/d)^2), number of realizations in the ensemble (50 or 100), and number of piezometric head observations (18 or 36). In all scenarios, the ERFF works well, being able to reconstruct the log-conductivity spatial heterogeneity while matching the observed piezometric heads at selected control points. For benchmarking purposes, the ERFF is compared to the restart EnKF, and we find that the ERFF is superior to the EnKF for the number of ensemble realizations used (small in typical EnKF applications). Only when the number of realizations grows to 500 is the restart EnKF able to match the performance of the ERFF, albeit at triple the computational cost.  ( 3 min )
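    One plausible reading of the non-linear update, sketched under loud assumptions: fit a multi-output random forest across the ensemble from forecasted observations to parameters, then evaluate it at perturbed copies of the real observation to produce an updated ensemble. The perturbation scale and the direct observation-to-parameter regression are simplifications of the paper's scheme, not its exact construction.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def erff_update(params_ens, obs_ens, obs_real, rng=None):
            # params_ens: (n_ens, n_param) forecast parameters;
            # obs_ens: (n_ens, n_obs) forecasted observations; obs_real: (n_obs,)
            rng = rng or np.random.default_rng(0)
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            rf.fit(obs_ens, params_ens)          # non-linear obs -> parameter map
            # Perturb the real observation per member to retain ensemble spread.
            noise = 0.1 * obs_ens.std(axis=0) * rng.standard_normal(obs_ens.shape)
            return rf.predict(obs_real[None, :] + noise)   # updated (n_ens, n_param)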
    GT4SD: Generative Toolkit for Scientific Discovery. (arXiv:2207.03928v1 [cs.LG])
    With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery at every step of the scientific method. Perhaps their most valuable application lies in speeding up what has traditionally been the slowest and most challenging step: coming up with a hypothesis. Powerful representations are now being learned from large volumes of data to generate novel hypotheses, which is making a big impact on scientific discovery applications ranging from material design to drug discovery. GT4SD (https://github.com/GT4SD/gt4sd-core) is an extensible open-source library that enables scientists, developers and researchers to train and use state-of-the-art generative models for hypothesis generation in scientific discovery. GT4SD supports a variety of uses of generative models across material science and drug discovery, including molecule discovery and design based on properties related to target proteins, omic profiles, scaffold distances, binding energies and more.  ( 2 min )
    Interaction Pattern Disentangling for Multi-Agent Reinforcement Learning. (arXiv:2207.03902v1 [cs.LG])
    Deep cooperative multi-agent reinforcement learning has demonstrated its remarkable success over a wide spectrum of complex control tasks. However, recent advances in multi-agent learning mainly focus on value decomposition while leaving entity interactions still intertwined, which easily leads to over-fitting on noisy interactions between entities. In this work, we introduce a novel interactiOn Pattern disenTangling (OPT) method, to disentangle not only the joint value function into agent-wise value functions for decentralized execution, but also the entity interactions into interaction prototypes, each of which represents an underlying interaction pattern within a sub-group of the entities. OPT facilitates filtering the noisy interactions between irrelevant entities and thus significantly improves generalizability as well as interpretability. Specifically, OPT introduces a sparse disagreement mechanism to encourage sparsity and diversity among discovered interaction prototypes. Then the model selectively restructures these prototypes into a compact interaction pattern by an aggregator with learnable weights. To alleviate the training instability issue caused by partial observability, we propose to maximize the mutual information between the aggregation weights and the history behaviors of each agent. Experiments on both single-task and multi-task benchmarks demonstrate that the proposed method yields results superior to the state-of-the-art counterparts. Our code will be made publicly available.  ( 2 min )
    UDRN: Unified Dimensional Reduction Neural Network for Feature Selection and Feature Projection. (arXiv:2207.03809v1 [cs.LG])
    Dimensional reduction (DR) maps high-dimensional data into a lower-dimensional latent space while minimizing a defined optimization objective. DR methods usually fall into feature selection (FS) and feature projection (FP). FS focuses on selecting a critical subset of dimensions but risks destroying the data distribution (structure). FP, on the other hand, combines all the input features into a lower-dimensional space, aiming to maintain the data structure, but lacks interpretability and sparsity. FS and FP are traditionally incompatible categories and have therefore not been unified into an amicable framework. We propose that the ideal DR approach combines both FS and FP into a unified end-to-end manifold learning framework, simultaneously performing fundamental feature discovery while maintaining the intrinsic relationships between data samples in the latent space. In this work, we develop such a unified framework, the Unified Dimensional Reduction Neural network (UDRN), which integrates FS and FP in a compatible, end-to-end way. We improve the neural network structure by implementing the FS and FP tasks separately with two stacked sub-networks. In addition, we design a data augmentation scheme for the DR process to improve the generalization ability of the method on extensive feature datasets, together with loss functions that can cooperate with the data augmentation. Extensive experimental results on four image and four biological datasets, including very high-dimensional data, demonstrate the advantages of UDRN over existing methods (FS, FP, and FS&FP pipelines), especially in downstream tasks such as classification and visualization.  ( 3 min )
    NExG: Provable and Guided State Space Exploration of Neural Network Control Systems using Sensitivity Approximation. (arXiv:2207.03884v1 [eess.SY])
    We propose a new technique for performing state space exploration of closed loop control systems with neural network feedback controllers. Our approach involves approximating the sensitivity of the trajectories of the closed loop dynamics. Using such an approximator and the system simulator, we present a guided state space exploration method that can generate trajectories visiting the neighborhood of a target state at a specified time. We present a theoretical framework which establishes that our method will produce a sequence of trajectories that reach a suitable neighborhood of the target state. We provide a thorough evaluation of our approach on various systems with neural network feedback controllers of different configurations. We outperform earlier state space exploration techniques and achieve significant improvements in both quality (explainability) and performance (convergence rate). Finally, we adapt our algorithm for the falsification of a class of temporal logic specifications, assess its performance against a state-of-the-art falsification tool, and show its potential in supplementing existing falsification algorithms.  ( 2 min )
    BAST: Binaural Audio Spectrogram Transformer for Binaural Sound Localization. (arXiv:2207.03927v1 [cs.SD])
    Accurate sound localization in a reverberant environment is essential for human auditory perception. Recently, Convolutional Neural Networks (CNNs) have been utilized to model the binaural human auditory pathway. However, CNNs struggle to capture global acoustic features. To address this issue, we propose a novel end-to-end Binaural Audio Spectrogram Transformer (BAST) model to predict the sound azimuth in both anechoic and reverberant environments. Two modes of implementation are explored, i.e. BAST-SP and BAST-NSP, corresponding to the BAST model with shared and non-shared parameters respectively. Our model with subtraction interaural integration and hybrid loss achieves an angular distance of 1.29 degrees and a Mean Square Error of 1e-3 across all azimuths, significantly surpassing the CNN-based model. The exploratory analysis of BAST's performance on the left-right hemifields and in anechoic and reverberant environments shows its generalization ability as well as the feasibility of binaural Transformers in sound localization. Furthermore, an analysis of the attention maps is provided to give additional insights on the interpretation of the localization process in a natural reverberant environment.  ( 2 min )
    Tightening Discretization-based MILP Models for the Pooling Problem using Upper Bounds on Bilinear Terms. (arXiv:2207.03699v1 [math.OC])
    Discretization-based methods have been proposed for solving nonconvex optimization problems with bilinear terms. These methods convert the original nonconvex optimization problems into mixed-integer linear programs (MILPs). Compared to the wide range of studies on methods for converting nonconvex optimization problems into MILPs, research on tightening the resulting MILP models is limited. In this paper, we present tightening constraints for the discretization-based MILP models for the pooling problem. Specifically, we study tightening constraints derived from upper bounds on bilinear terms and from the structures resulting from the discretization. We demonstrate the effectiveness of our constraints, showing computational results for MILP models derived from different formulations for (1) the pooling problem and (2) discretization-based pooling models. Computational results show that our methods reduce the computational time for MILP models on CPLEX 12.10. Finally, we note that while our methods are presented in the context of the pooling problem, they can be extended to other nonconvex optimization problems with upper bounds on bilinear terms.  ( 2 min )
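    For background (standard material, not a result of this paper), the bilinear term $w = xy$ over box bounds $x \in [x^L, x^U]$, $y \in [y^L, y^U]$ is classically relaxed by the McCormick envelopes, and a known upper bound $w \le u$ adds a further cut of the kind these tightening constraints exploit:

        $w \ge x^L y + y^L x - x^L y^L$, \quad $w \ge x^U y + y^U x - x^U y^U$,
        $w \le x^U y + y^L x - x^U y^L$, \quad $w \le x^L y + y^U x - x^L y^U$, \quad $w \le u$.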
    Big Learning: A Universal Machine Learning Paradigm?. (arXiv:2207.03899v1 [cs.LG])
    Recent breakthroughs based on big/foundation models reveal a vague avenue for artificial intelligence, that is, big data, big/foundation models, big learning, $\cdots$. Following that avenue, here we elaborate on the newly introduced big learning. Specifically, big learning comprehensively exploits the available information inherent in large-scale complete/incomplete data, by simultaneously learning to model many-to-all joint/conditional/marginal data distributions (thus named big learning) with one universal foundation model. We reveal that big learning is what existing foundation models are implicitly doing; accordingly, our big learning provides high-level guidance for flexible design and improvements of foundation models, accelerating true self-learning on the Internet. Besides, big learning ($i$) is equipped with marvelous flexibility for both training data and training-task customization; ($ii$) potentially delivers all joint/conditional/marginal data capabilities after training; ($iii$) significantly reduces the training-test gap with improved model generalization; and ($iv$) unifies conventional machine learning paradigms, e.g., supervised learning, unsupervised learning, generative learning, etc., and enables their flexible cooperation, manifesting a universal learning paradigm.  ( 2 min )
    Product Segmentation Newsvendor Problems: A Robust Learning Approach. (arXiv:2207.03801v1 [cs.LG])
    We propose and analyze a product segmentation newsvendor problem, which generalizes the phenomenon of segmentation sales of a class of perishable items. The product segmentation newsvendor problem is a new variant of the newsvendor problem, reflecting that sellers maximize profits by determining the inventory of the whole item in the context of uncertain demand for sub-items. We derive the closed-form robust ordering decision by assuming that the means and covariance matrix of the stochastic demand are available but not the distributions. However, robust approaches, which always hedge against the worst-case demand scenario, raise concerns about solution conservatism; thus, traditional robust schemes are often unsatisfactory. In this paper, we integrate robust optimization and deep reinforcement learning (DRL) techniques and propose a new paradigm termed robust learning to increase the attractiveness of robust policies. Notably, we take the robust decision as human domain knowledge and implement it into the training process of DRL by designing a full-process human-machine collaborative mechanism of teaching experience, normative decision, and regularization return. Simulation results confirm that our approach effectively improves robust performance and can generalize to various problems that require robust but less conservative solutions. Simultaneously, fewer training episodes, increased training stability, and interpretability of behavior may facilitate the deployment of DRL algorithms in operational practice. Furthermore, the successful attempt of RLDQN to solve 1000-dimensional demand scenarios reveals that the algorithm provides a path to solving complex operational problems through human-machine collaboration and may have potential significance for other complex operational management problems.  ( 3 min )
    Variational Inference of overparameterized Bayesian Neural Networks: a theoretical and empirical study. (arXiv:2207.03859v1 [stat.ML])
    This paper studies the Variational Inference (VI) used for training Bayesian Neural Networks (BNN) in the overparameterized regime, i.e., when the number of neurons tends to infinity. More specifically, we consider overparameterized two-layer BNNs and point out a critical issue in the mean-field VI training. This problem arises from the decomposition of the lower bound on the evidence (ELBO) into two terms: one corresponding to the likelihood function of the model and the second to the Kullback-Leibler (KL) divergence between the prior distribution and the variational posterior. In particular, we show both theoretically and empirically that there is a trade-off between these two terms in the overparameterized regime only when the KL is appropriately re-scaled with respect to the ratio between the number of observations and neurons. We also illustrate our theoretical results with numerical experiments that highlight the critical choice of this ratio.  ( 2 min )
    Encoding NetFlows for State-Machine Learning. (arXiv:2207.03890v1 [cs.LG])
    NetFlow data is a well-known network log format used by many network analysts and researchers. The advantages of using this format compared to pcap are that it contains less data, is less privacy intrusive, and is easier to collect and process. However, having less data does mean that this format might not be able to capture important network behaviour as all information is summarised into statistics. Much research aims to overcome this disadvantage through the use of machine learning, for instance, to detect attacks within a network. Many approaches can be used to pre-process the NetFlow data before it is used to train the machine learning algorithms. However, many of these approaches simply apply existing methods to the data, not considering the specific properties of network data. We argue that for data originating from software systems, such as NetFlow or software logs, similarities in frequency and contexts of feature values are more important than similarities in the value itself. In this work, we, therefore, propose an encoding algorithm that directly takes the frequency and the context of the feature values into account when the data is being processed. Different types of network behaviours can be clustered using this encoding, thus aiding the process of detecting anomalies within the network. From windows of these clusters obtained from monitoring a clean system, we learn state machine behavioural models for anomaly detection. These models are very well-suited to modelling the cyclic and repetitive patterns present in NetFlow data. We evaluate our encoding on a new dataset that we created for detecting problems in Kubernetes clusters and on two well-known public NetFlow datasets. The obtained performance results of the state machine models are comparable to existing works that use many more features and require both clean and infected data as training input.  ( 3 min )
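    As an illustrative sketch of the frequency-oriented view (not the paper's exact encoding algorithm), the snippet below maps each value of a NetFlow feature to its relative occurrence frequency, so values that occur similarly often land close together regardless of their raw magnitude:

        from collections import Counter
        import numpy as np

        def frequency_encode(column):
            # map each raw value to its relative occurrence frequency in the column
            counts = Counter(column)
            freqs = np.array([counts[v] for v in column], dtype=float)
            return freqs / freqs.max()

        bytes_col = [40, 40, 1500, 40, 1500, 60, 987654]
        print(frequency_encode(bytes_col))  # the one-off value 987654 stands out as rare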
    Convolutional Neural Networks for Time-dependent Classification of Variable-length Time Series. (arXiv:2207.03718v1 [cs.LG])
    Time series data are often obtained only within a limited time range due to interruptions during the observation process. To classify such partial time series, we need to account for 1) variable-length data drawn from 2) different timestamps. To address the first problem, existing convolutional neural networks use global pooling after the convolutional layers to cancel out length differences. This architecture suffers from a trade-off between incorporating entire temporal correlations in long data and avoiding feature collapse for short data. To resolve this trade-off, we propose Adaptive Multi-scale Pooling, which aggregates features from an adaptive number of layers, i.e., only the first few layers for short data and more layers for long data. Furthermore, to address the second problem, we introduce Temporal Encoding, which embeds the observation timestamps into the intermediate features. Experiments on our private dataset and the UCR/UEA time series archive show that our modules improve classification accuracy, especially on short data obtained as partial time series.  ( 2 min )
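    A minimal PyTorch sketch of the adaptive idea follows; the rule for choosing the number of layers and all sizes are assumptions for illustration, not the paper's architecture: short inputs are pooled from only the first few convolutional layers, longer inputs from more.

        import math
        import torch
        import torch.nn as nn

        class AdaptiveMultiScalePooling(nn.Module):
            def __init__(self, channels=64, num_layers=4, min_len=8):
                super().__init__()
                self.min_len = min_len
                self.convs = nn.ModuleList(
                    [nn.Conv1d(1 if i == 0 else channels, channels, 3, padding=1)
                     for i in range(num_layers)]
                )

            def forward(self, x):  # x: (batch, 1, length)
                # assumed rule: one more layer per doubling of length beyond min_len
                k = min(len(self.convs), max(1, int(math.log2(x.size(-1) / self.min_len)) + 1))
                h, pooled = x, []
                for conv in self.convs[:k]:
                    h = torch.relu(conv(h))
                    pooled.append(h.mean(dim=-1))          # global average pool per layer
                return torch.stack(pooled).mean(dim=0)     # aggregate over the k layers

        feats = AdaptiveMultiScalePooling()(torch.randn(2, 1, 32))
        print(feats.shape)  # torch.Size([2, 64])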
    A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning. (arXiv:2207.03784v1 [cs.LG])
    Proxy-based Deep Metric Learning (DML) learns deep representations by embedding images close to their class representatives (proxies), commonly with respect to the angle between them. However, this disregards the embedding norm, which can carry additional beneficial context such as class- or image-intrinsic uncertainty. In addition, proxy-based DML struggles to learn class-internal structures. To address both issues at once, we introduce non-isotropic probabilistic proxy-based DML. We model images as directional von Mises-Fisher (vMF) distributions on the hypersphere that can reflect image-intrinsic uncertainties. Further, we derive non-isotropic von Mises-Fisher (nivMF) distributions for class proxies to better represent complex class-specific variances. To measure the proxy-to-image distance between these models, we develop and investigate multiple distribution-to-point and distribution-to-distribution metrics. Each framework choice is motivated by a set of ablation studies, which showcase beneficial properties of our probabilistic approach to proxy-based DML, such as uncertainty-awareness, better-behaved gradients during training, and overall improved generalization performance. The latter is especially reflected in the competitive performance on the standard DML benchmarks, where our approach compares favorably, suggesting that existing proxy-based DML can significantly benefit from a more probabilistic treatment. Code is available at github.com/ExplainableML/Probabilistic_Deep_Metric_Learning.  ( 2 min )
    Combining Deep Learning with Good Old-Fashioned Machine Learning. (arXiv:2207.03757v1 [cs.LG])
    We present a comprehensive, stacking-based framework for combining deep learning with good old-fashioned machine learning, called Deep GOld. Our framework involves ensemble selection from 51 retrained pretrained deep networks as first-level models, and 10 machine-learning algorithms as second-level models. Enabled by today's state-of-the-art software tools and hardware platforms, Deep GOld delivers consistent improvement when tested on four image-classification datasets: Fashion MNIST, CIFAR10, CIFAR100, and Tiny ImageNet. In all but 10 of 120 experiments, Deep GOld improved the original networks' performance.  ( 2 min )
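    The stacking recipe itself is easy to sketch. Below, tiny MLPs stand in for the 51 retrained deep networks and a random forest serves as the second-level learner; the models and dataset are placeholders, not the paper's setup:

        import numpy as np
        from sklearn.datasets import load_digits
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier
        from sklearn.ensemble import RandomForestClassifier

        X, y = load_digits(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # first level: several "deep" models (stand-ins for retrained pretrained nets)
        first_level = [MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                                     random_state=s).fit(X_tr, y_tr) for s in range(3)]

        # second level: classical ML trained on the concatenated class probabilities
        meta_tr = np.hstack([m.predict_proba(X_tr) for m in first_level])
        meta_te = np.hstack([m.predict_proba(X_te) for m in first_level])
        meta = RandomForestClassifier(random_state=0).fit(meta_tr, y_tr)
        print("stacked accuracy:", meta.score(meta_te, y_te))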
    Safe reinforcement learning for multi-energy management systems with known constraint functions. (arXiv:2207.03830v1 [eess.SY])
    Reinforcement learning (RL) is a promising optimal control technique for multi-energy management systems. It does not require a model a priori - reducing the upfront and ongoing project-specific engineering effort - and is capable of learning better representations of the underlying system dynamics. However, vanilla RL does not provide constraint satisfaction guarantees, resulting in various unsafe interactions within its safety-critical environment. In this paper, we present two novel safe RL methods, namely SafeFallback and GiveSafe, where the safety constraint formulation is decoupled from the RL formulation and which provide hard-constraint satisfaction guarantees both during training (exploration) and exploitation of the (close-to) optimal policy. In a simulated multi-energy systems case study we have shown that both methods start with a significantly higher utility (i.e. useful policy) compared to a vanilla RL benchmark (94.6% and 82.8% compared to 35.5%) and that the proposed SafeFallback method can even outperform the vanilla RL benchmark (102.9% vs. 100%). We conclude that both methods are viable safety-constraint handling techniques applicable beyond RL, as demonstrated with random agents while still providing hard-constraint guarantees. Finally, we propose fundamental future work to, among other things, improve the constraint functions themselves as more data becomes available.  ( 3 min )
    On the Subspace Structure of Gradient-Based Meta-Learning. (arXiv:2207.03804v1 [cs.LG])
    In this work we provide an analysis of the distribution of the post-adaptation parameters of Gradient-Based Meta-Learning (GBML) methods. Previous work has noticed how, in the case of image classification, this adaptation only takes place in the last layers of the network. We propose the more general notion that parameters are updated over a low-dimensional subspace of the same dimensionality as the task-space and show that this holds for regression as well. Furthermore, the induced subspace structure provides a method to estimate the intrinsic dimension of the space of tasks of common few-shot learning datasets.  ( 2 min )
    A Survey on Participant Selection for Federated Learning in Mobile Networks. (arXiv:2207.03681v1 [cs.DC])
    Federated Learning (FL) is an efficient distributed machine learning paradigm that employs private datasets in a privacy-preserving manner. The main challenges of FL are that end devices usually possess various computation and communication capabilities and that their training data are not independent and identically distributed (non-IID). Due to limited communication bandwidth and the unstable availability of such devices in a mobile network, only a fraction of end devices (also referred to as the participants or clients in an FL process) can be selected in each round. Hence, it is of paramount importance to utilize an efficient participant selection scheme to maximize the performance of FL, including final model accuracy and training time. In this paper, we provide a review of participant selection techniques for FL. First, we introduce FL and highlight the main challenges during participant selection. Then, we review the existing studies and categorize them based on their solutions. Finally, we provide some future directions on participant selection for FL based on our analysis of the state-of-the-art in this topic area.  ( 2 min )
    Private independence testing across two parties. (arXiv:2207.03652v1 [math.ST])
    We introduce $\pi$-test, a privacy-preserving algorithm for testing statistical independence between data distributed across multiple parties. Our algorithm relies on privately estimating the distance correlation between datasets, a quantitative measure of independence introduced in Székely et al. [2007]. We establish both additive and multiplicative error bounds on the utility of our differentially private test, which we believe will find applications in a variety of distributed hypothesis testing settings involving sensitive data.  ( 2 min )
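    The underlying (non-private) statistic is simple to compute; the sketch below implements the sample distance correlation of Székely et al. [2007] for one-dimensional data using double-centered distance matrices:

        import numpy as np

        def _centered(a):
            d = np.abs(a[:, None] - a[None, :])   # pairwise distances of 1-D samples
            return d - d.mean(0) - d.mean(1)[:, None] + d.mean()

        def distance_correlation(x, y):
            A, B = _centered(np.asarray(x, float)), _centered(np.asarray(y, float))
            dcov2 = (A * B).mean()                # squared sample distance covariance
            return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

        rng = np.random.default_rng(0)
        x = rng.normal(size=500)
        print(distance_correlation(x, x ** 2))                # dependent: clearly non-zero
        print(distance_correlation(x, rng.normal(size=500)))  # independent: near zero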
    Stability of Aggregation Graph Neural Networks. (arXiv:2207.03678v1 [cs.LG])
    In this paper we study the stability properties of aggregation graph neural networks (Agg-GNNs) considering perturbations of the underlying graph. An Agg-GNN is a hybrid architecture where information is defined on the nodes of a graph, but it is processed block-wise by Euclidean CNNs on the nodes after several diffusions on the graph shift operator. We derive stability bounds for the mapping operator associated to a generic Agg-GNN, and we specify conditions under which such operators can be stable to deformations. We prove that the stability bounds are defined by the properties of the filters in the first layer of the CNN that acts on each node. Additionally, we show that there is a close relationship between the number of aggregations, the filter's selectivity, and the size of the stability constants. We also conclude that in Agg-GNNs the selectivity of the mapping operators is tied to the properties of the filters only in the first layer of the CNN stage. This shows a substantial difference with respect to the stability properties of selection GNNs, where the selectivity of the filters in all layers is constrained by their stability. We provide numerical evidence corroborating the derived results, testing the behavior of Agg-GNNs in real-life application scenarios with perturbations of different magnitudes.  ( 2 min )
    Tackling Data Heterogeneity: A New Unified Framework for Decentralized SGD with Sample-induced Topology. (arXiv:2207.03730v1 [math.OC])
    We develop a general framework unifying several gradient-based stochastic optimization methods for empirical risk minimization problems both in centralized and distributed scenarios. The framework hinges on the introduction of an augmented graph consisting of nodes modeling the samples and edges modeling both the inter-device communication and intra-device stochastic gradient computation. By designing properly the topology of the augmented graph, we are able to recover as special cases the renowned Local-SGD and DSGD algorithms, and provide a unified perspective for variance-reduction (VR) and gradient-tracking (GT) methods such as SAGA, Local-SVRG and GT-SAGA. We also provide a unified convergence analysis for smooth and (strongly) convex objectives relying on a proper structured Lyapunov function, and the obtained rate can recover the best known results for many existing algorithms. The rate results further reveal that VR and GT methods can effectively eliminate data heterogeneity within and across devices, respectively, enabling the exact convergence of the algorithm to the optimal solution. Numerical experiments confirm the findings in this paper.  ( 2 min )
    Predicting Li-ion Battery Cycle Life with LSTM RNN. (arXiv:2207.03687v1 [cs.LG])
    Efficient and accurate remaining useful life prediction is a key factor for reliable and safe usage of lithium-ion batteries. This work trains a long short-term memory recurrent neural network model to learn from sequential data of discharge capacities at various cycles and voltages, and to work as a cycle life predictor for battery cells cycled under different conditions. Using experimental data from the first 60-80 cycles, our model achieves promising prediction accuracy on test sets of around 80 samples.  ( 2 min )
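    A minimal PyTorch sketch of such a predictor follows; the feature dimension, sequence length, and targets are placeholders rather than the paper's dataset:

        import torch
        import torch.nn as nn

        class CycleLifeLSTM(nn.Module):
            def __init__(self, n_features=8, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)

            def forward(self, x):                     # x: (batch, cycles, n_features)
                _, (h, _) = self.lstm(x)
                return self.head(h[-1]).squeeze(-1)   # predicted cycle life per cell

        model = CycleLifeLSTM()
        x = torch.randn(16, 80, 8)                    # 80 early cycles per cell
        y = torch.rand(16) * 2000                     # cycle-life targets
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()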
    Deep Learning for Anomaly Detection in Log Data: A Survey. (arXiv:2207.03820v1 [cs.LG])
    Automatic log file analysis enables early detection of relevant incidents such as system failures. In particular, self-learning anomaly detection techniques capture patterns in log data and subsequently report unexpected log event occurrences to system operators without the need to provide or manually model anomalous scenarios in advance. Recently, an increasing number of approaches leveraging deep learning neural networks for this purpose have been presented. These approaches have demonstrated superior detection performance in comparison to conventional machine learning techniques and simultaneously resolve issues with unstable data formats. However, there exist many different architectures for deep learning and it is non-trivial to encode raw and unstructured log data to be analyzed by neural networks. We therefore carry out a systematic literature review that provides an overview of deployed models, data pre-processing mechanisms, anomaly detection techniques, and evaluations. The survey does not quantitatively compare existing approaches but instead aims to help readers understand relevant aspects of different model architectures and emphasizes open issues for future work.  ( 2 min )
    Nonparametric Embeddings of Sparse High-Order Interaction Events. (arXiv:2207.03639v1 [cs.LG])
    High-order interaction events are common in real-world applications. Learning embeddings that encode the complex relationships of the participants from these events is of great importance in knowledge mining and predictive tasks. Despite the success of existing approaches, e.g., Poisson tensor factorization, they ignore the sparse structure underlying the data, namely that the observed interactions are far fewer than the possible interactions among all the participants. In this paper, we propose Nonparametric Embeddings of Sparse High-order interaction events (NESH). We hybridize a sparse hypergraph (tensor) process and a matrix Gaussian process to capture both the asymptotic structural sparsity within the interactions and the nonlinear temporal relationships between the participants. We prove strong asymptotic bounds (including both a lower and an upper bound) on the sparsity ratio, which reveals the asymptotic properties of the sampled structure. We use batch-normalization, stick-breaking construction, and sparse variational GP approximations to develop an efficient, scalable model inference algorithm. We demonstrate the advantage of our approach in several real-world applications.  ( 2 min )
    The Power of Transfer Learning in Agricultural Applications: AgriNet. (arXiv:2207.03881v1 [cs.CV])
    Advances in deep learning and transfer learning have paved the way for various automated classification tasks in agriculture, including plant disease, pest, weed, and plant species detection. However, agriculture automation still faces various challenges, such as the limited size of datasets and the absence of plant-domain-specific pretrained models. Domain-specific pretrained models have shown state-of-the-art performance in various computer vision tasks, including face recognition and medical imaging diagnosis. In this paper, we propose the AgriNet dataset, a collection of 160k agricultural images from more than 19 geographical locations, several image-capturing devices, and more than 423 classes of plant species and diseases. We also introduce AgriNet models, a set of pretrained models on five ImageNet architectures: VGG16, VGG19, Inception-v3, InceptionResNet-v2, and Xception. AgriNet-VGG19 achieved the highest classification accuracy of 94% and the highest F1-score of 92%. Additionally, all proposed models were found to accurately classify the 423 classes of plant species, diseases, pests, and weeds, with a minimum accuracy of 87% for the Inception-v3 model. Finally, experiments evaluating the superiority of AgriNet models over ImageNet models were conducted on two external datasets: a pest and plant diseases dataset from Bangladesh and a plant diseases dataset from Kashmir.  ( 2 min )
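    The transfer-learning recipe can be sketched generically in PyTorch: start from an ImageNet-pretrained VGG19 and retrain the classifier head for the 423 agricultural classes. This mirrors the approach but does not load the released AgriNet weights:

        import torch.nn as nn
        from torchvision import models

        num_classes = 423
        model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        for p in model.features.parameters():     # freeze the convolutional backbone
            p.requires_grad = False
        model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)
        # ... then train model.classifier on the agricultural images as usual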
    End-to-End Binaural Speech Synthesis. (arXiv:2207.03697v1 [cs.SD])
    In this work, we present an end-to-end binaural speech synthesis system that combines a low-bitrate audio codec with a powerful binaural decoder that is capable of accurate speech binauralization while faithfully reconstructing environmental factors like ambient noise or reverb. The network is a modified vector-quantized variational autoencoder, trained with several carefully designed objectives, including an adversarial loss. We evaluate the proposed system on an internal binaural dataset with objective metrics and a perceptual study. Results show that the proposed approach matches the ground truth data more closely than previous methods. In particular, we demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.  ( 2 min )
    GCN-based Multi-task Representation Learning for Anomaly Detection in Attributed Networks. (arXiv:2207.03688v1 [cs.LG])
    Anomaly detection in attributed networks has received considerable attention in recent years due to its applications in a wide range of domains such as finance, network security, and medicine. Traditional approaches cannot be directly adopted in attributed network settings to solve the anomaly detection problem. The main limitation of such approaches is that they inherently ignore the relational information between data features. With a rapid explosion in deep learning- and graph neural network-based techniques, spotting rare objects in attributed networks has significantly stepped forward owing to the potential of deep techniques in extracting complex relationships. In this paper, we propose a new architecture for anomaly detection. The main goal of this architecture is to utilize multi-task learning to enhance detection performance. Multi-task learning-based anomaly detection is still in its infancy and only a few studies in the existing literature have catered to it. We incorporate both community detection and multi-view representation learning techniques to extract distinct and complementary information from attributed networks and subsequently fuse the captured information to achieve a better detection result. The mutual collaboration between the two main components employed in this architecture, i.e., community-specific learning and multi-view representation learning, exhibits a promising solution for reaching more effective results.  ( 3 min )
    Video Dialog as Conversation about Objects Living in Space-Time. (arXiv:2207.03656v1 [cs.CV])
    It would be a technological feat to be able to create a system that can hold a meaningful conversation with humans about what they watch. A setup toward that goal is presented as a video dialog task, where the system is asked to generate natural utterances in response to a question in an ongoing dialog. The task poses great visual, linguistic, and reasoning challenges that cannot be easily overcome without an appropriate representation scheme over video and dialog that supports high-level reasoning. To tackle these challenges we present a new object-centric framework for video dialog that supports neural reasoning dubbed COST - which stands for Conversation about Objects in Space-Time. Here dynamic space-time visual content in videos is first parsed into object trajectories. Given this video abstraction, COST maintains and tracks object-associated dialog states, which are updated upon receiving new questions. Object interactions are dynamically and conditionally inferred for each question, and these serve as the basis for relational reasoning among them. COST also maintains a history of previous answers, and this allows retrieval of relevant object-centric information to enrich the answer forming process. Language production then proceeds in a step-wise manner, taking into account the current utterance, the existing dialog, and the current question. We evaluate COST on the DSTC7 and DSTC8 benchmarks, demonstrating its competitiveness against the state of the art.  ( 3 min )
    Guiding the retraining of convolutional neural networks against adversarial inputs. (arXiv:2207.03689v1 [cs.SE])
    Background: When using deep learning models, there are many possible vulnerabilities and some of the most worrying are adversarial inputs, which can cause wrong decisions with minor perturbations. Therefore, it becomes necessary to retrain these models against adversarial inputs, as part of the software testing process addressing the vulnerability to these inputs. Furthermore, for energy-efficient testing and retraining, data scientists need support in choosing the best guidance metrics and optimal dataset configurations. Aims: We examined four guidance metrics for retraining convolutional neural networks and three retraining configurations. Our goal is to improve the models against adversarial inputs regarding accuracy, resource utilization and time from the point of view of a data scientist in the context of image classification. Method: We conducted an empirical study on two datasets for image classification. We explore: (a) the accuracy, resource utilization and time of retraining convolutional neural networks by ordering the new training set by four different guidance metrics (neuron coverage, likelihood-based surprise adequacy, distance-based surprise adequacy and random), (b) the accuracy and resource utilization of retraining convolutional neural networks with three different configurations (from scratch and augmented dataset, using weights and augmented dataset, and using weights and only adversarial inputs). Results: We reveal that retraining with adversarial inputs from original weights and by ordering with surprise adequacy metrics gives the best model w.r.t. the used metrics. Conclusions: Although more studies are necessary, we recommend that data scientists use the above configuration and metrics to deal with the vulnerability to adversarial inputs of deep learning models, as they can improve their models against adversarial inputs without using many inputs.  ( 3 min )
    Balanced Self-Paced Learning for AUC Maximization. (arXiv:2207.03650v1 [cs.LG])
    Learning to improve AUC performance is an important topic in machine learning. However, AUC maximization algorithms may decrease generalization performance due to noisy data. Self-paced learning is an effective method for handling noisy data. However, existing self-paced learning methods are limited to pointwise learning, while AUC maximization is a pairwise learning problem. To solve this challenging problem, we innovatively propose a balanced self-paced AUC maximization algorithm (BSPAUC). Specifically, we first provide a statistical objective for self-paced AUC. Based on this, we propose our self-paced AUC maximization formulation, where a novel balanced self-paced regularization term is embedded to ensure that the selected positive and negative samples have proper proportions. In particular, the sub-problem with respect to all weight variables may be non-convex in our formulation, whereas it is normally convex in existing self-paced problems. To address this, we propose a doubly cyclic block coordinate descent method. More importantly, we prove that the sub-problem with respect to all weight variables converges to a stationary point on the basis of closed-form solutions, and our BSPAUC converges to a stationary point of our fixed optimization objective under a mild assumption. Considering both the deep learning and kernel-based implementations, experimental results on several large-scale datasets demonstrate that our BSPAUC has a better generalization performance than existing state-of-the-art AUC maximization methods.  ( 2 min )
    Information-Gathering in Latent Bandits. (arXiv:2207.03635v1 [cs.LG])
    In the latent bandit problem, the learner has access to reward distributions and, for the non-stationary variant, transition models of the environment. The reward distributions are conditioned on the arm and unknown latent states. The goal is to use the reward history to identify the latent state, allowing for the optimal choice of arms in the future. The latent bandit setting lends itself to many practical applications, such as recommender and decision support systems, where rich data allows the offline estimation of environment models with online learning remaining a critical component. Previous solutions in this setting always choose the highest reward arm according to the agent's beliefs about the state, not explicitly considering the value of information-gathering arms. Such information-gathering arms do not necessarily provide the highest reward, thus may never be chosen by an agent that chooses the highest reward arms at all times. In this paper, we present a method for information-gathering in latent bandits. Given particular reward structures and transition matrices, we show that choosing the best arm given the agent's beliefs about the states incurs higher regret. Furthermore, we show that by choosing arms carefully, we obtain an improved estimation of the state distribution, and thus lower the cumulative regret through better arm choices in the future. We evaluate our method on both synthetic and real-world data sets, showing significant improvement in regret over state-of-the-art methods.  ( 3 min )
    Getting BART to Ride the Idiomatic Train: Learning to Represent Idiomatic Expressions. (arXiv:2207.03679v1 [cs.CL])
    Idiomatic expressions (IEs), characterized by their non-compositionality, are an important part of natural language. They have been a classical challenge to NLP, including pre-trained language models that drive today's state-of-the-art. Prior work has identified deficiencies in their contextualized representation stemming from the underlying compositional paradigm of representation. In this work, we take a first-principles approach to build idiomaticity into BART using an adapter as a lightweight non-compositional language expert trained on idiomatic sentences. The improved capability over baselines (e.g., BART) is seen via intrinsic and extrinsic methods, where idiom embeddings score 0.19 points higher in homogeneity score for embedding clustering, and up to 25% higher sequence accuracy on the idiom processing tasks of IE sense disambiguation and span detection.  ( 2 min )
    Abs-CAM: A Gradient Optimization Interpretable Approach for Explanation of Convolutional Neural Networks. (arXiv:2207.03648v1 [cs.CV])
    The black-box nature of Deep Neural Networks (DNNs) severely hinders performance improvement and application in specific scenarios. In recent years, class activation mapping-based methods have been widely used to interpret the internal decisions of models in computer vision tasks. However, when such a method uses backpropagation to obtain gradients, noise appears in the saliency map, and features irrelevant to the decision may even be located. In this paper, we propose an Absolute value Class Activation Mapping-based (Abs-CAM) method, which optimizes the gradients derived from backpropagation and turns all of them into positive gradients to enhance the visual features of the output neurons' activation and improve the localization ability of the saliency map. The framework of Abs-CAM is divided into two phases: generating the initial saliency map and generating the final saliency map. The first phase improves the localization ability of the saliency map by optimizing the gradient, and the second phase linearly combines the initial saliency map with the original image to enhance the semantic information of the saliency map. We conduct qualitative and quantitative evaluations of the proposed method, including Deletion, Insertion, and Pointing Game. The experimental results show that Abs-CAM effectively eliminates noise in the saliency map, better locates the features related to decisions, and is superior to previous methods in recognition and localization tasks.  ( 3 min )
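    A hedged sketch of the core mechanism follows: gradients of the class score with respect to the last convolutional feature maps are converted to absolute values before weighting the activations. The backbone, hooks, and input are illustrative, not the authors' code:

        import torch
        import torch.nn.functional as F
        from torchvision import models

        model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1).eval()
        feats, grads = {}, {}
        model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
        model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

        x = torch.randn(1, 3, 224, 224)              # stand-in input image
        model(x)[0].max().backward()                 # backprop the top class score

        weights = grads["a"].abs().mean(dim=(2, 3), keepdim=True)  # absolute gradients
        cam = F.relu((weights * feats["a"]).sum(dim=1))            # weighted activations
        cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]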
    SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning. (arXiv:2207.03677v1 [cs.CV])
    Neural architecture search (NAS) has demonstrated amazing success in searching for efficient deep neural networks (DNNs) from a given supernet. In parallel, the lottery ticket hypothesis has shown that DNNs contain small subnetworks that can be trained from scratch to achieve a comparable or higher accuracy than original DNNs. As such, it is currently a common practice to develop efficient DNNs via a pipeline of first search and then prune. Nevertheless, doing so often requires a search-train-prune-retrain process and thus prohibitive computational cost. In this paper, we discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term as SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training. Finally, we evaluate whether such identified SuperTickets drawn from one task can transfer well to other tasks, validating their potential of handling multiple tasks simultaneously. Extensive experiments and ablation studies on three tasks and four benchmark datasets validate that our proposed SuperTickets achieve better accuracy-efficiency trade-offs than both typical NAS and pruning pipelines, regardless of whether retraining is used. Codes and pretrained models are available at https://github.com/RICE-EIC/SuperTickets.  ( 3 min )
    A Support Vector Model of Pruning Trees Evaluation Based on OTSU Algorithm. (arXiv:2207.03638v1 [cs.CV])
    The tree pruning process is key to promoting fruit growth and improving production, owing to its effects on the photosynthetic efficiency of fruits and nutrient transportation in branches. Currently, pruning is still highly dependent on human labor, and workers' experience strongly affects the robustness of pruning performance. Thus, it is a challenge for workers and farmers to evaluate pruning performance. Toward a better solution to this problem, this paper presents a novel pruning classification model called "OTSU-SVM" that evaluates pruning performance based on the shadows of branches and leaves. The model considers not only the available illuminated area of the tree but also the uniformity of that area. More importantly, we incorporate the OTSU algorithm into the model, which substantially reinforces the robustness of its evaluation. Data from pear trees in the Yuhang District, Hangzhou, is used in the experiment. We show that the OTSU-SVM achieves good accuracy (80%) and high performance in evaluating the pruning of pear trees. Applied in an orchard, it can support more successful pruning, which broadens the illuminated area of individual fruits and increases nutrient transportation from the target branch, elevating fruit weight and production.  ( 3 min )
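    The shadow-segmentation step can be sketched with Otsu's method, which picks a threshold separating shadowed from illuminated canopy pixels automatically; the synthetic image and the illuminated-area fraction below are illustrative stand-ins for features fed to the SVM:

        import numpy as np
        from skimage.filters import threshold_otsu

        rng = np.random.default_rng(0)
        # synthetic grayscale "canopy" image: dark shadow patch on bright ground
        img = np.full((100, 100), 200, dtype=float) + rng.normal(0, 5, (100, 100))
        img[30:70, 30:70] = 60 + rng.normal(0, 5, (40, 40))

        t = threshold_otsu(img)
        illuminated = img > t
        print("illuminated fraction:", illuminated.mean())  # candidate input feature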
    Generalization Guarantee of Training Graph Convolutional Networks with Graph Topology Sampling. (arXiv:2207.03584v1 [cs.LG])
    Graph convolutional networks (GCNs) have recently achieved great empirical success in learning graph-structured data. To address its scalability issue due to the recursive embedding of neighboring features, graph topology sampling has been proposed to reduce the memory and computational cost of training GCNs, and it has achieved comparable test performance to those without topology sampling in many empirical studies. To the best of our knowledge, this paper provides the first theoretical justification of graph topology sampling in training (up to) three-layer GCNs for semi-supervised node classification. We formally characterize some sufficient conditions on graph topology sampling such that GCN training leads to a diminishing generalization error. Moreover, our method tackles the nonconvex interaction of weights across layers, which is under-explored in the existing theoretical analyses of GCNs. This paper characterizes the impact of graph structures and topology sampling on the generalization performance and sample complexity explicitly, and the theoretical findings are also justified through numerical experiments.  ( 2 min )
    Individual Preference Stability for Clustering. (arXiv:2207.03600v1 [cs.LG])
    In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.  ( 2 min )
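    The definition translates directly into a check; the sketch below verifies IP-stability of a given Euclidean clustering by comparing each point's average distance to its own cluster against every other cluster:

        import numpy as np

        def is_ip_stable(X, labels):
            D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
            for i in range(len(X)):
                own = labels == labels[i]
                own[i] = False                       # exclude the point itself
                if not own.any():
                    continue                         # singleton clusters are trivially fine
                own_avg = D[i, own].mean()
                for c in set(labels) - {labels[i]}:
                    if own_avg > D[i, labels == c].mean():
                        return False
            return True

        X = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6]])
        print(is_ip_stable(X, np.array([0, 0, 1, 1])))   # True: well-separated pairs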
    Pruning Early Exit Networks. (arXiv:2207.03644v1 [cs.LG])
    Deep learning models that perform well often have high computational costs. In this paper, we combine two approaches that try to reduce the computational cost while keeping the model performance high: pruning and early exit networks. We evaluate two approaches of pruning early exit networks: (1) pruning the entire network at once, (2) pruning the base network and additional linear classifiers in an ordered fashion. Experimental results show that pruning the entire network at once is a better strategy in general. However, at high accuracy rates, the two approaches have a similar performance, which implies that the processes of pruning and early exit can be separated without loss of optimality.  ( 2 min )
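    Strategy (1), pruning the entire network at once, can be sketched with PyTorch's global magnitude pruning; the network below is a stand-in, and in an early-exit model the parameter list would also include the exit classifiers:

        import torch.nn as nn
        import torch.nn.utils.prune as prune

        net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 10))
        params = [(m, "weight") for m in net.modules() if isinstance(m, nn.Linear)]
        prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.5)

        sparsity = sum((m.weight == 0).sum().item() for m, _ in params) / \
                   sum(m.weight.numel() for m, _ in params)
        print(f"global sparsity: {sparsity:.0%}")   # ~50% of weights zeroed network-wide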
    PoseGU: 3D Human Pose Estimation with Novel Human Pose Generator and Unbiased Learning. (arXiv:2207.03618v1 [cs.CV])
    3D pose estimation has recently gained substantial interest in the computer vision domain. Existing 3D pose estimation methods rely heavily on large, well-annotated 3D pose datasets, and they suffer from poor generalization on unseen poses due to the limited diversity of 3D poses in training sets. In this work, we propose PoseGU, a novel human pose generator that generates diverse poses with access to only a small number of seed samples, while employing Counterfactual Risk Minimization to pursue an unbiased evaluation objective. Extensive experiments demonstrate PoseGU outperforms almost all the state-of-the-art 3D human pose methods under consideration over three popular benchmark datasets. Empirical analysis also proves PoseGU generates 3D poses with improved data diversity and better generalization ability.  ( 2 min )
    Robustness Evaluation of Deep Unsupervised Learning Algorithms for Intrusion Detection Systems. (arXiv:2207.03576v1 [cs.CR])
    Recently, advances in deep learning have been observed in various fields, including computer vision, natural language processing, and cybersecurity. Machine learning (ML) has demonstrated its ability as a potential tool for anomaly detection-based intrusion detection systems to build secure computer networks. ML approaches are increasingly adopted over heuristic approaches for cybersecurity because they learn directly from data. Data is critical for the development of ML systems, and becomes a potential target for attackers. Data poisoning or contamination is one of the most common techniques used to fool ML models through data. This paper evaluates the robustness of six recent deep learning algorithms for intrusion detection on contaminated data. Our experiments suggest that the state-of-the-art algorithms used in this study are sensitive to data contamination and reveal the importance of self-defense against data perturbation when developing novel models, especially for intrusion detection systems.  ( 2 min )
    One for All: Simultaneous Metric and Preference Learning over Multiple Users. (arXiv:2207.03609v1 [stat.ML])
    This paper investigates simultaneous preference and metric learning from a crowd of respondents. We are given a set of items represented by $d$-dimensional feature vectors, along with paired comparisons of the form "item $i$ is preferable to item $j$" made by each user. Our model jointly learns a distance metric that characterizes the crowd's general measure of item similarities along with a latent ideal point for each user reflecting their individual preferences. This model has the flexibility to capture individual preferences, while enjoying a metric learning sample cost that is amortized over the crowd. We first study this problem in a noiseless, continuous response setting (i.e., responses equal to differences of item distances) to understand the fundamental limits of learning. Next, we establish prediction error guarantees for noisy, binary measurements such as may be collected from human respondents, and show how the sample complexity improves when the underlying metric is low-rank. Finally, we establish recovery guarantees under assumptions on the response distribution. We demonstrate the performance of our model on both simulated data and on a dataset of color preference judgements across a large number of users.  ( 2 min )
    Hyper-Universal Policy Approximation: Learning to Generate Actions from a Single Image using Hypernets. (arXiv:2207.03593v1 [cs.LG])
    Inspired by Gibson's notion of object affordances in human vision, we ask the question: how can an agent learn to predict an entire action policy for a novel object or environment given only a single glimpse? To tackle this problem, we introduce the concept of Universal Policy Functions (UPFs) which are state-to-action mappings that generalize not only to new goals but most importantly to novel, unseen environments. Specifically, we consider the problem of efficiently learning such policies for agents with limited computational and communication capacity, constraints that are frequently encountered in edge devices. We propose the Hyper-Universal Policy Approximator (HUPA), a hypernetwork-based model to generate small task- and environment-conditional policy networks from a single image, with good generalization properties. Our results show that HUPAs significantly outperform an embedding-based alternative for generated policies that are size-constrained. Although this work is restricted to a simple map-based navigation task, future work includes applying the principles behind HUPAs to learning more general affordances for objects and environments.  ( 2 min )
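    A minimal hypernetwork sketch in this spirit follows: an embedding (e.g., of a single image) is mapped to the weights of a small policy network. All sizes and the single-linear generator are illustrative assumptions, not the paper's architecture:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        obs_dim, act_dim, hid = 16, 4, 32

        class PolicyHypernet(nn.Module):
            def __init__(self, embed_dim=128):
                super().__init__()
                n_params = (obs_dim + 1) * hid + (hid + 1) * act_dim
                self.generator = nn.Linear(embed_dim, n_params)  # emits policy weights

            def forward(self, embedding, obs):
                w = self.generator(embedding)
                i = 0
                W1 = w[i:i + obs_dim * hid].view(hid, obs_dim); i += obs_dim * hid
                b1 = w[i:i + hid]; i += hid
                W2 = w[i:i + hid * act_dim].view(act_dim, hid); i += hid * act_dim
                b2 = w[i:i + act_dim]
                h = torch.tanh(F.linear(obs, W1, b1))
                return F.linear(h, W2, b2)            # action logits for this environment

        hyper = PolicyHypernet()
        logits = hyper(torch.randn(128), torch.randn(5, obs_dim))  # 5 observations
        print(logits.shape)  # torch.Size([5, 4])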
    A Study on the Predictability of Sample Learning Consistency. (arXiv:2207.03571v1 [cs.LG])
    Curriculum Learning is a powerful training method that allows for faster and better training in some settings. This method, however, requires having a notion of which examples are difficult and which are easy, which is not always trivial to provide. A recent metric called C-Score acts as a proxy for example difficulty by relating it to learning consistency. Unfortunately, this method is quite compute intensive which limits its applicability for alternative datasets. In this work, we train models through different methods to predict C-Score for CIFAR-100 and CIFAR-10. We find, however, that these models generalize poorly both within the same distribution as well as out of distribution. This suggests that C-Score is not defined by the individual characteristics of each sample but rather by other factors. We hypothesize that a sample's relation to its neighbours, in particular, how many of them share the same labels, can help in explaining C-Scores. We plan to explore this in future work.  ( 2 min )
    Code Translation with Compiler Representations. (arXiv:2207.03578v1 [cs.PL])
    In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java - Rust pair. We extend previous test sets for code translation, by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as intermediary pivot for translation.  ( 2 min )
    Learning-based Autonomous Channel Access in the Presence of Hidden Terminals. (arXiv:2207.03605v1 [cs.LG])
    We consider the problem of autonomous channel access (AutoCA), where a group of terminals tries to discover a communication strategy with an access point (AP) via a common wireless channel in a distributed fashion. Due to the irregular topology and the limited communication range of terminals, a practical challenge for AutoCA is the hidden terminal problem, which is notorious in wireless networks for deteriorating the throughput and delay performances. To meet the challenge, this paper presents a new multi-agent deep reinforcement learning paradigm, dubbed MADRL-HT, tailored for AutoCA in the presence of hidden terminals. MADRL-HT exploits topological insights and transforms the observation space of each terminal into a scalable form independent of the number of terminals. To compensate for the partial observability, we put forth a look-back mechanism such that the terminals can infer behaviors of their hidden terminals from the carrier sensed channel states as well as feedback from the AP. A window-based global reward function is proposed, whereby the terminals are instructed to maximize the system throughput while balancing the terminals' transmission opportunities over the course of learning. Extensive numerical experiments verified the superior performance of our solution benchmarked against the legacy carrier-sense multiple access with collision avoidance (CSMA/CA) protocol.  ( 3 min )
    CausalAgents: A Robustness Benchmark for Motion Forecasting using Causal Relationships. (arXiv:2207.03586v1 [cs.LG])
    As machine learning models become increasingly prevalent in motion forecasting systems for autonomous vehicles (AVs), it is critical that we ensure that model predictions are safe and reliable. However, exhaustively collecting and labeling the data necessary to fully test the long tail of rare and challenging scenarios is difficult and expensive. In this work, we construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, or agents whose presence influences human driver behavior in any way, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We then evaluate a diverse set of state-of-the-art deep-learning model architectures on our proposed benchmark and find that all models exhibit large shifts under perturbation. Under non-causal perturbations, we observe a $25$-$38\%$ relative change in minADE as compared to the original. We then investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that drop agents throughout training. We plan to provide the causal agent labels as an additional attribute to WOMD and release the robustness benchmarks to aid the community in building more reliable and safe deep-learning models for motion forecasting.  ( 3 min )
    Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. (arXiv:2207.03615v1 [cs.LG])
    This paper analyzes the convergence and generalization of training a one-hidden-layer neural network when the input features follow a Gaussian mixture model consisting of a finite number of Gaussian distributions. Assuming the labels are generated from a teacher model with an unknown ground-truth weight, the learning problem is to estimate the underlying teacher model by minimizing a non-convex risk function over a student neural network. With a finite number of training samples, referred to as the sample complexity, the iterations are proved to converge linearly to a critical point with guaranteed generalization error. In addition, for the first time, this paper characterizes the impact of the input distributions on the sample complexity and the learning rate.  ( 2 min )
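    The data model in this abstract is easy to simulate; the following sketch (dimensions and the ReLU teacher are illustrative choices) generates Gaussian-mixture inputs and teacher labels of the kind analyzed.

        import numpy as np

        rng = np.random.default_rng(0)
        d, k, n = 10, 3, 1000                       # input dim, mixture components, samples
        means = rng.normal(size=(k, d))
        comp = rng.integers(k, size=n)
        X = means[comp] + rng.normal(size=(n, d))   # Gaussian mixture inputs

        W_teacher = rng.normal(size=(d, 8))           # unknown ground-truth weights
        y = np.maximum(X @ W_teacher, 0).sum(axis=1)  # one-hidden-layer ReLU teacher labels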
    Automatic Synthesis of Neurons for Recurrent Neural Nets. (arXiv:2207.03577v1 [cs.NE])
    We present a new class of neurons, ARNs, which achieve a cross entropy on test data up to three times lower than that of carefully optimized LSTM neurons. The large improvements that are often achieved are explained by elaborate skip connections through time, up to four internal memory states per neuron, and a number of novel activation functions including small quadratic forms. The new neurons were generated using automatic programming and are formulated as pure functional programs that can easily be transformed. We present experimental results for eight datasets and find excellent improvements for seven of them, while LSTM remained the best for one dataset. The results are promising enough that automatic programming to generate new neurons should become part of the standard operating procedure for any machine learning practitioner working on time series data such as sensor signals.  ( 2 min )
    Demystifying the Adversarial Robustness of Random Transformation Defenses. (arXiv:2207.03574v1 [cs.CR])
    Neural networks' lack of robustness against attacks raises concerns in security-sensitive settings such as autonomous vehicles. While many countermeasures may look promising, only a few withstand rigorous evaluation. Defenses using random transformations (RT) have shown impressive results, particularly BaRT (Raff et al., 2019) on ImageNet. However, this type of defense has not been rigorously evaluated, leaving its robustness properties poorly understood. Their stochastic properties make evaluation more challenging and render many proposed attacks on deterministic models inapplicable. First, we show that the BPDA attack (Athalye et al., 2018a) used in BaRT's evaluation is ineffective and likely overestimates its robustness. We then attempt to construct the strongest possible RT defense through the informed selection of transformations and Bayesian optimization for tuning their parameters. Furthermore, we create the strongest possible attack to evaluate our RT defense. Our new attack vastly outperforms the baseline, reducing the accuracy by 83% compared to the 19% reduction by the commonly used EoT attack ($4.3\times$ improvement). Our result indicates that the RT defense on the Imagenette dataset (a ten-class subset of ImageNet) is not robust against adversarial examples. Extending the study further, we use our new attack to adversarially train RT defense (called AdvRT), resulting in a large robustness gain. Code is available at https://github.com/wagnergroup/demystify-random-transform.  ( 3 min )
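    For readers unfamiliar with the EoT baseline mentioned above, a minimal sketch follows: it averages gradients over sampled random transformations (the transformation is assumed differentiable; this is the standard technique, not the paper's stronger attack).

        import torch

        def eot_gradient(model, transform, x, y, n_samples=16):
            # Expectation-over-Transformation: average the loss over random
            # transformations, then take the gradient w.r.t. the input.
            loss_fn = torch.nn.CrossEntropyLoss()
            x = x.clone().requires_grad_(True)
            total = 0.0
            for _ in range(n_samples):
                total = total + loss_fn(model(transform(x)), y)
            (total / n_samples).backward()
            return x.grad  # used as the attack direction, e.g. in PGD steps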
    Dynamic Community Detection via Adversarial Temporal Graph Representation Learning. (arXiv:2207.03580v1 [cs.SI])
    Dynamic community detection has prospered as a powerful tool for quantifying changes in dynamic brain network connectivity patterns by identifying strongly connected sets of nodes. However, as network science problems and network data become gradually more sophisticated, better methods are needed to efficiently learn low-dimensional representations from dynamic network data and reveal latent functions that change over time in the brain network. In this work, an adversarial temporal graph representation learning (ATGRL) framework is proposed to detect dynamic communities from a small sample of brain network data. It adopts a novel temporal graph attention network as an encoder to capture more efficient spatio-temporal features via attention mechanisms in both spatial and temporal dimensions. In addition, the framework employs adversarial training to guide the learning of the temporal graph representation and optimizes a measurable modularity loss to maximize the modularity of communities. Experiments on real-world brain network datasets demonstrate the effectiveness of this new method.  ( 2 min )
    Convolution Neural Network based Mode Decomposition for Degenerated Modes via Multiple Images from Polarizers. (arXiv:2207.03489v1 [cs.CV])
    In this paper, a mode decomposition (MD) method for degenerate modes is studied. A convolutional neural network (CNN) is applied to image training and prediction of the mode coefficients. The four-fold degenerate $LP_{11}$ series is the target to be decomposed. Multiple images are used as input to decompose the degenerate modes. A total of seven different images, including the full original near-field image, images after linear polarizers in four directions (0$^\circ$, 45$^\circ$, 90$^\circ$, and 135$^\circ$), and images after two circular polarizers (right-handed and left-handed), were considered for training, validation, and testing. The output labels of the model were chosen as the real and imaginary components of the mode coefficients, and the loss function was the root-mean-square (RMS) error of the labels. The RMS and mean-absolute-error (MAE) of the label, intensity, phase, and field correlation between the actual and predicted values were selected as the metrics to evaluate the CNN model. The CNN model was trained with 100,000 three-dimensional images with depths of three, four, and seven. The trained model, evaluated on 10,000 test samples with four sets of images - images after three linear polarizers (0$^\circ$, 45$^\circ$, 90$^\circ$) and the image after a right-handed circular polarizer - achieved a label RMS of 0.0634, an intensity RMS of 0.0292, a phase MAE of 0.1867 rad, and an average field correlation of 0.9978. The four-image sets showed at least a 50.68\% performance enhancement compared to models considering only images after linear polarizers.  ( 3 min )
    An Embedding-Dynamic Approach to Self-supervised Learning. (arXiv:2207.03552v1 [cs.CV])
    A number of recent self-supervised learning methods have shown impressive performance on image classification and other tasks. A somewhat bewildering variety of techniques have been used, not always with a clear understanding of the reasons for their benefits, especially when used in combination. Here we treat the embeddings of images as point particles and consider model optimization as a dynamic process on this system of particles. Our dynamic model combines an attractive force for similar images, a locally dispersive force to avoid local collapse, and a global dispersive force to achieve a globally homogeneous distribution of particles. The dynamic perspective highlights the advantage of using a delayed-parameter image embedding (a la BYOL) together with multiple views of the same image. It also uses a purely dynamic local dispersive force (Brownian motion) that shows improved performance over other methods and does not require knowledge of other particle coordinates. The method is called MSBReg, which stands for (i) a Multiview centroid loss, which applies an attractive force to pull different image view embeddings toward their centroid, (ii) a Singular value loss, which pushes the particle system toward spatially homogeneous density, and (iii) a Brownian diffusive loss. We evaluate the downstream classification performance of MSBReg on ImageNet as well as on transfer learning tasks including fine-grained classification, multi-class object classification, object detection, and instance segmentation. In addition, we show that applying our regularization term to other methods further improves their performance and stabilizes training by preventing mode collapse.  ( 3 min )
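    As a flavor of the components, here is a sketch of the multiview centroid term as we read it (shapes and normalization are assumptions): each view's embedding is pulled toward the centroid of all views of the same image.

        import torch

        def multiview_centroid_loss(views):
            # views: (n_views, batch, dim) tensor of L2-normalized embeddings
            centroid = views.mean(dim=0, keepdim=True)
            return ((views - centroid) ** 2).sum(dim=-1).mean()

        views = torch.nn.functional.normalize(torch.randn(4, 32, 128), dim=-1)
        print(multiview_centroid_loss(views))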
    On Non-Linear operators for Geometric Deep Learning. (arXiv:2207.03485v1 [cs.LG])
    This work studies operators mapping vector and scalar fields defined over a manifold $\mathcal{M}$, and which commute with its group of diffeomorphisms $\text{Diff}(\mathcal{M})$. We prove that in the case of scalar fields $L^p_\omega(\mathcal{M},\mathbb{R})$, those operators correspond to point-wise non-linearities, recovering and extending known results on $\mathbb{R}^d$. In the context of Neural Networks defined over $\mathcal{M}$, it indicates that point-wise non-linear operators are the only universal family that commutes with any group of symmetries, and justifies their systematic use in combination with dedicated linear operators commuting with specific symmetries. In the case of vector fields $L^p_\omega(\mathcal{M},T\mathcal{M})$, we show that those operators are solely the scalar multiplication. It indicates that $\text{Diff}(\mathcal{M})$ is too rich and that there is no universal class of non-linear operators to motivate the design of Neural Networks over the symmetries of $\mathcal{M}$.  ( 2 min )
    TF-GNN: Graph Neural Networks in TensorFlow. (arXiv:2207.03522v1 [cs.LG])
    TensorFlow GNN (TF-GNN) is a scalable library for Graph Neural Networks in TensorFlow. It is designed from the bottom up to support the kinds of rich heterogeneous graph data that occurs in today's information ecosystems. Many production models at Google use TF-GNN and it has been recently released as an open source project. In this paper, we describe the TF-GNN data model, its Keras modeling API, and relevant capabilities such as graph sampling, distributed training, and accelerator support.  ( 2 min )
    Recent Results of Energy Disaggregation with Behind-the-Meter Solar Generation. (arXiv:2207.03490v1 [cs.LG])
    The rapid deployment of renewable generation such as photovoltaic (PV) generation brings great challenges to the resiliency of existing power systems. Because PV generation is volatile and typically invisible to the power system operator, estimating the generation and characterizing its uncertainty are urgently needed for operators to make insightful decisions. This paper summarizes our recent results on energy disaggregation at the substation level with behind-the-meter solar generation. We formulate the so-called ``partial label'' problem for energy disaggregation at substations, where the aggregate measurements contain the total consumption of multiple loads, and the existence of some loads is unknown. We develop two model-free disaggregation approaches based on deterministic dictionary learning and Bayesian dictionary learning, respectively. Unlike conventional methods which require fully annotated training data of individual loads, our approaches can extract load patterns given partially labeled aggregate data. Therefore, our partial label formulation is more applicable in the real world. Compared with deterministic dictionary learning, the Bayesian dictionary learning-based approach provides an uncertainty measure for the disaggregation results, at the cost of increased computational complexity. All the methods are validated by numerical experiments.  ( 2 min )
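    For intuition, a generic dictionary-learning step (standard scikit-learn, not the paper's partial-label method) on aggregate consumption windows might look like this; the data shape is illustrative.

        import numpy as np
        from sklearn.decomposition import DictionaryLearning

        X = np.random.rand(200, 48)   # e.g. 200 days of 48 half-hourly aggregate readings
        dl = DictionaryLearning(n_components=10,
                                transform_algorithm="lasso_lars", random_state=0)
        codes = dl.fit_transform(X)   # sparse activations of load patterns per day
        atoms = dl.components_        # learned dictionary of load patterns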
    HierarchicalForecast: A Python Benchmarking Framework for Hierarchical Forecasting. (arXiv:2207.03517v1 [stat.ML])
    Large collections of time series data are commonly organized into cross-sectional structures with different levels of aggregation; examples include product and geographical groupings. A necessary condition for coherent decision-making and planning with such data sets is for the disaggregated series' forecasts to add up exactly to the aggregated series' forecasts, which motivates the creation of novel hierarchical forecasting algorithms. The growing interest of the Machine Learning community in cross-sectional hierarchical forecasting systems suggests that we are in a propitious moment to ensure that scientific endeavors are grounded on sound baselines. For this reason, we put forward the HierarchicalForecast library, which contains preprocessed publicly available datasets, evaluation metrics, and a compiled set of statistical baseline models. Our Python-based framework aims to bridge the gap between statistical and econometric modeling and Machine Learning forecasting research. Code and documentation are available at https://github.com/Nixtla/hierarchicalforecast.  ( 2 min )
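    The coherence requirement is simple to state in code: with a summing matrix mapping bottom-level series to every node of the hierarchy, reconciled forecasts add up exactly. A tiny numpy sketch with an assumed two-level hierarchy:

        import numpy as np

        # One total series on top of three bottom-level series
        S = np.vstack([np.ones((1, 3)), np.eye(3)])   # summing matrix
        y_bottom = np.array([10.0, 5.0, 7.0])         # bottom-level forecasts
        y_all = S @ y_bottom   # [22, 10, 5, 7]: coherent by construction (bottom-up)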
    G2L: A Geometric Approach for Generating Pseudo-labels that Improve Transfer Learning. (arXiv:2207.03554v1 [cs.LG])
    Transfer learning is a deep-learning technique that ameliorates the problem of learning when human-annotated labels are expensive and limited. In place of such labels, it uses the previously trained weights from a well-chosen source model as the initial weights for training a base model on a new target dataset. We demonstrate a novel but general technique for automatically creating such source models. We generate pseudo-labels according to an efficient and extensible algorithm that is based on a classical result from the geometry of high dimensions, the Cayley-Menger determinant. This G2L (``geometry to label'') method incrementally builds up pseudo-labels using a greedy computation of hypervolume content. We demonstrate that the method is tunable with respect to expected accuracy, which can be forecast by an information-theoretic measure of dataset similarity (divergence) between source and target. The results of 280 experiments show that this mechanical technique generates base models that have similar or better transferability compared to a baseline of models trained on extensively human-annotated ImageNet1K labels, yielding an overall error decrease of 0.43\%, and an error decrease in 4 out of 5 divergent datasets tested.  ( 2 min )
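    The geometric primitive named here, the Cayley-Menger determinant, recovers a simplex's hypervolume from pairwise squared distances; the following self-contained sketch illustrates it (this is the classical formula, not the G2L pseudo-labeling code).

        import numpy as np
        from math import factorial

        def simplex_volume(points):
            points = np.asarray(points, dtype=float)
            n = len(points)                                   # n points -> (n-1)-simplex
            d2 = np.sum((points[:, None] - points[None, :]) ** 2, axis=-1)
            B = np.ones((n + 1, n + 1))                       # bordered squared-distance matrix
            B[0, 0] = 0.0
            B[1:, 1:] = d2
            k = n - 1
            v2 = (-1) ** (k + 1) / (2 ** k * factorial(k) ** 2) * np.linalg.det(B)
            return np.sqrt(max(v2, 0.0))

        print(simplex_volume([[0, 0], [1, 0], [0, 1]]))       # 0.5, right-triangle area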
    The use of deep learning enables high diagnostic accuracy in detecting syndesmotic instability on weight-bearing CT scanning. (arXiv:2207.03568v1 [eess.IV])
    Delayed diagnosis of syndesmosis instability can lead to significant morbidity and accelerated arthritic change in the ankle joint. Weight-bearing computed tomography (WBCT) has shown promising potential for early and reliable detection of isolated syndesmotic instability using 3D volumetric measurements. While these measurements have been reported to be highly accurate, they are also experience-dependent and time-consuming, and they require a particular 3D measurement software tool, which leads clinicians to still favor conventional diagnostic methods for syndesmotic instability. The purpose of this study was to increase accuracy, accelerate analysis time, and reduce inter-observer bias by automating 3D volume assessment of syndesmosis anatomy using WBCT scans. We conducted a retrospective study using previously collected WBCT scans of patients with unilateral syndesmotic instability. 144 bilateral ankle WBCT scans were evaluated (48 unstable, 96 control). We developed three deep learning (DL) models for analyzing WBCT scans to recognize syndesmosis instability. These included two state-of-the-art models (Model 1 - a 3D convolutional neural network [CNN], and Model 2 - a CNN with long short-term memory [LSTM]) and a new model (Model 3 - a differential CNN-LSTM) that we introduce in this study. Model 1 failed to analyze the WBCT scans (F1-score = 0). Model 2 misclassified only two cases (F1-score = 0.80). Model 3 outperformed Model 2 and achieved nearly perfect performance, misclassifying only one case in the control group as unstable (F1-score = 0.91) while being faster than Model 2.  ( 3 min )
    VMAS: A Vectorized Multi-Agent Simulator for Collective Robot Learning. (arXiv:2207.03530v1 [cs.RO])
    While many multi-robot coordination problems can be solved optimally by exact algorithms, solutions are often not scalable in the number of robots. Multi-Agent Reinforcement Learning (MARL) is gaining increasing attention in the robotics community as a promising solution to tackle such problems. Nevertheless, we still lack the tools that allow us to quickly and efficiently find solutions to large-scale collective learning tasks. In this work, we introduce the Vectorized Multi-Agent Simulator (VMAS). VMAS is an open-source framework designed for efficient MARL benchmarking. It is comprised of a vectorized 2D physics engine written in PyTorch and a set of twelve challenging multi-robot scenarios. Additional scenarios can be implemented through a simple and modular interface. We demonstrate how vectorization enables parallel simulation on accelerated hardware without added complexity. When comparing VMAS to OpenAI MPE, we show how MPE's execution time increases linearly in the number of simulations while VMAS is able to execute 30,000 parallel simulations in under 10s, proving to be more than 100x faster. Using VMAS's RLlib interface, we benchmark our multi-robot scenarios using various Proximal Policy Optimization (PPO)-based MARL algorithms. VMAS's scenarios prove challenging in orthogonal ways for state-of-the-art MARL algorithms. The VMAS framework is available at https://github.com/proroklab/VectorizedMultiAgentSimulator. A video of VMAS scenarios and experiments is available at https://youtu.be/aaDRYfiesAY.  ( 3 min )
    AVDDPG: Federated reinforcement learning applied to autonomous platoon control. (arXiv:2207.03484v1 [cs.LG])
    Since 2016, federated learning (FL) has been an evolving topic of discussion in the artificial intelligence (AI) research community. Applications of FL led to the development and study of federated reinforcement learning (FRL). Few works exist on the topic of FRL applied to autonomous vehicle (AV) platoons. In addition, most FRL works choose a single aggregation method (usually weight or gradient aggregation). We explore FRL's effectiveness as a means to improve AV platooning by designing and implementing an FRL framework atop a custom AV platoon environment. The application of FRL in AV platooning is studied under two scenarios: (1) Inter-platoon FRL (Inter-FRL), where FRL is applied to AVs across different platoons; (2) Intra-platoon FRL (Intra-FRL), where FRL is applied to AVs within a single platoon. Both Inter-FRL and Intra-FRL are applied to a custom AV platooning environment using both gradient and weight aggregation to observe the performance effects FRL can have on AV platoons relative to an AV platooning environment trained without FRL. It is concluded that Intra-FRL using weight aggregation (Intra-FRLWA) provides the best performance for controlling an AV platoon. In addition, we find that weight aggregation in FRL for AV platooning provides performance increases relative to gradient aggregation. Finally, a performance analysis is conducted for Intra-FRLWA versus a platooning environment without FRL for platoons of length 3, 4 and 5 vehicles. It is concluded that Intra-FRLWA largely outperforms the platooning environment that is trained without FRL.  ( 3 min )
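    The weight-aggregation variant that performs best here is, in outline, a FedAvg-style parameter average; a generic sketch (not the paper's implementation) is:

        import numpy as np

        def aggregate_weights(agent_weights):
            # agent_weights: one list of per-layer numpy arrays per AV agent;
            # average each layer across agents (uniform FedAvg-style weighting).
            return [np.mean(layers, axis=0) for layers in zip(*agent_weights)]

        w_a = [np.ones((4, 4)), np.zeros(4)]
        w_b = [np.zeros((4, 4)), np.ones(4)]
        avg = aggregate_weights([w_a, w_b])   # each entry averaged elementwise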
    Deep Learning to Jointly Schema Match, Impute, and Transform Databases. (arXiv:2207.03536v1 [cs.DB])
    An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.  ( 2 min )
    A Novel IoT-based Framework for Non-Invasive Human Hygiene Monitoring using Machine Learning Techniques. (arXiv:2207.03529v1 [cs.LG])
    People's personal hygiene habits speak volumes about how well they take care of their bodies and health in daily life. Maintaining good hygiene practices not only reduces the chances of contracting a disease but could also reduce the risk of spreading illness within the community. Given the current pandemic, daily habits such as washing hands or taking regular showers have taken primary importance among people, especially for the elderly population living alone at home or in an assisted living facility. This paper presents a novel and non-invasive framework for monitoring human hygiene using vibration sensors, where we adopt Machine Learning techniques. The approach is based on a combination of a geophone sensor, a digitizer, and a cost-efficient computer board in a practical enclosure. Monitoring daily hygiene routines may help healthcare professionals be proactive rather than reactive in identifying and controlling the spread of potential outbreaks within the community. The experimental results indicate that applying a Support Vector Machine (SVM) for binary classification exhibits a promising accuracy of ~95% in the classification of different hygiene habits. Furthermore, both tree-based classifiers (Random Forest and Decision Tree) outperform the other models by achieving the highest accuracy (100%), which shows that classifying hygiene events using vibration-based, non-invasive sensors is feasible for monitoring hygiene activity.  ( 3 min )
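    The classification step is standard once vibration windows are turned into feature vectors; a minimal scikit-learn sketch follows (features and labels below are random placeholders, not the study's data).

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC

        X = np.random.rand(500, 20)            # placeholder geophone-derived features
        y = np.random.randint(2, size=500)     # e.g. 0 = handwashing, 1 = shower (illustrative)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)
        print(clf.score(X_te, y_te))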
  • Open

    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v2 [stat.ML] UPDATED)
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, data augmentation is chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data and during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution and learning it using Bayesian model selection, which has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.  ( 2 min )
    $k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy. (arXiv:2206.12895v2 [cs.DS] UPDATED)
    When designing clustering algorithms, the choice of initial centers is crucial for the quality of the learned clusters. In this paper, we develop a new initialization scheme, called HST initialization, for the $k$-median problem in general metric spaces (e.g., discrete spaces induced by graphs), based on the construction of a metric embedding tree structure of the data. From the tree, we propose a novel and efficient search algorithm for good initial centers that can subsequently be used for the local search algorithm. Our proposed HST initialization can produce initial centers achieving lower errors than those from another popular initialization method, $k$-median++, with comparable efficiency. The HST initialization can also be extended to the setting of differential privacy (DP) to generate private initial centers. We show that the error from applying DP local search followed by our private HST initialization improves previous results on the approximation error, and approaches the lower bound within a small factor. Experiments justify the theory and demonstrate the effectiveness of our proposed method. Our approach can also be extended to the $k$-means problem.  ( 2 min )
    Predicting Opinion Dynamics via Sociologically-Informed Neural Networks. (arXiv:2207.03990v1 [cs.SI])
    Opinion formation and propagation are crucial phenomena in social networks and have been extensively studied across several disciplines. Traditionally, theoretical models of opinion dynamics have been proposed to describe the interactions between individuals (i.e., social interaction) and their impact on the evolution of collective opinions. Although these models can incorporate sociological and psychological knowledge on the mechanisms of social interaction, they demand extensive calibration with real data to make reliable predictions, requiring much time and effort. Recently, the widespread use of social media platforms provides new paradigms to learn deep learning models from a large volume of social media data. However, these methods ignore any scientific knowledge about the mechanism of social interaction. In this work, we present the first hybrid method called Sociologically-Informed Neural Network (SINN), which integrates theoretical models and social media data by transporting the concepts of physics-informed neural networks (PINNs) from natural science (i.e., physics) into social science (i.e., sociology and social psychology). In particular, we recast theoretical models as ordinary differential equations (ODEs). Then we train a neural network that simultaneously approximates the data and conforms to the ODEs that represent the social scientific knowledge. In addition, we extend PINNs by integrating matrix factorization and a language model to incorporate rich side information (e.g., user profiles) and structural knowledge (e.g., cluster structure of the social interaction network). Moreover, we develop an end-to-end training procedure for SINN, which involves Gumbel-Softmax approximation to include stochastic mechanisms of social interaction. Extensive experiments on real-world and synthetic datasets show SINN outperforms six baseline methods in predicting opinion dynamics.  ( 3 min )
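    The PINN-style core of such a method can be sketched compactly: fit observed opinions while penalizing the residual of an assumed opinion ODE du/dt = f(u). The network size, the weighting, and f are illustrative; the paper's full model adds matrix factorization, a language model, and Gumbel-Softmax machinery on top.

        import torch

        net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                  torch.nn.Linear(32, 1))

        def sinn_style_loss(t_data, u_data, t_coll, f, lam=1.0):
            data_loss = ((net(t_data) - u_data) ** 2).mean()   # fit observed opinions
            t = t_coll.clone().requires_grad_(True)
            u = net(t)
            du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
            ode_residual = ((du_dt - f(u)) ** 2).mean()        # conform to the assumed ODE
            return data_loss + lam * ode_residual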
    Variational Inference of overparameterized Bayesian Neural Networks: a theoretical and empirical study. (arXiv:2207.03859v1 [stat.ML])
    This paper studies the Variational Inference (VI) used for training Bayesian Neural Networks (BNN) in the overparameterized regime, i.e., when the number of neurons tends to infinity. More specifically, we consider overparameterized two-layer BNN and point out a critical issue in the mean-field VI training. This problem arises from the decomposition of the lower bound on the evidence (ELBO) into two terms: one corresponding to the likelihood function of the model and the second to the Kullback-Leibler (KL) divergence between the prior distribution and the variational posterior. In particular, we show both theoretically and empirically that there is a trade-off between these two terms in the overparameterized regime only when the KL is appropriately re-scaled with respect to the ratio between the number of observations and neurons. We also illustrate our theoretical results with numerical experiments that highlight the critical choice of this ratio.  ( 2 min )
    Uniform Consistency in Nonparametric Mixture Models. (arXiv:2108.14003v2 [math.ST] UPDATED)
    We study uniform consistency in nonparametric mixture models as well as closely related mixture of regression (also known as mixed regression) models, where the regression functions are allowed to be nonparametric and the error distributions are assumed to be convolutions of a Gaussian density. We construct uniformly consistent estimators under general conditions while simultaneously highlighting several pain points in extending existing pointwise consistency results to uniform results. The resulting analysis turns out to be nontrivial, and several novel technical tools are developed along the way. In the case of mixed regression, we prove $L^1$ convergence of the regression functions while allowing for the component regression functions to intersect arbitrarily often, which presents additional technical challenges. We also consider generalizations to general (i.e. non-convolutional) nonparametric mixtures.  ( 2 min )
    Optimal sizing of a holdout set for safe predictive model updating. (arXiv:2202.06374v3 [stat.ML] UPDATED)
    Predictive risk scores are increasingly used to guide clinical or other interventions in complex settings, particularly healthcare. Directly updating a risk score used to guide interventions leads to biased risk estimates. We propose updating using a `holdout set' -- a subset of the population that does not receive risk-score-guided interventions -- to prevent this. Since samples in the holdout set do not benefit from risk predictions, its size must trade off performance of the updated risk score whilst minimising the number of held out samples. We prove that this approach outperforms simple alternatives, and by defining a general loss function describe conditions under which an optimal holdout size (OHS) can be readily identified. We introduce parametric and semi-parametric algorithms for OHS estimation and demonstrate their use on a recent risk score for pre-eclampsia. Based on these results, we argue that a holdout set is a safe, viable and easily implemented means to safely update predictive risk scores.  ( 2 min )
    Your Policy Regularizer is Secretly an Adversary. (arXiv:2203.12592v4 [cs.LG] UPDATED)
    Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.  ( 2 min )
    A law of adversarial risk, interpolation, and label noise. (arXiv:2207.03933v1 [stat.ML])
    In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy under many circumstances. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the dependence of label noise and adversarial risk in terms of the data distribution. Our results are almost sharp without accounting for the inductive bias of the learning algorithm. We also show that inductive bias makes the effect of label noise much stronger.  ( 2 min )
    On the representation and learning of monotone triangular transport maps. (arXiv:2009.10303v2 [stat.ML] UPDATED)
    Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps, approximations of the Knothe-Rosenblatt (KR) rearrangement, are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.  ( 3 min )
    ControlBurn: Nonlinear Feature Selection with Sparse Tree Ensembles. (arXiv:2207.03935v1 [stat.ML])
    ControlBurn is a Python package to construct feature-sparse tree ensembles that support nonlinear feature selection and interpretable machine learning. The algorithms in this package first build large tree ensembles that prioritize basis functions with few features and then select a feature-sparse subset of these basis functions using a weighted lasso optimization criterion. The package includes visualizations to analyze the features selected by the ensemble and their impact on predictions. Hence ControlBurn offers the accuracy and flexibility of tree-ensemble models and the interpretability of sparse generalized additive models. ControlBurn is scalable and flexible: for example, it can use warm-start continuation to compute the regularization path (prediction error for any number of selected features) for a dataset with tens of thousands of samples and hundreds of features in seconds. For larger datasets, the runtime scales linearly in the number of samples and features (up to a log factor), and the package supports acceleration using sketching. Moreover, the ControlBurn framework accommodates feature costs, feature groupings, and $\ell_0$-based regularizers. The package is user-friendly and open-source: its documentation and source code appear on https://pypi.org/project/ControlBurn/ and https://github.com/udellgroup/controlburn/.  ( 2 min )
    Bayesian multi-objective optimization for stochastic simulators: an extension of the Pareto Active Learning method. (arXiv:2207.03842v1 [math.OC])
    This article focuses on the multi-objective optimization of stochastic simulators with high output variance, where the input space is finite and the objective functions are expensive to evaluate. We rely on Bayesian optimization algorithms, which use probabilistic models to make predictions about the functions to be optimized. The proposed approach is an extension of the Pareto Active Learning (PAL) algorithm for the estimation of Pareto-optimal solutions that makes it suitable for the stochastic setting. We named it Pareto Active Learning for Stochastic Simulators (PALS). The performance of PALS is assessed through numerical experiments over a set of bi-dimensional, bi-objective test problems. PALS exhibits superior performance when compared to other scalarization-based and random-search approaches.  ( 2 min )
    Supervising the Decoder of Variational Autoencoders to Improve Scientific Utility. (arXiv:2109.04561v3 [stat.ML] UPDATED)
    Probabilistic generative models are attractive for scientific modeling because their inferred parameters can be used to generate hypotheses and design experiments. This requires that the learned model provide an accurate representation of the input data and yield a latent space that effectively predicts outcomes relevant to the scientific question. Supervised Variational Autoencoders (SVAEs) have previously been used for this purpose, where a carefully designed decoder can be used as an interpretable generative model while the supervised objective ensures a predictive latent representation. Unfortunately, the supervised objective forces the encoder to learn a biased approximation to the generative posterior distribution, which renders the generative parameters unreliable when used in scientific models. This issue has remained undetected as reconstruction losses commonly used to evaluate model performance do not detect bias in the encoder. We address this previously-unreported issue by developing a second order supervision framework (SOS-VAE) that influences the decoder to induce a predictive latent representation. This ensures that the associated encoder maintains a reliable generative interpretation. We extend this technique to allow the user to trade-off some bias in the generative parameters for improved predictive performance, acting as an intermediate option between SVAEs and our new SOS-VAE. We also use this methodology to address missing data issues that often arise when combining recordings from multiple scientific experiments. We demonstrate the effectiveness of these developments using synthetic data and electrophysiological recordings with an emphasis on how our learned representations can be used to design scientific experiments.  ( 3 min )
    Fair Exploration via Axiomatic Bargaining. (arXiv:2106.02553v2 [cs.LG] UPDATED)
    Exploration is often necessary in online learning to maximize long-term reward, but it comes at the cost of short-term 'regret'. We study how this cost of exploration is shared across multiple groups. For example, in a clinical trial setting, patients who are assigned a sub-optimal treatment effectively incur the cost of exploration. When patients are associated with natural groups on the basis of, say, race or age, it is natural to ask whether the cost of exploration borne by any single group is 'fair'. So motivated, we introduce the 'grouped' bandit model. We leverage the theory of axiomatic bargaining, and the Nash bargaining solution in particular, to formalize what might constitute a fair division of the cost of exploration across groups. On the one hand, we show that any regret-optimal policy strikingly results in the least fair outcome: such policies will perversely leverage the most 'disadvantaged' groups when they can. More constructively, we derive policies that are optimally fair and simultaneously enjoy a small 'price of fairness'. We illustrate the relative merits of our algorithmic framework with a case study on contextual bandits for warfarin dosing where we are concerned with the cost of exploration across multiple races and age groups.  ( 3 min )
    Test Sample Accuracy Scales with Training Sample Density in Neural Networks. (arXiv:2106.08365v6 [cs.LG] UPDATED)
    Intuitively, one would expect accuracy of a trained neural network's prediction on test samples to correlate with how densely the samples are surrounded by seen training samples in representation space. We find that a bound on empirical training error smoothed across linear activation regions scales inversely with training sample density in representation space. Empirically, we verify this bound is a strong predictor of the inaccuracy of the network's prediction on test samples. For unseen test sets, including those with out-of-distribution samples, ranking test samples by their local region's error bound and discarding samples with the highest bounds raises prediction accuracy by up to 20% in absolute terms for image classification datasets, on average over thresholds.  ( 2 min )
    Bayesian Quantile and Expectile Optimisation. (arXiv:2001.04833v2 [stat.ML] UPDATED)
    Bayesian optimisation (BO) is widely used to optimise stochastic black box functions. While most BO approaches focus on optimising conditional expectations, many applications require risk-averse strategies and alternative criteria accounting for the distribution tails need to be considered. In this paper, we propose new variational models for Bayesian quantile and expectile regression that are well-suited for heteroscedastic noise settings. Our models consist of two latent Gaussian processes accounting respectively for the conditional quantile (or expectile) and the scale parameter of an asymmetric likelihood function. Furthermore, we propose two BO strategies based on max-value entropy search and Thompson sampling, that are tailored to such models and that can accommodate large batches of points. Contrary to existing BO approaches for risk-averse optimisation, our strategies can directly optimise for the quantile and expectile, without requiring replicating observations or assuming a parametric form for the noise. As illustrated in the experimental section, the proposed approach clearly outperforms the state of the art in the heteroscedastic, non-Gaussian case.  ( 2 min )
    On data-driven chance constraint learning for mixed-integer optimization problems. (arXiv:2207.03844v1 [math.OC])
    When dealing with real-world optimization problems, decision-makers usually face high levels of uncertainty associated with partial information, unknown parameters, or complex relationships between these and the problem decision variables. In this work, we develop a novel Chance Constraint Learning (CCL) methodology with a focus on mixed-integer linear optimization problems which combines ideas from the chance constraint and constraint learning literature. Chance constraints set a probabilistic confidence level for a single or a set of constraints to be fulfilled, whereas the constraint learning methodology aims to model the functional relationship between the problem variables through predictive models. One of the main issues when establishing a learned constraint arises when we need to set further bounds for its response variable: the fulfillment of these is directly related to the accuracy of the predictive model and its probabilistic behaviour. In this sense, CCL makes use of linearizable machine learning models to estimate conditional quantiles of the learned variables, providing a data-driven solution for chance constraints. Open-access software has been developed for practitioners. Furthermore, benefits from CCL have been tested in two real-world case studies, proving how robustness is added to optimal solutions when probabilistic bounds are set for learned constraints.  ( 2 min )
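    The quantile-estimation step at the heart of CCL can be illustrated with any model that supports quantile loss; the sketch below uses scikit-learn gradient boosting as a stand-in (the methodology's actual linearizable models may differ).

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        X = np.random.rand(1000, 5)
        y = X.sum(axis=1) + 0.1 * np.random.randn(1000)
        q_model = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)
        bound = q_model.predict(X[:3])   # estimated 95% conditional quantiles, usable
                                         # as a data-driven chance-constraint bound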
    Layer Adaptive Node Selection in Bayesian Neural Networks: Statistical Guarantees and Implementation Details. (arXiv:2108.11000v2 [stat.ML] UPDATED)
    Sparse deep neural networks have proven to be efficient for predictive model building in large-scale studies. Although several works have studied theoretical and numerical properties of sparse neural architectures, they have primarily focused on the edge selection. Sparsity through edge selection might be intuitively appealing; however, it does not necessarily reduce the structural complexity of a network. Instead pruning excessive nodes leads to a structurally sparse network with significant computational speedup during inference. To this end, we propose a Bayesian sparse solution using spike-and-slab Gaussian priors to allow for automatic node selection during training. The use of spike-and-slab prior alleviates the need of an ad-hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of traditional Markov Chain Monte Carlo (MCMC) implementation. In the context of node selection, we establish the fundamental result of variational posterior consistency together with the characterization of prior parameters. In contrast to the previous works, our theoretical development relaxes the assumptions of the equal number of nodes and uniform bounds on all network weights, thereby accommodating sparse networks with layer-dependent node structures or coefficient bounds. With a layer-wise characterization of prior inclusion probabilities, we discuss the optimal contraction rates of the variational posterior. We empirically demonstrate that our proposed approach outperforms the edge selection method in computational complexity with similar or better predictive performance. Our experimental evidence further substantiates that our theoretical work facilitates layer-wise optimal node recovery.  ( 3 min )
    Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities. (arXiv:2111.08851v3 [cs.LG] UPDATED)
    In recent times, deep neural networks achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this. One such method is the CORAL method, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while earlier experiments showed that CORAL's rank consistency is beneficial for performance, it is limited by a weight-sharing constraint in a neural network's fully connected output layer. We propose a new method for rank-consistent ordinal regression without this limitation. Our rank-consistent ordinal regression framework (CORN) achieves rank consistency by a novel training scheme. This training scheme uses conditional training sets to obtain the unconditional rank probabilities through applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method to utilize the ordinal target information, and the absence of the weight-sharing restriction improves the performance substantially compared to the CORAL reference approach.  ( 3 min )
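    The chain-rule step is short enough to sketch: the output layer produces conditional probabilities P(y > r_k | y > r_{k-1}), and cumulative products turn them into unconditional rank probabilities (a sketch of the idea; tensor shapes are assumptions).

        import torch

        def corn_rank_probas(logits):
            # logits: (batch, num_classes - 1) outputs of the binary tasks
            cond = torch.sigmoid(logits)        # P(y > r_k | y > r_{k-1})
            return torch.cumprod(cond, dim=1)   # P(y > r_k), rank-consistent by construction

        probas = corn_rank_probas(torch.randn(8, 4))
        ranks = (probas > 0.5).sum(dim=1)       # predicted ordinal labels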
    Black and Gray Box Learning of Amplitude Equations: Application to Phase Field Systems. (arXiv:2207.03954v1 [stat.ML])
    We present a data-driven approach to learning surrogate models for amplitude equations, and illustrate its application to interfacial dynamics of phase field systems. In particular, we demonstrate learning effective partial differential equations describing the evolution of phase field interfaces from full phase field data. We illustrate this on a model phase field system, where analytical approximate equations for the dynamics of the phase field interface (a higher order eikonal equation and its approximation, the Kardar-Parisi-Zhang (KPZ) equation) are known. For this system, we discuss data-driven approaches for the identification of equations that accurately describe the front interface dynamics. When the analytical approximate models mentioned above become inaccurate, as we move beyond the region of validity of the underlying assumptions, the data-driven equations outperform them. In these regimes, going beyond black-box identification, we explore different approaches to learn data-driven corrections to the analytically approximate models, leading to effective gray box partial differential equations.  ( 2 min )
    Feature Selection Methods for Uplift Modeling and Heterogeneous Treatment Effect. (arXiv:2005.03447v2 [cs.LG] UPDATED)
    Uplift modeling is a causal learning technique that estimates subgroup-level treatment effects. It is commonly used in industry and elsewhere for tasks such as targeting ads. In a typical setting, uplift models can take thousands of features as inputs, which is costly and results in problems such as overfitting and poor model interpretability. Consequently, there is a need to select a subset of the most important features for modeling. However, traditional methods for doing feature selection are not fit for the task because they are designed for standard machine learning models whose target is importantly different from uplift models. To address this, we introduce a set of feature selection methods explicitly designed for uplift modeling, drawing inspiration from statistics and information theory. We conduct empirical evaluations on the proposed methods on publicly available datasets, demonstrating the advantages of the proposed methods compared to traditional feature selection. We make the proposed methods publicly available as a part of the CausalML open-source package.  ( 2 min )
    Understanding Gradual Domain Adaptation: Improved Analysis, Optimal Path and Beyond. (arXiv:2204.08200v2 [cs.LG] UPDATED)
    The vast majority of existing algorithms for unsupervised domain adaptation (UDA) focus on adapting from a labeled source domain to an unlabeled target domain directly in a one-off way. Gradual domain adaptation (GDA), on the other hand, assumes a path of $(T-1)$ unlabeled intermediate domains bridging the source and target, and aims to provide better generalization in the target domain by leveraging the intermediate ones. Under certain assumptions, Kumar et al. (2020) proposed a simple algorithm, Gradual Self-Training, along with a generalization bound in the order of $e^{O(T)} \left(\varepsilon_0+O\left(\sqrt{\log(T)/n}\right)\right)$ for the target domain error, where $\varepsilon_0$ is the source domain error and $n$ is the data size of each domain. Due to the exponential factor, this upper bound becomes vacuous when $T$ is only moderately large. In this work, we analyze gradual self-training under more general and relaxed assumptions, and prove a significantly improved generalization bound as $\varepsilon_0+ O \left(T\Delta + T/\sqrt{n}\right) + \widetilde{O}\left(1/\sqrt{nT}\right)$, where $\Delta$ is the average distributional distance between consecutive domains. Compared with the existing bound with an exponential dependency on $T$ as a multiplicative factor, our bound only depends on $T$ linearly and additively. Perhaps more interestingly, our result implies the existence of an optimal choice of $T$ that minimizes the generalization error, and it also naturally suggests an optimal way to construct the path of intermediate domains so as to minimize the accumulative path length $T\Delta$ between the source and target. To corroborate the implications of our theory, we examine gradual self-training on multiple semi-synthetic and real datasets, which confirms our findings. We believe our insights provide a path forward toward the design of future GDA algorithms.  ( 3 min )
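    For reference, the gradual self-training recipe being analyzed is itself simple; in outline (fit/predict are stand-ins for any supervised learner):

        def gradual_self_train(model, X_src, y_src, intermediate_domains):
            # Kumar et al. (2020)-style loop: pseudo-label each intermediate
            # domain with the current model, retrain, and move along the path.
            model.fit(X_src, y_src)
            for X_t in intermediate_domains:   # ordered from source toward target
                pseudo_y = model.predict(X_t)
                model.fit(X_t, pseudo_y)
            return model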
    One for All: Simultaneous Metric and Preference Learning over Multiple Users. (arXiv:2207.03609v1 [stat.ML])
    This paper investigates simultaneous preference and metric learning from a crowd of respondents. We are given a set of items represented by $d$-dimensional feature vectors, along with paired comparisons of the form ``item $i$ is preferable to item $j$'' made by each user. Our model jointly learns a distance metric that characterizes the crowd's general measure of item similarities along with a latent ideal point for each user reflecting their individual preferences. This model has the flexibility to capture individual preferences, while enjoying a metric learning sample cost that is amortized over the crowd. We first study this problem in a noiseless, continuous response setting (i.e., responses equal to differences of item distances) to understand the fundamental limits of learning. Next, we establish prediction error guarantees for noisy, binary measurements such as may be collected from human respondents, and show how the sample complexity improves when the underlying metric is low-rank. Finally, we establish recovery guarantees under assumptions on the response distribution. We demonstrate the performance of our model on both simulated data and on a dataset of color preference judgements across a large number of users.  ( 2 min )
    Complementing Brightness Constancy with Deep Networks for Optical Flow Prediction. (arXiv:2207.03790v1 [cs.CV])
    State-of-the-art methods for optical flow estimation rely on deep learning, which requires complex sequential training schemes to reach optimal performance on real-world data. In this work, we introduce the COMBO deep network that explicitly exploits the brightness constancy (BC) model used in traditional methods. Since BC is an approximate physical model violated in several situations, we propose to train a physically-constrained network complemented with a data-driven network. We introduce a unique and meaningful flow decomposition between the physical prior and the data-driven complement, including an uncertainty quantification of the BC model. We derive a joint training scheme for learning the different components of the decomposition, ensuring optimal cooperation in a supervised as well as in a semi-supervised context. Experiments show that COMBO can improve performance over state-of-the-art supervised networks, e.g. RAFT, reaching state-of-the-art results on several benchmarks. We highlight how COMBO can leverage the BC model and adapt to its limitations. Finally, we show that our semi-supervised method can significantly simplify the training procedure.  ( 2 min )
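    The physical prior in question is the classical photometric (brightness constancy) penalty; a standard PyTorch sketch follows (warping details simplified, not COMBO's exact loss).

        import torch
        import torch.nn.functional as F

        def photometric_loss(img1, img2, flow):
            # img1, img2: (B, C, H, W); flow: (B, 2, H, W) in pixels
            B, _, H, W = img1.shape
            ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
            grid = torch.stack((xs, ys), dim=0).float().to(img1.device)
            coords = grid.unsqueeze(0) + flow
            gx = 2 * coords[:, 0] / (W - 1) - 1       # normalize to [-1, 1] for grid_sample
            gy = 2 * coords[:, 1] / (H - 1) - 1
            warped = F.grid_sample(img2, torch.stack((gx, gy), dim=-1),
                                   align_corners=True)
            return (img1 - warped).abs().mean()       # BC: I1(x) should match I2(x + flow)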
    A Non-isotropic Probabilistic Take on Proxy-based Deep Metric Learning. (arXiv:2207.03784v1 [cs.LG])
Proxy-based Deep Metric Learning (DML) learns deep representations by embedding images close to their class representatives (proxies), commonly with respect to the angle between them. However, this disregards the embedding norm, which can carry additional beneficial context such as class- or image-intrinsic uncertainty. In addition, proxy-based DML struggles to learn class-internal structures. To address both issues at once, we introduce non-isotropic probabilistic proxy-based DML. We model images as directional von Mises-Fisher (vMF) distributions on the hypersphere that can reflect image-intrinsic uncertainties. Further, we derive non-isotropic von Mises-Fisher (nivMF) distributions for class proxies to better represent complex class-specific variances. To measure the proxy-to-image distance between these models, we develop and investigate multiple distribution-to-point and distribution-to-distribution metrics. Each framework choice is motivated by a set of ablation studies, which showcase beneficial properties of our probabilistic approach to proxy-based DML, such as uncertainty-awareness, better-behaved gradients during training, and overall improved generalization performance. The latter is especially reflected in the competitive performance on the standard DML benchmarks, where our approach compares favorably, suggesting that existing proxy-based DML can significantly benefit from a more probabilistic treatment. Code is available at github.com/ExplainableML/Probabilistic_Deep_Metric_Learning.  ( 2 min )
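For background, the angle-based baseline the paper generalizes commonly optimizes a softmax over scaled cosine similarities between an embedding and the class proxies; the numpy sketch below shows that point-estimate baseline (illustrative only, not the paper's probabilistic vMF/nivMF code):

```python
import numpy as np

def proxy_softmax_loss(z, proxies, label, scale=16.0):
    """z: (d,) image embedding; proxies: (C, d) class proxies; label: class index."""
    z = z / np.linalg.norm(z)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    logits = scale * (p @ z)            # scaled cosine similarities
    logits -= logits.max()              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]
```

Note that normalizing z discards exactly the embedding-norm information the abstract argues can carry uncertainty, which is the gap the probabilistic treatment fills.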
    Nonparametric Embeddings of Sparse High-Order Interaction Events. (arXiv:2207.03639v1 [cs.LG])
High-order interaction events are common in real-world applications. Learning embeddings that encode the complex relationships of the participants from these events is of great importance in knowledge mining and predictive tasks. Despite the success of existing approaches, e.g. Poisson tensor factorization, they ignore the sparse structure underlying the data, namely that the observed interactions are far fewer than the possible interactions among all the participants. In this paper, we propose Nonparametric Embeddings of Sparse High-order interaction events (NESH). We hybridize a sparse hypergraph (tensor) process and a matrix Gaussian process to capture both the asymptotic structural sparsity within the interactions and nonlinear temporal relationships between the participants. We prove strong asymptotic bounds (including both a lower and an upper bound) of the sparsity ratio, which reveals the asymptotic properties of the sampled structure. We use batch-normalization, stick-breaking construction, and sparse variational GP approximations to develop an efficient, scalable model inference algorithm. We demonstrate the advantage of our approach in several real-world applications.  ( 2 min )

  • Open

    [P] A Website to generate Code Snippets, Regexes, Linux & Git & SQL Commands, HTML and CSS from a written description. Furthermore translate code snippets to many languages and get a regex explained in plain english. Moreover you can fix broken code snippets. All with the help of ML 🤖
Programming: Function from Description, Code to Explanation, Fix invalid Code, Translate Languages, Class from Description, Get Language from Code, Function from Docstring. Helpers: Regex from Description, Regex to Explanation, Linux Command, Get time complexity, Git Command from Description. Database: Text Description to SQL Command. Web: Generate HTML from Description, CSS from Description, Meta Tags from Description. I think this could be helpful to a lot of people (especially beginner programmers). You can check out all the functionalities on your own here: programming-helper.com Have fun using the tool ❤️ submitted by /u/Capital_Revolution35 [link] [comments]  ( 86 min )
    [R] META first neural view synthesis method for VR / passthrough AR
    submitted by /u/SpatialComputing [link] [comments]  ( 86 min )
    [Project] Parakeet — Copilot for Colab
Hello! I've long been a big fan of GitHub Copilot — I've used it for a while now, and I find it super helpful for all sorts of things. But Copilot doesn't work in Colab or Jupyter notebooks, even though that's where a ton of ML and data science code is written. Parakeet is a Chrome extension that provides Copilot-like code suggestions for notebooks. I've been using Parakeet for my own needs for a bit, and I'm already getting a lot of mileage out of it. Just the other day, for example, I wanted to make a Seaborn plot but wasn't sure how. I wrote a short comment, Parakeet suggested some code, and the code worked on the first try! Installation: Install from the Chrome Web Store. View source code. You'll need an email to sign up. Parakeet is currently free to use for everyone, though that may change once OpenAI introduces pricing for Codex. Demos: Generating code to plot a sine wave; plotting a heat map. All I had to do was write some comments — Parakeet's suggested code worked on the first try. Limitations: Parakeet currently only works for Colab, though I'm considering extending Parakeet to support Jupyter. If you want to use Parakeet outside Colab, I'd love to hear about your use case! You can file an issue on GitHub or you can email me at [ericyu3@gmail.com](mailto:ericyu3@gmail.com). To keep things simple, Parakeet only makes suggestions when you are at the end of a line, and Parakeet never makes multi-line suggestions. How it works: Parakeet uses OpenAI's Codex model, which is the same model that powers GitHub Copilot. Parakeet does not have access to Colab's internal state. Instead, Parakeet continuously parses Colab's HTML to extract cell contents and determine what row and column your cursor is on. This approach was finicky to get working, but I was able to get it to work reliably and with little performance penalty. Your code is never stored or logged. After a suggestion is generated, the input is immediately discarded. submitted by /u/ericyu3 [link] [comments]  ( 88 min )
    Preparing Machine Learning Interview [D]
Hi everyone, I am preparing for a Machine Learning Engineer job. I am currently learning Data Structures & Algorithms along with coding problem-solving, which is documented on my GitHub. I have 4 months. I am looking for remote/onsite jobs in Europe or anywhere. Any tips and suggestions are highly appreciated. We can learn together; here is my email: [lewissarron@gmail.com](mailto:lewissarron@gmail.com), if you are interested in the same. submitted by /u/Sandwich-Express [link] [comments]  ( 85 min )
    [D] Any french Corpus like ALECTOR for simplification task?
Hello, the title says it all. I'm trying to find any resources (mainly aligned corpora) that could be helpful in identifying and simplifying complex sentences in French. ALECTOR is the only one I stumbled upon. Do you have any resources or tips? I was wondering if searching for books and their simplified versions could be useful, but I fear it would be more like learning to translate old French into modern French. submitted by /u/Sacrezar [link] [comments]  ( 85 min )
    [D] Reimplementing an Object Detection Model.
How hard is it to reimplement an object detection model and reproduce its results on benchmarks like COCO? Let's take the DINO architecture or even a YOLO v4-v7 model. How hard is it to build it from scratch and reach the COCO results reported by the paper or the official implementation? submitted by /u/SeucheAchat9115 [link] [comments]  ( 86 min )
    [R] mixed reality future — see the world through artistic lenses — made with NeRF
    submitted by /u/SpatialComputing [link] [comments]  ( 89 min )
    [D] Interpreting Attention Weights
I have seen in many papers, especially in deep learning applications in medical imaging, that attention weights are interpreted as something like interaction between features (i.e., feature interaction). But every time you train the model, wouldn't you get new weights? Then how does this interpretability hold any value if the weights keep changing every time you run it? submitted by /u/Labib666Camp [link] [comments]  ( 87 min )
    [D] What's the problem with Self-driving cars? Is it a lack of data or do we need a new technology breakthrough?
I mean, there was a time when everyone thought that in a few years we would have self-driving cars. We just need more data and computing and we'll get there. But now Google has more than 20M miles on public roads and much more in simulation. And Tesla has a lot of cars that collect data on the road. But it's still not there, so what is missing? Do we need a new technology breakthrough, or is it just more data and computing power? submitted by /u/yosefschwartz [link] [comments]  ( 109 min )
    [D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST)
    "First we should ask the question whether LLM have achieved ANYTHING, ANYTHING in this domain. Answer, NO, they have achieved ZERO!" - Noam Chomsky "There are engineering projects that are significantly advanced by [#DL] methods. And this is all the good. [...] Engineering is not a trivial field; it takes intelligence, invention, [and] creativity these achievements. That it contributes to science?" - Noam Chomsky "There was a time [supposedly dedicated] to the study of the nature of #intelligence. By now it has disappeared." Earlier, same interview: "GPT-3 can [only] find some superficial irregularities in the data. [...] It's exciting for reporters in the NY Times." - Noam Chomsky "It's not of interest to people, the idea of finding an explanation for something. [...] The [original #AI] field by now is considered old-fashioned, nonsense. [...] That's probably where the field will develop, where the money is. [...] But it's a shame." - Noam Chomsky Thanks to Dagmar Monett for selecting the quotes! Sorry for posting a controversial thread -- but this seemed noteworthy for /machinelearning Video: https://youtu.be/axuGfh4UR9Q -- also some discussion of LeCun's recent position paper submitted by /u/timscarfe [link] [comments]  ( 104 min )
  • Open

    Google AI Proposes ‘MLGO’: A Machine Learning Guided Compiler Optimization Python Framework
Since the invention of modern computers, there has been a constant demand for optimization and speedier code compilation. Large data center programs can benefit greatly from optimization, but mobile and embedded systems, as well as software installed on protected boot partitions, need reduced code size. As the area has developed, the headroom has been severely constrained by ever more complicated heuristics, preventing maintenance and further advancements. Recent studies have demonstrated that compiler optimization can significantly benefit from substituting ML strategies for complex heuristics. Nevertheless, adopting ML in general-purpose, industrial-strength compilers is still tricky. To solve this problem, a group of Google Research engineers has presented "MLGO: a Machine Learning Guided Compiler Optimizations Framework," the first broad industrial-grade framework for systematically integrating ML approaches with LLVM. LLVM is a well-known open-source industrial compiler infrastructure used to build critical high-performance software. To train neural networks to make decision policies that can replace heuristics in LLVM, MLGO uses reinforcement learning. The team has disclosed two MLGO optimizations for LLVM: the first uses inlining to reduce code size, and the second uses register allocation to enhance code performance. Both optimizations are available in the LLVM source and have been used in real-world applications. Continue reading | Checkout the paper, github, demo and ref article. submitted by /u/ai-lover [link] [comments]  ( 85 min )
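To illustrate the core idea in miniature (a toy sketch that resembles nothing of MLGO's actual implementation; features and reward are made up): a policy maps call-site features to an inline/don't-inline probability and is updated with REINFORCE from a reward such as negative code-size growth.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(4)                    # linear policy over 4 hypothetical features

def act(features):
    p = 1 / (1 + np.exp(-features @ w))       # P(inline | features)
    return bool(rng.random() < p), p

def reinforce_update(features, inlined, p, reward, lr=0.1):
    """Policy-gradient step: push the log-prob of the taken action toward reward."""
    global w
    grad_logp = (1.0 - p) * features if inlined else -p * features
    w += lr * reward * grad_logp

feats = np.array([1.0, 0.2, -0.5, 0.0])       # stand-in call-site features
inlined, p = act(feats)
reinforce_update(feats, inlined, p, reward=-0.3)  # e.g. code size grew
```

The full framework trains such policies at the scale of whole compilations rather than single decisions, but the decision-policy shape is the same.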
[Question] Using planning on vimgolf (fewest vim commands to produce a given text) - feasibility and design
Today I learned about https://www.vimgolf.com/ and thought that it looked somewhat like a planning problem (find the shortest sequence of text-manipulating commands that produces a certain text, the goal state). So I want to try to use planning algorithms to solve vimgolf problems. Questions (for details, see the spec below): 1) Are current planning frameworks able to solve these problem instances, considering that there are 10-20 vim commands I want to initially support, and problem instances like this: https://www.vimgolf.com/challenges/9v00619554dd000000000216? 2) Vim commands can be composed; for example, 2dw deletes the next word, two times. One way to model this is that the agent could use the action 2, then d, then w, where only the last action transforms the text (by deleting two words). Is that a good idea? 3) Which tools could I use for this task? So far, fast-downward seemed to be an option, or using one of the many solvers for STRIPS. However, I am a bit lost - I don't want any fancy stuff, I just want a planner that outputs a short sequence of vim commands. Spec: Input: text A and text B. Output: a sequence of vim commands that transforms A to B. Minimize: length of the command sequence. How I want to model the state: Text: String[][] lines; Int cursor; // maybe some vim-specific state, like the current mode. How I find out if I come closer to the goal state: I thought about using some metric that uses 1) the Levenshtein distance of the text and 2) the number of lines. submitted by /u/Proper_Elk_1726 [link] [comments]  ( 85 min )
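A rough feasibility sketch of question 3, under heavy simplifications (only three commands h, l, x, and Levenshtein distance as the A* heuristic; everything here is illustrative):

```python
import heapq

def lev(a, b):
    """Levenshtein distance, used as the heuristic toward the goal text."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def step(state, cmd):
    text, cur = state
    if cmd == "h": return (text, max(cur - 1, 0))                       # move left
    if cmd == "l": return (text, min(cur + 1, max(len(text) - 1, 0)))   # move right
    return (text[:cur] + text[cur + 1:], min(cur, max(len(text) - 2, 0)))  # x: delete

def plan(start, goal):
    pq, seen = [(lev(start, goal), 0, (start, 0), [])], set()
    while pq:
        f, g, state, cmds = heapq.heappop(pq)
        if state[0] == goal:
            return cmds
        if state in seen:
            continue
        seen.add(state)
        for c in "hlx":
            ns = step(state, c)
            heapq.heappush(pq, (g + 1 + lev(ns[0], goal), g + 1, ns, cmds + [c]))

print(plan("haello", "hello"))   # ['l', 'x']
```

Scaling this to 10-20 real commands (with counts, registers, and modes) is where a planner like fast-downward and a proper STRIPS encoding would earn their keep.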
    AI Generated Art with starryai
    With starryai you can generate art inspired by real life artists on your phone! submitted by /u/Keni9089 [link] [comments]  ( 84 min )
    mixed reality future — see the world through artistic lenses — made with NeRF
    submitted by /u/SpatialComputing [link] [comments]  ( 86 min )
    Created a completely AI generated comic page, images are all from different Midjourney prompts and the text is from OpenAI. I just stitched the various images together in Photoshop and added the text.
    submitted by /u/Albertrech [link] [comments]  ( 86 min )
    AI Dream 47 - Sacred House of Spirits vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 84 min )
    Fairy's Pure Beauty | Cinematic 4K 24 FPS (FILM)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
  • Open

    Adding a tiny trap to stop chaos
The tent map is a famous example of a chaotic function. We will show how a tiny modification of the tent map retains continuity of the function but prevents chaos. The tent map is the function f: [0, 1] → [0, 1] defined by f(x) = 2x for x ≤ 1/2 and f(x) = 2(1 − x) for x > 1/2. This map has an unstable fixed point at x0 = 2/3.  […] Adding a tiny trap to stop chaos first appeared on John D. Cook.  ( 6 min )
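A quick numeric illustration of the sensitivity at stake (20 iterations are enough to amplify a 1e-6 gap by roughly a factor of 2^20):

```python
def tent(x):
    return 2 * x if x <= 0.5 else 2 * (1 - x)

x, y = 0.2, 0.2 + 1e-6
for _ in range(20):
    x, y = tent(x), tent(y)
print(abs(x - y))   # no longer tiny: the gap has grown by ~2**20
```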
    Ducci sequences
    Pick four integers a, b, c, and d. Now iterate the procedure that takes these four integers to |b – a|, |c – b|, |d – c|, |a – d| You could think of the four integers being arranged clockwise in a circle, taking the absolute value of difference between each number and its neighbor […] Ducci sequences first appeared on John D. Cook.  ( 5 min )
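The iteration itself is a couple of lines of Python; for quadruples of integers, it famously reaches (0, 0, 0, 0) after finitely many steps:

```python
def ducci(t):
    a, b, c, d = t
    return (abs(b - a), abs(c - b), abs(d - c), abs(a - d))

t = (1, 5, 17, 4)
while t != (0, 0, 0, 0):
    print(t)
    t = ducci(t)
print(t)
```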
  • Open

An example on a real photo; the algorithm was not designed for real photos.
    submitted by /u/vlad_ma [link] [comments]  ( 84 min )
    Free Idea: Detecting GPT-3 Plagiarism, with GPT-3?
    submitted by /u/Gereshes [link] [comments]  ( 84 min )
    Can you give me some pointers
The point is for the players that spawn at the bottom to go to the target. But at the moment they basically go in straight lines somewhere around the target. In the future I want to control the target with the mouse, so I want the dots to follow the target like planets spinning around a black hole (basically they should follow the target, not snipe it). The target moves by bouncing off the walls diagonally. I chose the neural net structure as 6, 8, 4. The processing is done by this:

ArrayList<Float> process(PVector pos, PVector vel, PVector acc) {
  PVector target = new PVector(Main.goal.x, Main.goal.y);
  ArrayList<Float> input = new ArrayList<Float>(Arrays.asList(pos.x, pos.y, vel.x, vel.y, target.x, target.y));
  for (int i = 0; i < input.size(); i++) {
    if (input.get(i) >= 1) input.set(i, 1f);   // clamp inputs
  }
  return input;
}

And the 4 nodes as output:

ArrayList<Float> ans = nn.process(pos, vel, Main.goal);
nn.step++;
// Interpret ans
float up = ans.get(0);
float down = ans.get(1);
float right = ans.get(2);
float left = ans.get(3);
int x, y;
if (up > down) x = -1; else x = 1;
if (right > left) y = 1; else y = -1;
if (up == down) y = 0;
if (right == left) x = 0;
acc = new PVector(x, y);
vel.add(acc);
vel.limit(5);
pos.add(vel);

The formula I use at the moment is:

public void calculateFitness() {
  // closeToGoal means how many times the dot was closer than 10 to the target
  if (reachedGoal) {
    fitness = 5000 + 10f * closeToGoal;
  } else {
    fitness = 10 * closeToGoal;
  }
}

IN THE END: I want some suggestions for changing the formula, maybe the structure, or something else. submitted by /u/LaserDenis [link] [comments]  ( 85 min )
  • Open

    Why do Policy Gradient Methods work so well in Cooperative MARL? Evidence from Policy Representation
    In cooperative multi-agent reinforcement learning (MARL), due to its on-policy nature, policy gradient (PG) methods are typically believed to be less sample efficient than value decomposition (VD) methods, which are off-policy. However, some recent empirical studies demonstrate that with proper input representation and hyper-parameter tuning, multi-agent PG can achieve surprisingly strong performance compared to off-policy VD methods. Why could PG methods work so well? In this post, we will present concrete analysis to show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, VD can be problematic and lead to undesired outcomes. By contrast, PG methods with individual policies can converge to an optimal policy in these cases. In addition, PG methods wit…  ( 5 min )
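A concrete miniature of the multi-modality failure mode (an illustrative sketch, not from the post): in the 2-agent matrix game below, the optimal joint actions are (0, 1) and (1, 0), and no additive decomposition Q(a1, a2) = Q1(a1) + Q2(a2) can fit the payoff matrix, which is the structural limit VD methods run into.

```python
import numpy as np
from itertools import product

R = np.array([[0.0, 1.0],
              [1.0, 0.0]])             # reward for joint action (a1, a2)

grid = np.linspace(0, 1, 5)            # coarse search over per-agent values
best_err = np.inf
for q1 in product(grid, repeat=2):
    for q2 in product(grid, repeat=2):
        Q = np.add.outer(np.array(q1), np.array(q2))   # additive joint value
        best_err = min(best_err, np.abs(Q - R).max())
print(best_err)                        # 0.5: no additive fit matches both optima
```

A per-agent policy trained with PG, by contrast, can simply commit to one of the two optimal modes.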
  • Open

    Help with navigating a non-changing 3D environment with only camera / pixel information
    Hello! I am trying to train an agent to navigate to a specific point in a 3d environment. The agent will start off at a random location in the environment and must navigate to the same goal each time. The agent only has access to a front facing camera, so no collision / environment data The action space is Forward, left, backward, and right, along with look left and look right The observation space is a 200x200x1 image (grayscale) of what the agent can see Right now, a positive reward is given for movement and a negative reward is given for cancelling movement commands (e.g. trying to move forward and backward at the same time). A large positive reward is given if it reaches the goal. I am training with A2C and CNN using stable baselines. How do I go about incentivizing the agent to explore the environment? With the current reward function, it eventually just defaults to moving in 1 direction to accumulate reward. Are there any algorithms that can use the observation space and determine if the agent has already been at that specific location? Then I could assign a negative reward to staying in the same spot / getting stuck, which should allow it to eventually find the goal location Thanks for any tips / resources in advance! submitted by /u/Sandals5476 [link] [comments]  ( 87 min )
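One simple option (a sketch; the bonus size, grid resolution, and hashing scheme are all made-up choices to tune): grant an episodic novelty bonus keyed on a coarse hash of the downsampled frame, so that pacing through already-seen views earns nothing.

```python
import numpy as np

class NoveltyBonus:
    def __init__(self, bonus=0.1, grid=8):
        self.seen, self.bonus, self.grid = set(), bonus, grid

    def __call__(self, obs):
        """obs: (H, W) grayscale frame; returns a one-time bonus for new views."""
        g = self.grid
        h, w = (obs.shape[0] // g) * g, (obs.shape[1] // g) * g
        small = obs[:h, :w].reshape(g, h // g, g, w // g).mean(axis=(1, 3))
        key = tuple((small > small.mean()).flatten().tolist())  # coarse binary hash
        if key in self.seen:
            return 0.0
        self.seen.add(key)
        return self.bonus

# reward = env_reward + novelty(observation); reset novelty.seen each episode
```

Learned alternatives in the same spirit (random network distillation, ICM-style curiosity) replace the hash with a prediction error, but the hash version is a cheap first experiment.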

  • Open

    280+ AI tools for digital artists
280+ AI tools for artists in one place. AI Library for artists. Our team has created the biggest library of AI tools for digital artists, NFT creation, and metaverse content. It's free and updated daily. We would really appreciate your feedback. New mind-blowing tools appear every day, so we decided it would be useful to have a single place with all of them together, with descriptions and examples. At the end of July we will run a series of free workshops on how AI can be used by artists, so if you are interested in attending and trying some tools, please join our waitlist; we will announce the alpha soon. submitted by /u/Worldly_Apricot_1512 [link] [comments]  ( 84 min )
A girl in action!
    submitted by /u/VIRUS-AOTOXIN [link] [comments]  ( 83 min )
    Why does VQGAN+CLIP produce much worse results than Dalle-mini?
Both models use the vqgan_imagenet_f16_16384 model. I'm not sure what Dalle-mini does differently, but the results it produces are so much better. VQGAN+CLIP produces results that don't have anything in focus, even if the prompt is just a single object. I'm not sure if this is because of the augmentation randomization (affine, sharpness, color jitter) or not. For example, here are both models' results on the prompt "an art deco car driving down the street": dalle-mini: [image]. vqgan+clip: [image]. What even is this? And why does it keep producing abstract-looking art? submitted by /u/impurekitkat [link] [comments]  ( 84 min )
    "Voldemort" AI Art created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 84 min )
    First-Ever Course on Transformers: NOW PUBLIC
CS 25: Transformers United Did you grow up wanting to play with robots that could turn into cars? While we can't offer those kinds of transformers, we do have a course on the class of deep learning models that have taken the world by storm. Announcing the public release of our lectures from the first-ever course on Transformers: CS25 Transformers United (http://cs25.stanford.edu) held at Stanford University. Our intro video is out and available to watch here 👉: YouTube Link Bookmark and spread the word 🤗! (Twitter Thread) Speaker talks out starting Monday ... submitted by /u/DragonLord9 [link] [comments]  ( 84 min )
    Who wants an invite to midjourney
    I have tons of invites to hand out so who needs one! submitted by /u/Concept_Sir [link] [comments]  ( 85 min )
    Do we need AI to be able to handle the huge amount of scientific information?
Scientific knowledge is increasing exponentially, and the amount of research papers published on any given day is really too much for a human to digest. Once AI has become more transparent, error-free, and intelligent, I can imagine it being helpful for handling lots of data... Are there any approaches for using AI to handle that kind of information? submitted by /u/greentea387 [link] [comments]  ( 88 min )
    The many faces of Bozzer 🇬🇧
    submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    AI can use your brainwaves to see things that you can't
    submitted by /u/jormungandrsjig [link] [comments]  ( 84 min )
    Oil On Canvas Painting of Beautiful Scenery | 4K 24 FPS (FILM)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
    made with StarryAI
    submitted by /u/rikusorasephiroth [link] [comments]  ( 83 min )
    Next-Level Model Investigation: Midjourney, Disco Diffusion, DALL-E Flow
    submitted by /u/laul_pogan [link] [comments]  ( 83 min )
    Is there an AI that generates new words/languages?
Like the title says, I was wondering if something like that exists. All I could find by searching is about generating normal English text. I don't have the skills right now to do it myself (I wish I had them, really) and I was wondering if maybe something like that already exists. submitted by /u/IKB191 [link] [comments]  ( 84 min )
  • Open

    Guided Cost Learning
Hello everyone, I have a question regarding the IRL algorithm proposed by Finn et al. in [1]: I was wondering if the method is model-based or model-free IRL? To my knowledge, model-based methods are methods which model the transition probability p(s'|s, a). In different papers I find this method classified both as a model-based [2] and a model-free method [3]. The method assumes unknown dynamics of the system => would be a model-free approach. However, the method is based on maximum entropy IRL optimization and guided policy search RL, which are both model-based approaches. Maybe I have mixed up some of the stuff, and sorry for that. Any help would be greatly appreciated 😅 Thanks in advance. [1] https://arxiv.org/abs/1603.00448 [2] https://www.sciencedirect.com/science/article/abs/pii/S1367578820300511 [3] https://link.springer.com/article/10.1007/s11063-017-9702-7 submitted by /u/nuki96 [link] [comments]  ( 85 min )
    Why are Multi Arm Bandits Important?
Hey guys, I'm starting out my work on multi-armed bandits and their applications, and I find it to be extremely theoretical. A lot of algorithms require strict assumptions to work. Why is MAB considered important when a lot of deep RL tasks perform well on real-world scenarios (without the need for regret bounds)? What is the application of MAB outside theoretical proofs? The topic seems to be more mathematics than CS, so I'm curious how people feel. I know there are applications in scheduling and operations research, but do you think the theoretical aspects of MAB can be used to improve deep RL tasks like games? And those of you working on the topic: have you tried deep RL, and if so, what do you think of your work and what's been done there? Is MAB a completely separate field? For example, I've seen computer vision, NLP, and deep RL being combined in any order, but none of them have anything to do with bandits. Do you think these topics could find a common application? What is the current research trend? I've not seen papers use anything but UCB or Thompson Sampling, so what do you work on? Finally, are there any recent works where MAB is combined with deep learning? I'm trying to find a balance between the theory and research, but I'm finding the proofs and bounds to be a tedious task. submitted by /u/Bibbidi_Babbidi_Boo [link] [comments]  ( 85 min )
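For reference, UCB1, one of the two algorithms the question mentions, is only a few lines; the square-root term is exactly the regret-bound machinery, yet the algorithm itself is trivially practical:

```python
import math, random

def ucb1(pull, n_arms, horizon):
    counts, sums = [0] * n_arms, [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1            # play each arm once to initialize
        else:
            arm = max(range(n_arms), key=lambda a:
                      sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts

# Two Bernoulli arms with means 0.3 and 0.7; pulls concentrate on arm 1.
print(ucb1(lambda a: float(random.random() < [0.3, 0.7][a]), 2, 2000))
```

As for crossovers with deep learning: contextual bandits with neural reward models are one active meeting point of the two literatures.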
    Deepmind AI Researchers Introduce ‘DeepNash’, An Autonomous Agent Trained With Model-Free Multiagent Reinforcement Learning That Learns To Play The Game Of Stratego At Expert Level
For several years, the Stratego board game has been regarded as one of the most promising areas of research in Artificial Intelligence. Stratego is a two-player board game in which each player attempts to take the other player's flag. There are two main challenges in the game. 1) There are 10^535 potential states in the Stratego game tree. 2) Each player in this game must consider 10^66 possible deployments at the beginning of the game. Due to the various complex components of the game's structure, the AI research community has made minimal progress in this area. This research introduces DeepNash, an autonomous agent that can develop human-level expertise in the imperfect information game Stratego from scratch. Regularized Nash Dynamics (R-NaD), a principled, model-free reinforcement learning technique, is the prime backbone of DeepNash. DeepNash achieves an ε-Nash equilibrium by integrating R-NaD with a deep neural network architecture. A Nash equilibrium ensures that the agent will perform well even when faced with the worst-case opponent. The Stratego game and a description of the DeepNash technique are shown in Figure 1. Continue reading | Checkout the paper submitted by /u/ai-lover [link] [comments]  ( 85 min )
    [D] How to disable an action for a step
Some actions cannot be taken, and the env provides this information. For example, in SC2 the agent cannot train a unit if it doesn't have enough resources. How do I prevent the agent from taking invalid actions during exploration? I don't want to punish it with a negative reward, because it may learn that the action is bad in general. submitted by /u/CppMaster [link] [comments]  ( 85 min )
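The usual trick is action masking (a framework-agnostic sketch; assumes at least one action is always valid): set the logits of invalid actions to -inf before the softmax, so they can never be sampled and contribute no gradient.

```python
import numpy as np

def masked_sample(logits, valid_mask, rng=np.random.default_rng()):
    """logits: (n_actions,); valid_mask: boolean array, True = allowed."""
    masked = np.where(valid_mask, logits, -np.inf)
    masked = masked - masked.max()        # numerical stability
    probs = np.exp(masked)                # exp(-inf) = 0: invalid actions vanish
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.array([2.0, 0.5, 1.0])
print(masked_sample(logits, np.array([True, True, False])))  # never returns 2
```

Unlike a negative reward, masking encodes "impossible right now" rather than "bad in general", which matches the SC2 resource example.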
I want to learn RL for a project; can you suggest some sources from which I can learn?
I want a crash course or something, so that I can just get the knowledge that I need to apply in my project. submitted by /u/RightLemon8889 [link] [comments]  ( 84 min )
In general, are there any specific advantages of Multi-Agent Reinforcement Learning over simple RL in terms of convergence, variance/bias, or any other metric?
For example, whether multiple agents can coordinate and learn a better policy in a large state-action space. Also, does MARL improve stability, robustness, etc.? submitted by /u/aabra__ka__daabra [link] [comments]  ( 84 min )
  • Open

    [P] CaiT Implementation in Flax
    An open-source implementation of the Going deeper with Image Transformers research paper in Google's JAX and Flax. "The paper also notes the difficulty in training vision transformers at greater depths and proposes two solutions. First, it proposes to do per-channel multiplication of the output of the residual block. Second, it proposes to have the patches attend to one another, and only allow the CLS token to attend to the patches in the last few layers." - Lucid Github repository for the Flax / JAX model: https://github.com/conceptofmind/CaiT-Flax CaiT Research Paper: https://arxiv.org/abs/2103.17239 Official PyTorch repository: https://github.com/rwightman/pytorch-image-models In collaboration with Lucid: https://github.com/lucidrains submitted by /u/EnricoShippole [link] [comments]  ( 85 min )
    [N] Designing Arithmetic Circuits with Deep Reinforcement Learning | NVIDIA Technical Blog
    submitted by /u/norcalnatv [link] [comments]  ( 86 min )
[R] How to use ML to predict the remaining lifetime of a physical asset if the input data has had all its failed samples scrubbed away?
    So I'm in a bit of a conundrum. I'm working on my PhD thesis regarding the management of physical assets (make a decision on whether to replace the asset or refurbish it or to leave it alone). The first step to doing this is to predict the estimated time of life for each asset and I wish to use ML to do this. Each asset in my dataset has an installation date and a couple of input features (results of testing, characteristics of the asset, etc) The problem is the dataset I have doesn't have any of the failed assets. Meaning that I am finding it very hard to set up an error term for the estimated time of life during training of the model. Ideally, I should have failed samples and non-failed samples in my data but I only have the latter. How should I go about setting this up? I've been trying for the past couple of months to get my hands on failed samples but I haven't had any luck. submitted by /u/DrSkoolie [link] [comments]  ( 93 min )
    [P] May the best explanation win: A tutorial on benchmarking and tuning model explanations with pytorch-grad-cam
The new release of the pytorch-grad-cam project focuses on metrics for the model explanations. It's often exciting to see model explanations, and tempting to interpret them and get insights about what the model is doing. And a lot of the time it is very useful. However, this has to be done with care - the model explanations can be wrong or suboptimal. As shown in many papers, sometimes random explanations perform better. So it's useful to have metrics that measure the quality of the explanations for an image, and sanity checks about them. This can be used both for getting some trust in the explanation before using it, and for tuning the explanation and getting the best one for a given image (for example by checking different methods). This notebook gives a thorough overview of the different metrics used in the literature, issues with them, using sanity checks (like the Sobel edge detector, or a random CAM), and most importantly shows how to use them to choose and tune the explanation in practice. https://github.com/jacobgil/pytorch-grad-cam/blob/master/tutorials/CAM%20Metrics%20And%20Tuning%20Tutorial.ipynb The motivation here is both to make it easier for researchers to benchmark new algorithms, and also (maybe more importantly), when using the model explanations, to tune them, get the most out of them, and find problems with them. submitted by /u/jacobgil [link] [comments]  ( 86 min )
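A minimal usage sketch for orientation (the library's API has changed across versions, so treat the exact imports and signatures as assumptions and check the repo):

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

model = resnet50(pretrained=True).eval()
cam = GradCAM(model=model, target_layers=[model.layer4[-1]])

input_tensor = torch.randn(1, 3, 224, 224)      # stand-in for a real image batch
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(281)])  # ImageNet class 281
print(grayscale_cam.shape)                      # (1, 224, 224) heatmap
```

The tutorial's metrics then score heatmaps like this one, which is what lets you compare GradCAM against a random CAM or a Sobel baseline before trusting it.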
    [R] PrefixRL: Optimization Of Parallel Prefix Circuits Using Deep Reinforcement Learning
    submitted by /u/EducationalCicada [link] [comments]  ( 85 min )
    [N] First-Ever Course on Transformers: NOW PUBLIC
CS 25: Transformers United Did you grow up wanting to play with robots that could turn into cars? While we can't offer those kinds of transformers, we do have a course on the class of deep learning models that have taken the world by storm. Announcing the public release of our lectures from the first-ever course on Transformers: CS25 Transformers United (http://cs25.stanford.edu) held at Stanford University. Our intro video is out and available to watch here 👉: YouTube Link Bookmark and spread the word 🤗! (Twitter Thread) Speaker talks out starting Monday ... submitted by /u/DragonLord9 [link] [comments]  ( 89 min )
    [D] When did tech companies start to publish ML papers and why?
I never fully understood the need for tech companies to publish research papers at big conferences. I think before the 2000s, tech companies were very secretive about their work. I mean, you wouldn't expect Microsoft to publish research on their own motherboards at a conference, right? Nowadays all of them are trying to advertise their latest tech in research papers that could possibly be replicated by anyone around the world. This is especially visible in ML. It also almost seems as if they don't have a goal in mind. A lot of the research papers (outside of big models such as DALL-E) seem very random to me, hardly even related to their business interests. How did it become this way, and what is their motivation? submitted by /u/fromnighttilldawn [link] [comments]  ( 99 min )
  • Open

    The neural network that I promised
https://sourceforge.net/projects/image-enlarger-free/ submitted by /u/vlad_ma [link] [comments]  ( 84 min )
  • Open

    Privacy-Preserving Synthetic Educational Data Generation. (arXiv:2207.03202v1 [cs.CY])
    Institutions collect massive learning traces but they may not disclose it for privacy issues. Synthetic data generation opens new opportunities for research in education. In this paper we present a generative model for educational data that can preserve the privacy of participants, and an evaluation framework for comparing synthetic data generators. We show how naive pseudonymization can lead to re-identification threats and suggest techniques to guarantee privacy. We evaluate our method on existing massive educational open datasets.  ( 2 min )

  • Open

    Is it possible to find a job in AI that is flexible enough I can pick up my three young children from school and not work from 2-5 M-F
    submitted by /u/CloudAtlas-2019 [link] [comments]  ( 84 min )
An AI for students' success prediction in academics.
As the title asks: is there? submitted by /u/Psychological_Ad5132 [link] [comments]  ( 85 min )
I completed my postgraduate degree in Cognitive Neuroscience and I'm really interested in AI
I want to select a good, interesting topic in artificial intelligence. My background is cognitive neuroscience, so I want some good topics that the AI field is still lacking in. I thought of causal reasoning, or other topics from the cognitive psychology side that could help, at least in theory, to be implemented in AI in the future. What do you guys think about causal reasoning topics? Do you think AI lacks in that area? submitted by /u/Cute_Understanding89 [link] [comments]  ( 86 min )
    Which countries have the highest demand for NLP engineers?
    I'm an AI master's student who is soon going to graduate. Although I have dealt with image processing and time series, I mainly focused on NLP when it came to projects. Hence, I am looking for employment in an environment that plays to my strengths. I am interested in hearing both personal opinions and hard data about which countries have a high demand for natural language processing. submitted by /u/Blutorangensaft [link] [comments]  ( 84 min )
    A small example from Tacotron2 trained on Brandon "Atrioc" Ewing
    submitted by /u/Phat_N_Sassy33 [link] [comments]  ( 84 min )
    Meta releases open source audio AI systems for more realistic VR and AR sound
    submitted by /u/henlo_there_fren [link] [comments]  ( 84 min )
    Systems courses for ml
Hello everyone. Hope you are having a great time. I recently started a minor in CS. My final goal is to shift to ML and AI research or to work in industry. In my minor courses I can choose one systems course, like computer systems as a prerequisite for operating systems, for example. Though I'm choosing analysis of algorithms as it's obviously more important, I want to know: how important is it for someone who wants to work in AI as a machine learning engineer, data scientist, or researcher to take systems courses? Would appreciate any answer. submitted by /u/BeneficialCharity8 [link] [comments]  ( 84 min )
    Analogybot.wtf: generate strange, funny, non-sensical and sometimes frighteningly accurate analogies with DaVinci AI.
    submitted by /u/syverlauritz [link] [comments]  ( 84 min )
    Moving Beyond Mimicry in Artificial Intelligence
    submitted by /u/estasfuera [link] [comments]  ( 83 min )
    Need advice to research!
Hello everyone, I want to do some research and publish a paper ASAP in anomaly/fault detection using DL/NN, and I'm new to publications. I'm going through a lot of papers from conferences (A*) to analyze, but I keep ending up in a constant loop. Could anyone please provide your insights on how to target a conference with novel problems? submitted by /u/Bugfixer231 [link] [comments]  ( 84 min )
    Is there an AI that describes images?
Like the title suggests, is there an AI that detects the objects and the overall situation in a picture and puts them into words? submitted by /u/tastyogurt [link] [comments]  ( 84 min )
    UC Berkeley Researchers Introduce ‘Autocast’, A New Dataset For Measuring Machine Learning ML Models’ Forecasting Ability
In this research article, researchers from UC Berkeley demonstrate that retrieving from a sizable news corpus can effectively train language models on past forecasting problems. Forecasting is a process that makes educated projections about the direction of future trends using previous data as inputs. Forecasting future events in the real world, including pandemics, the economy, or the environment, is still complex but essential. Because dynamic information processing is a crucial component of efficient forecasting, AI researchers are considering using strong large-scale language models to automate these processes. In the new paper Forecasting Future World Events with Neural Networks, the researchers present a dataset with tens of thousands of forecasting questions and a date-based news corpus. They also curate IntervalQA, a dataset of numerical questions and metrics for calibration. Continue reading | Checkout the paper and github submitted by /u/ai-lover [link] [comments]  ( 85 min )
    Ominous Escapade | Dark Galaxy | Raw UNSCALED (FILM)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
    The beginning of data-centric AI with data programming. What is data-centric AI?
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 84 min )
  • Open

    [D] How to evaluate a neural network in reverse?
    Say you have a neural network with 3 inputs, some hidden layers, and a single output. There might be many sets of those 3 inputs that give you the same output value. How can you evaluate this network in reverse, i.e. given an output value, find values of the 3 inputs that would yield that output? submitted by /u/zxkj [link] [comments]  ( 88 min )
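The standard approach is inversion by optimization (a sketch): freeze the weights and run gradient descent on the input itself until the output matches the target. Different random initializations typically recover different input triples that all map to (approximately) the same output.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(),
                          torch.nn.Linear(16, 1))
for p in net.parameters():
    p.requires_grad_(False)                  # the weights stay fixed

target = torch.tensor([[0.3]])
x = torch.randn(1, 3, requires_grad=True)    # the inputs are what we optimize
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    loss = (net(x) - target).pow(2).mean()
    loss.backward()
    opt.step()

print(x.detach(), net(x).item())             # one input triple yielding ~0.3
```

Restarting from many random x values (or adding a prior/regularizer on x) maps out the preimage set the question asks about.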
    [Discussion] How do I smoothen the output of an action segmentation model near the boundaries?
Hello. Apologies if this is the wrong place to post, because my problem is a simple one related to machine learning. My problem involves a robot that operates given the output of an action segmentation model. The trained model outputs an action label at every timestep, e.g. [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 3, 3]. Given the sequence, I must now compute the time it takes to go from one label to the next. However, the actual output tends to be quite unstable, especially when the action transitions from one to the next, e.g. [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, ....]. As such, I occasionally get multiple transitions. This simple issue makes the input unusable for the robot. How can I clean up the output sequence? I thought of simple operations like converting the vector into a one-hot matrix and then running 1D erode and dilate operations, but I was hoping to hear other, better suggestions. submitted by /u/applied-roboticist [link] [comments]  ( 87 min )
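One common cleanup (a sketch, not the only option) is a sliding-window mode filter over the label sequence: it removes isolated flickers near boundaries while leaving genuine transitions intact, much like the erode/dilate idea but in one pass.

```python
import numpy as np

def mode_filter(labels, window=5):
    """Replace each label with the most frequent label in its window."""
    labels = np.asarray(labels)
    pad = window // 2
    padded = np.pad(labels, pad, mode="edge")
    out = np.empty_like(labels)
    for i in range(len(labels)):
        vals, counts = np.unique(padded[i:i + window], return_counts=True)
        out[i] = vals[counts.argmax()]
    return out

seq = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 3, 3, 3]
print(mode_filter(seq))   # the stray 1 inside the 0-run is smoothed away
```

If transitions must be even cleaner, a small HMM or simply enforcing a minimum segment length are the usual next steps.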
    [R] Single-task Continual/Incremental/Online/Life-Long learning.
    Hi everyone, I am new to the domain of continual learning/incremental learning/online learning/life-long learning (honestly, not able to make out the difference between them) and I would like to know if there exists a single-task life-long learning domain/problem. All the papers that I have gone through consist of methods trained for multiple tasks where newer tasks are added over time. I am looking for models trained for a single task that can be updated over time with new data belonging to the same task. I already have a trained model that I would like to update over time with either single or multiple data points. Any related links or directions would be greatly appreciated. TIA. submitted by /u/RohitDulam [link] [comments]  ( 86 min )
    [D] Thoughts on the autonomous vehicle (AV) field
    Curious to hear what people think of the future and current autonomous vehicle tech. Is it here to stay? Are we 5, 10 or 20+ years from true AV? What's the upside to society? Is it a worthwhile ML and AI research investment with potential benefits to other application areas? submitted by /u/purplebrown_updown [link] [comments]  ( 86 min )
    [P] Chart and Data Summarization
    I made an app that summarizes the data in csv files. Input a csv file and title of the file and the model will generate a summary. https://huggingface.co/spaces/saadob12/Chart_Data_Summarization The models: https://huggingface.co/saadob12/t5_C2T_autochart and https://huggingface.co/saadob12/t5_C2T_big. submitted by /u/QadriShyaari [link] [comments]  ( 85 min )
    [R] DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft 2022
    Paper: https://arxiv.org/pdf/2207.00032.pdf Abstract: The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extre…  ( 87 min )
    [D] Searching for a paper on equivalent transformations on trained networks
    I had come across a paper that explored strategies to transform the architecture of a trained neural network, i.e. increasing layer width or adding additional layers, without forgetting what the network has already learnt. They describe initialization strategies to accomplish this. Does anyone know the paper I am talking about? submitted by /u/kniranjankumar [link] [comments]  ( 86 min )
    [Discussion] Giving a machine learning presentation to laypeople
Hello all, I've been asked to deliver a machine learning presentation to cardiologists and doctors; obviously they have no prior expertise in this area. I wondered if anyone else has experience presenting machine learning to laypeople. Just looking for some ideas really: what would you cover? What examples would you give? How would you structure it? Any help is always appreciated! [Edit #1] Thank you for the help everyone, this is some really useful feedback that I will take on board. submitted by /u/MidnightMaverick [link] [comments]  ( 91 min )
    [P] Detection by position rather than looks?
I am working on a project that needs to decide which olive tree branches should be cut. The goal is to detect a specific type of branch (watersprouts). The problem I'm facing is that I'm unsure whether I should use object detection (image classification + localization) or image segmentation. The difference between branches is mostly in their position, with watersprouts growing mostly vertical to the main branch (there is a very small difference in looks between watersprouts and other branches), while other branches can grow in all directions (mostly parallel to the main branch). My plan was to use object detection so I can classify watersprouts and localize them in the picture. I think that segmentation is overkill for this problem because I don't see the need for localizing every pixel. The plan was to take pictures of watersprouts as class 1 and other branches as class 2, and train on them so I can detect and localize them. Once I localize them, I can see which of these branches is a watersprout and which is a regular branch, and then I know that the watersprout should be cut. The other problem I have is with understanding whether it is possible for my machine learning project to recognize watersprouts not by their looks but by their position relative to the main branch, and correctly differentiate them from other branches, because this is the main difference between a watersprout branch and a regular branch. My understanding is that the network learns what the object looks like and that position doesn't matter. Am I on the right track, or am I missing something? submitted by /u/Greckon121 [link] [comments]  ( 88 min )
    [P] Sioyek 1.4 | Academic PDF Viewer
    During my PhD, I developed an open source PDF viewer to help me with my research. I think it can be useful for the users of this sub. Some of the research-oriented features include: Quickly jump to or preview references (for example Figure 3.1 for a figure or [8] for a reference). Works even if the document doesn't have links. Search paper names in google scholar by middle clicking on them (combined with the previous feature makes finding papers super fast) Searchable highlights/bookmarks Line-by-line highlighting for reduced eye strain (video) Synctex Support Extensible using external scripts (see this post for some examples) And many other features which are explained in the github page including marks, history, portals, searchable table of contents, automatic table of contents generation, searchable previous documents, etc. Here is a video demo of some of the features: https://www.youtube.com/watch?v=yTmCI0Xp5vI&t=3s And here is the latest release: https://github.com/ahrm/sioyek/releases/tag/v1.4.0 Disclaimer: I did introduce sioyek in this subreddit about a year ago, but it has changed a lot since then and some of the features suggested in the comments of last year's post are implemented, so I thought users of this subreddit might be interested in an update. submitted by /u/highergraphic [link] [comments]  ( 88 min )
    [D] when do eccv meta-reviews come out?
    I know the result from the link in the email but cmt still says "awaiting decision" call me antsy but I just want to see the final comments and meta-review... did it take this long last year? submitted by /u/gnohuhs [link] [comments]  ( 87 min )
    [R] NeurIPS2022’s Natural Language for Optimization (NL4Opt) competition!
We invite you to join our NL4Opt competition, which will be part of NeurIPS 2022. We have a novel, never-before-seen NLP dataset, in hopes of making optimization solvers more accessible and usable. The competition aims to allow non-experts to use optimization tools in their decision-making. This competition is split into two main tasks: NER and generation. We have provided baselines for each to kick-start your implementation. We will award a total of $22,000 USD evenly across the two tasks. We will also be hosting a workshop at the end of the competition and will be inviting experts and winners as podium speakers. Additionally, we plan to host poster sessions for participants to share their solutions. The competition tentatively runs from July 1st to October 15th, with the submission portal opening on July 15th. We look forward to your participation: you can register (https://nl4opt.github.io/participate/) and our organizers will be in touch with you shortly. For more information regarding the competition details, schedule, eligibility, rules, FAQs, and to get started, visit our competition website linked below! Follow our social media and GitHub discussion forum to keep updated. If you have any questions, please take a look at the FAQ section of our website. For any unanswered questions, feel free to start a discussion on the GitHub forum. Twitter: https://twitter.com/NL4Opt Website: https://nl4opt.github.io/ GitHub discussion forum: https://github.com/nl4opt/nl4opt-competition/discussions We look forward to your participation, NL4Opt Organizers submitted by /u/Adept_Ad_3308 [link] [comments]  ( 86 min )
    [D] LaMDA long-term memory
    Google's February, 2022 LaMDA paper says it is preconditioned on previous interactions (someone on this subreddit said 14-30) in support of tuning its "sensibleness" metric, which includes making sure responses don't contradict anything said earlier. However, in this podcast, Blake Lemoine says at 5:30-7:00 that LaMDA has some kind of long-term memory stretching back at least five years. He also mentions that the current system called "LaMDA 2" has access to a much wider variety of database resources than the paper or other Google publications describe, including Google Images, YouTube, and Google Books. Is LaMDA 2 documented anywhere? What other features does it have beyond what is documented in the February paper? submitted by /u/Competitive_Travel16 [link] [comments]  ( 88 min )
  • Open

    Using Learning Rate Schedules for Deep Learning Models in Python with Keras
    Training a neural network or large deep learning model is a difficult optimization task. The classical algorithm to train neural networks is called stochastic gradient descent. It has been well established that you can achieve increased performance and faster training on some problems by using a learning rate that changes during training. In this post […] The post Using Learning Rate Schedules for Deep Learning Models in Python with Keras appeared first on Machine Learning Mastery.  ( 24 min )
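The core pattern from the post, in sketch form (the tiny model and the halving factor are placeholders): define a schedule function and hand it to Keras via the LearningRateScheduler callback.

```python
from tensorflow import keras

def step_decay(epoch, lr):
    """Halve the learning rate every 10 epochs."""
    return lr * 0.5 if epoch > 0 and epoch % 10 == 0 else lr

model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")

callback = keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
# model.fit(X, y, epochs=50, callbacks=[callback])  # X, y: your training data
```

Time-based decay works the same way; only the body of the schedule function changes.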
  • Open

    "Reinforcement Learning for Datacenter Congestion Control", Tessler et al 2021 {NV}
    submitted by /u/gwern [link] [comments]  ( 84 min )
    "DexMV: Imitation Learning for Dexterous Manipulation from Human Videos", Qin et al 2021
    submitted by /u/gwern [link] [comments]  ( 84 min )
    "Job Hunt as a PhD in RL: How it Actually Happens", Nato Lambert
    submitted by /u/gwern [link] [comments]  ( 85 min )
    Reward and step functions for path planning?
Are there gym environments which provide reward and step functions for path planning? That is, environments whose output is a set of waypoints instead of throttle and steering. submitted by /u/Mortang64 [link] [comments]  ( 84 min )
  • Open

    ​​Deep Hierarchical Planning from Pixels
    Posted by Danijar Hafner, Student Researcher, Google Research Research into how artificial agents can make decisions has evolved rapidly through advances in deep reinforcement learning. Compared to generative ML models like GPT-3 and Imagen, artificial agents can directly influence their environment through actions, such as moving a robot arm based on camera inputs or clicking a button in a web browser. While artificial agents have the potential to be increasingly helpful to people, current methods are held back by the need to receive detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite large computational budgets, even powerful programs such as AlphaGo are limited to a few hundred moves until receiving their next reward. In co…  ( 26 min )
  • Open

    Gradient-based Neuromorphic Learning on Dynamical RRAM Arrays
    submitted by /u/Harley109 [link] [comments]  ( 84 min )
    Arnold Schwarzenegger One Liners
    I've been playing around with some Neural Network Text generator stuff and was wondering if anyone might know where I can get a compiled list of Arnold Schwarzenegger one liners for.. reasons.. submitted by /u/QwikMathz [link] [comments]  ( 84 min )
  • Open

    No Fueling Around: Designers Collaborate in Extended Reality on Porsche Electric Race Car
    A one-of-a-kind electric race car revved to life before it was manufactured — or even prototyped — thanks to GPU-powered extended reality technology. At the Automotive Innovation Forum in May, NVIDIA worked with Autodesk VRED to showcase a photorealistic Porsche electric sports car in augmented reality, with multiple attendees collaborating in the same immersive environment. Read article > The post No Fueling Around: Designers Collaborate in Extended Reality on Porsche Electric Race Car appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    How to tell if a document management system is ready for the future? (part-1)
    What is document management?  ( 8 min )
    Metaverse Technology and Human-AI Interaction
    The role of AI in the Metaverse has yet to be established. Is AI and blockchain technology a good fit?  ( 9 min )
  • Open

    Onboard PaddleOCR with Amazon SageMaker Projects for MLOps to perform optical character recognition on identity documents
    Optical character recognition (OCR) is the task of converting printed or handwritten text into machine-encoded text. OCR has been widely used in various scenarios, such as document electronization and identity authentication. Because OCR can greatly reduce the manual effort to register key information and serve as an entry step for understanding large volumes of documents, […]  ( 11 min )
  • Open

    A Study on Robustness to Perturbations for Representations of Environmental Sound. (arXiv:2203.10425v3 [cs.SD] UPDATED)
Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by myriad microphones' range and acoustic conditions -- commonly known as channel effects. We aim to extend HEAR to evaluate invariance to channel effects in this work. To accomplish this, we imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, it helps us make a more informed prediction of how robust the embeddings are to the channel effects. We evaluate two embeddings -- YAMNet, and OpenL3 on monophonic (UrbanSound8K) and polyphonic (SONYC-UST) urban datasets. We show that one distance measure does not suffice in such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates with the trend of the performance drop in the downstream task most accurately, we show that we need to study FAD in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of the embedding performance, we find OpenL3 to be more robust than YAMNet, which aligns with the HEAR evaluation.  ( 3 min )
    Individual health-disease phase diagrams for disease prevention based on machine learning. (arXiv:2205.15598v2 [cs.LG] UPDATED)
    Early disease detection and prevention methods based on effective interventions are gaining attention. Machine learning technology has enabled precise disease prediction by capturing individual differences in multivariate data. Progress in precision medicine has revealed that substantial heterogeneity exists in health data at the individual level and that complex health factors are involved in the development of chronic diseases. However, it remains a challenge to identify individual physiological state changes in cross-disease onset processes because of the complex relationships among multiple biomarkers. Here, we present the health-disease phase diagram (HDPD), which represents a personal health state by visualizing the boundary values of multiple biomarkers that fluctuate early in the disease progression process. In HDPDs, future onset predictions are represented by perturbing multiple biomarker values while accounting for dependencies among variables. We constructed HDPDs for 11 non-communicable diseases (NCDs) from a longitudinal health checkup cohort of 3,238 individuals, comprising 3,215 measurement items and genetic data. Improvement of biomarker values to the non-onset region in HDPD significantly prevented future disease onset in 7 out of 11 NCDs. Our results demonstrate that HDPDs can represent individual physiological states in the onset process and be used as intervention goals for disease prevention.  ( 3 min )
    Some performance considerations when using multi-armed bandit algorithms in the presence of missing data. (arXiv:2205.03820v2 [stat.ML] UPDATED)
    When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, it also affects their implementation where the simplest approach to overcome this is to continue to sample according to the original bandit algorithm, ignoring missing outcomes. We investigate the impact on performance of this approach to deal with missing data for several bandit algorithms through an extensive simulation study assuming the rewards are missing at random. We focus on two-armed bandit algorithms with binary outcomes in the context of patient allocation for clinical trials with relatively small sample sizes. However, our results apply to other applications of bandit algorithms where missing data is expected to occur. We assess the resulting operating characteristics, including the expected reward. Different probabilities of missingness in both arms are considered. The key finding of our work is that when using the simplest strategy of ignoring missing data, the impact on the expected performance of multi-armed bandit strategies varies according to the way these strategies balance the exploration-exploitation trade-off. Algorithms that are geared towards exploration continue to assign samples to the arm with more missing responses (which, being perceived as the arm with less observed information, is deemed more appealing by the algorithm than it would otherwise be). In contrast, algorithms that are geared towards exploitation would rapidly assign a high value to samples from the arms with a current high mean irrespective of the level of observations per arm. Furthermore, for algorithms focusing more on exploration, we illustrate that the problem of missing responses can be alleviated using a simple mean imputation approach.
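    A minimal sketch of the setting above (assumptions: epsilon-greedy on two Bernoulli arms, rewards missing at random with arm-specific probabilities; all constants are illustrative), contrasting the "ignore missing outcomes" strategy with simple mean imputation:

        import numpy as np

        def run(impute, n=2000, p=(0.5, 0.6), p_miss=(0.4, 0.0), eps=0.1, seed=0):
            rng = np.random.default_rng(seed)
            counts, sums = np.zeros(2), np.zeros(2)
            for _ in range(n):
                means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
                arm = rng.integers(2) if rng.random() < eps else int(np.argmax(means))
                reward = float(rng.random() < p[arm])
                if rng.random() < p_miss[arm]:      # outcome never observed
                    if impute:                      # credit the current arm mean
                        counts[arm] += 1
                        sums[arm] += means[arm]
                    continue                        # else: ignore it entirely
                counts[arm] += 1
                sums[arm] += reward
            return counts                           # allocations per arm

        print("ignore :", run(impute=False))
        print("impute :", run(impute=True))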
    Learning grammar with a divide-and-concur neural network. (arXiv:2201.07341v3 [cs.CL] UPDATED)
    We implement a divide-and-concur iterative projection approach to context-free grammar inference. Unlike most state-of-the-art models of natural language processing, our method requires a relatively small number of discrete parameters, making the inferred grammar directly interpretable -- one can read off from a solution how to construct grammatically valid sentences. Another advantage of our approach is the ability to infer meaningful grammatical rules from just a few sentences, compared to the hundreds of gigabytes of training data many other models employ. We demonstrate several ways of applying our approach: classifying words and inferring a grammar from scratch, taking an existing grammar and refining its categories and rules, and taking an existing grammar and expanding its lexicon as it encounters new words in new data.
    Towards Better Understanding of Self-Supervised Representations. (arXiv:2203.01881v2 [cs.LG] UPDATED)
    Self-supervised learning methods have shown impressive results in downstream classification tasks. However, there is limited work in understanding and interpreting their learned representations. In this paper, we study the representation space of several state-of-the-art self-supervised models including SimCLR, SwaV, MoCo V2 and BYOL. Without the use of class label information, we first discover discriminative features that are highly active for various subsets of samples and correspond to unique physical attributes in images. We show that, using such discriminative features, one can compress the representation space of self-supervised models up to 50% without affecting downstream linear classification significantly. Next, we propose a sample-wise Self-Supervised Representation Quality Score (or, Q-Score) that can be computed without access to any label information. Q-Score utilizes discriminative features to reliably predict if a given sample is likely to be mis-classified in the downstream classification task, achieving an AUPRC of 0.91 on SimCLR and BYOL trained on ImageNet-100. Q-Score can also be used as a regularization term to remedy low-quality representations, leading to up to an 8% relative improvement in accuracy on all 4 self-supervised baselines on ImageNet-100, CIFAR-10, CIFAR-100 and STL-10. Moreover, through heatmap analysis, we show that Q-Score regularization enhances discriminative features and reduces feature noise, thus improving model interpretability.
    Classification of Time-Series Data Using Boosted Decision Trees. (arXiv:2110.00581v2 [cs.LG] UPDATED)
    Time-series data classification is central to the analysis and control of autonomous systems, such as robots and self-driving cars. Temporal logic-based learning algorithms have been proposed recently as classifiers of such data. However, current frameworks are either inaccurate for real-world applications, such as autonomous driving, or they generate long and complicated formulae that lack interpretability. To address these limitations, we introduce a novel learning method, called Boosted Concise Decision Trees (BCDTs), to generate binary classifiers that are represented as Signal Temporal Logic (STL) formulae. Our algorithm leverages an ensemble of Concise Decision Trees (CDTs) to improve the classification performance, where each CDT is a decision tree that is empowered by a set of techniques to generate simpler formulae and improve interpretability. The effectiveness and classification performance of our algorithm are evaluated on naval surveillance and urban-driving case studies.
    Exploiting Action Impact Regularity and Exogenous State Variables for Offline Reinforcement Learning. (arXiv:2111.08066v3 [cs.LG] UPDATED)
    Offline reinforcement learning -- learning a policy from a batch of data -- is known to be hard for general MDPs. These results motivate the need to look at specific classes of MDPs where offline reinforcement learning might be feasible. In this work, we explore a restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact a part of the state (an endogenous component) with limited impact on the remaining part of the state (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains including financial markets. We discuss algorithms that exploit the AIR property, and provide a theoretical analysis for an algorithm based on Fitted-Q Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real-world environments where the regularity holds.
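    Since the analysis centers on Fitted-Q Iteration, here is a minimal, generic sketch of FQI on an offline batch of transitions (assumptions: discrete actions, an ExtraTrees regressor as the function class, synthetic data; this is not the paper's AIR-specific algorithm):

        import numpy as np
        from sklearn.ensemble import ExtraTreesRegressor

        rng = np.random.default_rng(0)
        n, d, n_actions, gamma = 5000, 4, 2, 0.95
        S = rng.standard_normal((n, d))                  # states
        A = rng.integers(0, n_actions, n)                # logged actions
        R = rng.standard_normal(n)                       # rewards
        S2 = S + 0.1 * rng.standard_normal((n, d))       # next states (offline batch)

        X = np.column_stack([S, A])
        q = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, R)
        for _ in range(20):                              # FQI iterations
            # Bootstrapped targets: r + gamma * max_a' Q(s', a')
            q_next = np.max(
                [q.predict(np.column_stack([S2, np.full(n, a)]))
                 for a in range(n_actions)], axis=0)
            q = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(
                X, R + gamma * q_next)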
    Mitigating shortage of labeled data using clustering-based active learning with diversity exploration. (arXiv:2207.02964v1 [cs.LG])
    In this paper, we propose a new clustering-based active learning framework, namely Active Learning using a Clustering-based Sampling (ALCS), to address the shortage of labeled data. ALCS employs a density-based clustering approach to explore the cluster structure from the data without requiring exhaustive parameter tuning. A bi-cluster boundary-based sample query procedure is introduced to improve the learning performance for classifying highly overlapped classes. Additionally, we developed an effective diversity exploration strategy to address the redundancy among queried samples. Our experimental results justify the efficacy of the ALCS approach.
    Learning towards Robustness in Causally-Invariant Predictors. (arXiv:2107.01876v2 [stat.ML] UPDATED)
    We propose to learn an invariant causal predictor that is robust to distributional shifts, in the supervised regression scenario. Based on a disentangled causal factorization that describes the underlying data generating process, we attribute the distributional shifts to mutation of generating factors, which covers a wide range of cases of distributional shifts as we do not make prior specifications on the causal structure or the source of mutation. Under this causal framework, we identify a set of invariant predictors based on the do-operator. We provide a sufficient and necessary condition for a predictor to be min-max optimal, i.e., minimizes the worst-case quadratic loss among all domains. This condition is justifiable under the Markovian and faithfulness assumptions, thus inspiring a practical algorithm to identify the optimal predictor. For empirical estimation, we propose a permutation-regeneration scheme guided by a local causal discovery procedure. The utility and effectiveness of our method are demonstrated in simulation data and two real-world applications: Alzheimer's disease diagnosis and gene function prediction.
    Neural Stein critics with staged $L^2$-regularization. (arXiv:2207.03406v1 [stat.ML])
    Learning to differentiate model distributions from observed data is a fundamental problem in statistics and machine learning, and high-dimensional data remains a challenging setting for such problems. Metrics that quantify the disparity in probability distributions, such as the Stein discrepancy, play an important role in statistical testing in high dimensions. In this paper, we consider the setting where one wishes to distinguish between data sampled from an unknown probability distribution and a nominal model distribution. While recent studies revealed that the optimal $L^2$-regularized Stein critic equals the difference of the score functions of two probability distributions up to a multiplicative constant, we investigate the role of $L^2$ regularization when training a neural network Stein discrepancy critic function. Motivated by the Neural Tangent Kernel theory of training neural networks, we develop a novel staging procedure for the weight of regularization over training time. This leverages the advantages of highly-regularized training at early times while also empirically delaying overfitting. Theoretically, we relate the training dynamics under a large regularization weight to the kernel regression optimization of the "lazy training" regime at early training times. The benefit of the staged $L^2$ regularization is demonstrated on simulated high dimensional distribution drift data and an application to evaluating generative models of image data.
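    The staging idea can be made concrete with a small sketch (assumptions: 1-D data, model distribution $q = N(0,1)$ so its score is $s_q(x) = -x$, and a geometric decay schedule for the regularization weight; none of this is the authors' code). The critic maximizes the empirical Stein objective $E_p[f(x) s_q(x) + f'(x)]$ under a staged $L^2$ penalty:

        import torch

        critic = torch.nn.Sequential(
            torch.nn.Linear(1, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
        opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
        data = 0.5 + torch.randn(2048, 1)           # samples from the unknown p

        def reg_weight(step, total, lam_hi=1.0, lam_lo=1e-3):
            # Staging schedule (an assumption): heavily regularized early, then decay.
            return lam_hi * (lam_lo / lam_hi) ** (step / total)

        total_steps = 2000
        for step in range(total_steps):
            x = data[torch.randint(0, len(data), (256,))].requires_grad_(True)
            f = critic(x)
            # f'(x) via autograd, for the divergence term of the Stein operator.
            df = torch.autograd.grad(f.sum(), x, create_graph=True)[0]
            stein = (f * (-x) + df).mean()          # E_p[f * s_q + f']
            lam = reg_weight(step, total_steps)
            loss = -stein + lam * (f ** 2).mean()   # staged L2 penalty
            opt.zero_grad(); loss.backward(); opt.step()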
    Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation. (arXiv:2110.12328v4 [cs.LG] UPDATED)
    Spectral clustering is one of the most popular clustering methods. However, the high computational cost due to the involved eigen-decomposition procedure can immediately hinder its applications in large-scale tasks. In this paper, we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudo-nodes based on spectral similarity. Then, the standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the same cluster as its representative pseudo-node. The proposed framework runs in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
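    A minimal sketch of the coarsen-cluster-map-back loop (with greedy heavy-edge matching as a crude stand-in for spectrum-preserving aggregation; the dataset and constants are illustrative):

        import numpy as np
        from sklearn.cluster import SpectralClustering
        from sklearn.datasets import make_blobs
        from sklearn.metrics.pairwise import rbf_kernel

        X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
        W = rbf_kernel(X, gamma=0.5)
        np.fill_diagonal(W, 0)

        assign = -np.ones(len(X), dtype=int)           # node -> pseudo-node id
        groups = []
        for i in np.argsort(-W.max(1)):                # greedy heavy-edge matching
            if assign[i] >= 0:
                continue
            j = int(np.argmax(np.where(assign < 0, W[i], -1)))
            assign[[i, j]] = len(groups)
            groups.append([i, j] if j != i else [i])

        # Coarsened affinity: total edge weight between member sets.
        m = len(groups)
        Wc = np.zeros((m, m))
        for a in range(m):
            for b in range(m):
                if a != b:
                    Wc[a, b] = W[np.ix_(groups[a], groups[b])].sum()

        labels_c = SpectralClustering(3, affinity="precomputed",
                                      random_state=0).fit_predict(Wc)
        labels = labels_c[assign]                      # map back to original nodes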
    FedHeN: Federated Learning in Heterogeneous Networks. (arXiv:2207.03031v1 [cs.LG])
    We propose a novel training recipe for federated learning with heterogeneous networks where each device can have different architectures. We introduce training with a side objective to the devices of higher complexities to jointly train different architectures in a federated setting. We empirically show that our approach improves the performance of different architectures and leads to high communication savings compared to the state-of-the-art methods.
    DAiSEE: Towards User Engagement Recognition in the Wild. (arXiv:1609.01885v7 [cs.CV] UPDATED)
    We introduce DAiSEE, the first multi-label video classification dataset comprising 9068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration in the wild. The dataset has four levels of labels, namely very low, low, high, and very high, for each of the affective states, which are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists. We have also established benchmark results on this dataset using state-of-the-art video classification methods that are available today. We believe that DAiSEE will provide the research community with challenges in feature extraction, context-based inference, and development of suitable machine learning methods for related tasks, thus providing a springboard for further research. The dataset is available for download at https://people.iith.ac.in/vineethnb/resources/daisee/index.html.
    Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models. (arXiv:2110.04478v3 [cs.DC] UPDATED)
    Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multi-dimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72X (2.70X max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49X (2.25X max), 1.30X (1.78X max), 1.30X (1.77X max), and 1.25X (1.53X max), respectively.
    Differentially Private Stochastic Linear Bandits: (Almost) for Free. (arXiv:2207.03445v1 [cs.LG])
    In this paper, we propose differentially private algorithms for the problem of stochastic linear bandits in the central, local and shuffled models. In the central model, we achieve almost the same regret as the optimal non-private algorithms, which means we get privacy for free. In particular, we achieve a regret of $\tilde{O}(\sqrt{T}+\frac{1}{\epsilon})$ matching the known lower bound for private linear bandits, while the best previously known algorithm achieves $\tilde{O}(\frac{1}{\epsilon}\sqrt{T})$. In the local case, we achieve a regret of $\tilde{O}(\frac{1}{\epsilon}{\sqrt{T}})$ which matches the non-private regret for constant $\epsilon$, but suffers a regret penalty when $\epsilon$ is small. In the shuffled model, we also achieve regret of $\tilde{O}(\sqrt{T}+\frac{1}{\epsilon})$ as in the central case, while the best previously known algorithm suffers a regret of $\tilde{O}(\frac{1}{\epsilon}{T^{3/5}})$. Our numerical evaluation validates our theoretical results.
    Minimax formula for the replica symmetric free energy of deep restricted Boltzmann machines. (arXiv:2005.09424v2 [cond-mat.dis-nn] UPDATED)
    We study the free energy of one of the most widely used deep architectures for restricted Boltzmann machines, in which the layers are arranged in series. Assuming independent Gaussian distributed random weights, we show that the error term in the so-called replica symmetric sum rule can be optimised as a saddle point. This leads us to conjecture that in the replica symmetric approximation the free energy is given by a min-max formula, which parallels the one achieved for the two-layer case.
    Offline Meta-Reinforcement Learning with Online Self-Supervision. (arXiv:2107.03974v4 [cs.LG] UPDATED)
    Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. By not requiring reward labels for online collection, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.
    Model Selection in Reinforcement Learning with General Function Approximations. (arXiv:2207.02992v1 [stat.ML])
    We consider model selection for classic Reinforcement Learning (RL) environments -- Multi Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, where the true models -- reward generating function for MABs and transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs, that adapt to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., $\mathcal{F}$ and $\mathcal{M}$) a priori. Furthermore, for both the settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.
    Distributionally Robust Policy Learning via Adversarial Environment Generation. (arXiv:2107.06353v6 [cs.RO] UPDATED)
    Our goal is to train control policies that generalize well to unseen environments. Inspired by the Distributionally Robust Optimization (DRO) framework, we propose DRAGEN - Distributionally Robust policy learning via Adversarial Generation of ENvironments - for iteratively improving robustness of policies to realistic distribution shifts by generating adversarial environments. The key idea is to learn a generative model for environments whose latent variables capture cost-predictive and realistic variations in environments. We perform DRO with respect to a Wasserstein ball around the empirical distribution of environments by generating realistic adversarial environments via gradient ascent on the latent space. We demonstrate strong Out-of-Distribution (OoD) generalization in simulation for (i) swinging up a pendulum with onboard vision and (ii) grasping realistic 3D objects. Grasping experiments on hardware demonstrate better sim2real performance compared to domain randomization.
    Pre-trained Gaussian processes for Bayesian optimization. (arXiv:2109.08215v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. Theoretically, we show a bounded regret of BO with pre-trained priors. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
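    A minimal sketch of the pre-training idea (assumptions: scikit-learn GPs, kernel hyperparameters fit by maximum likelihood on data pooled from related tasks, then frozen and reused for a new task's surrogate; a simple proxy for the paper's pre-trained priors, not their method):

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, ConstantKernel

        rng = np.random.default_rng(0)
        f = lambda x, shift: np.sin(3 * x + shift).ravel()  # family of similar functions

        # Pre-train: pool observations from several related functions.
        Xp = rng.uniform(0, 3, (60, 1))
        yp = np.concatenate([f(Xp[i::3], s) for i, s in enumerate((0.0, 0.1, 0.2))])
        Xp = np.concatenate([Xp[i::3] for i in range(3)])
        pre = GaussianProcessRegressor(ConstantKernel() * RBF()).fit(Xp, yp)

        # New task: reuse the learned kernel with its hyperparameters frozen,
        # so the surrogate starts from an informed functional prior.
        gp = GaussianProcessRegressor(pre.kernel_, optimizer=None)
        Xn = rng.uniform(0, 3, (5, 1))
        gp.fit(Xn, f(Xn, 0.05))
        mu, sd = gp.predict(np.linspace(0, 3, 100)[:, None], return_std=True)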
    An Additive Instance-Wise Approach to Multi-class Model Interpretation. (arXiv:2207.03113v1 [cs.LG])
    Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system and whether to trust it for high-stakes decisions or large-scale deployment. Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches. Additive models use heuristically sampled perturbations to learn instance-specific explainers sequentially. The process is thus inefficient and susceptible to poorly-conditioned samples. Meanwhile, instance-wise techniques directly learn local sampling distributions and can leverage global information from other inputs. However, they can only interpret single-class predictions and suffer from inconsistency across different settings, due to a strict reliance on a pre-defined number of features selected. This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes. We also propose an adaptive inference strategy to determine the optimal number of features for a specific instance. Our model explainer significantly outperforms additive and instance-wise counterparts on faithfulness while achieving a high level of brevity on various data sets and black-box model architectures.
    Stochastic optimal well control in subsurface reservoirs using reinforcement learning. (arXiv:2207.03456v1 [cs.LG])
    We present a case study of a model-free reinforcement learning (RL) framework to solve stochastic optimal control for a predefined parameter uncertainty distribution and partially observable system. We focus on the robust optimal well control problem, which is a subject of intensive research activities in the field of subsurface reservoir management. For this problem, the system is partially observed since the data is only available at well locations. Furthermore, the model parameters are highly uncertain due to the sparsity of available field data. In principle, RL algorithms are capable of learning optimal action policies -- a map from states to actions -- to maximize a numerical reward signal. In deep RL, this mapping from state to action is parameterized using a deep neural network. In the RL formulation of the robust optimal well control problem, the states are represented by saturation and pressure values at well locations while the actions represent the valve openings controlling the flow through wells. The numerical reward refers to the total sweep efficiency and the uncertain model parameter is the subsurface permeability field. The model parameter uncertainties are handled by introducing a domain randomisation scheme that exploits cluster analysis on its uncertainty distribution. We present numerical results using two state-of-the-art RL algorithms, proximal policy optimization (PPO) and advantage actor-critic (A2C), on two subsurface flow test cases representing two distinct uncertainty distributions of the permeability field. The results were benchmarked against optimisation results obtained using a differential evolution algorithm. Furthermore, we demonstrate the robustness of the proposed use of RL by evaluating the learned control policy on unseen samples drawn from the parameter uncertainty distribution that were not used during the training process.
    Fairness and Bias in Robot Learning. (arXiv:2207.03444v1 [cs.RO])
    Machine learning has significantly enhanced the abilities of robots, enabling them to perform a wide range of tasks in human environments and adapt to our uncertain real world. Recent works in various domains of machine learning have highlighted the importance of accounting for fairness to ensure that these algorithms do not reproduce human biases and consequently lead to discriminatory outcomes. With robot learning systems increasingly performing more and more tasks in our everyday lives, it is crucial to understand the influence of such biases to prevent unintended behavior toward certain groups of people. In this work, we present the first survey on fairness in robot learning from an interdisciplinary perspective spanning technical, ethical, and legal challenges. We propose a taxonomy for sources of bias and the resulting types of discrimination due to them. Using examples from different robot learning domains, we examine scenarios of unfair outcomes and strategies to mitigate them. We present early advances in the field by covering different fairness definitions, ethical and legal considerations, and methods for fair robot learning. With this work, we aim to pave the way for groundbreaking developments in fair robot learning.
    Challenges and Pitfalls of Bayesian Unlearning. (arXiv:2207.03227v1 [cs.LG])
    Machine unlearning refers to the task of removing a subset of training data, thereby removing its contributions to a trained model. Approximate unlearning methods are one class of approaches to this task that avoid the need to retrain the model from scratch on the retained data. Bayes' rule can be used to cast approximate unlearning as an inference problem where the objective is to obtain the updated posterior by dividing out the likelihood of deleted data. However, this has its own set of challenges as one often doesn't have access to the exact posterior of the model parameters. In this work we examine the use of the Laplace approximation and Variational Inference to obtain the updated posterior. With a neural network trained for a regression task as the guiding example, we draw insights on the applicability of Bayesian unlearning in practical scenarios.
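    To make the "dividing out" step concrete: for conditionally i.i.d. data split into retained and deleted parts, $D = D_{\text{keep}} \cup D_{\text{del}}$, Bayes' rule gives $p(\theta \mid D_{\text{keep}}) \propto p(\theta \mid D) / p(D_{\text{del}} \mid \theta)$ (an elementary identity, not notation taken from the paper). This is why access to a good approximation of the exact full-data posterior $p(\theta \mid D)$ -- here via Laplace or variational approximations -- is the crux of the approach.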
    Directed Weight Neural Networks for Protein Structure Representation Learning. (arXiv:2201.13299v3 [q-bio.BM] UPDATED)
    A protein performs biological functions by folding to a particular 3D structure. To accurately model the protein structures, both the overall geometric topology and local fine-grained relations between amino acids (e.g. side-chain torsion angles and inter-amino-acid orientations) should be carefully considered. In this work, we propose the Directed Weight Neural Network for better capturing geometric relations among different amino acids. Extending a single weight from a scalar to a 3D directed vector, our new framework supports a rich set of geometric operations on both classical and SO(3)-representation features, on top of which we construct a perceptron unit for processing amino-acid information. In addition, we introduce an equivariant message passing paradigm on proteins for plugging the directed weight perceptrons into existing Graph Neural Networks, showing superior versatility in maintaining SO(3)-equivariance at the global scale. Experiments show that our network has remarkably better expressiveness in representing geometric relations in comparison to classical neural networks and the (globally) equivariant networks. It also achieves state-of-the-art performance on various computational biology applications related to protein 3D structures.
    Semi-unsupervised Learning for Time Series Classification. (arXiv:2207.03119v1 [cs.LG])
    Time series are ubiquitous and therefore inherently hard to analyze and ultimately to label or cluster. With the rise of the Internet of Things (IoT) and its smart devices, data is collected in large amounts any given second. The collected data is rich in information, as one can detect accidents (e.g. cars) in real time, or assess injury/sickness over a given time span (e.g. health devices). Due to their chaotic nature and massive numbers of data points, time series are hard to label manually. Furthermore, new classes within the data could emerge over time (contrary to e.g. handwritten digits), which would require relabeling the data. In this paper we present SuSL4TS, a deep generative Gaussian mixture model for semi-unsupervised learning, to classify time series data. With our approach we can alleviate manual labeling steps, since we can detect sparsely labeled classes (semi-supervised) and identify emerging classes hidden in the data (unsupervised). We demonstrate the efficacy of our approach with established time series classification datasets from different domains.
    Softmax-free Linear Transformers. (arXiv:2207.03341v1 [cs.CV])
    Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by stacked self-attention operations. Employing self-attention modules results in a quadratic complexity in both computation and memory usage. Various attempts at approximating the self-attention computation with linear complexity have thus been made in Natural Language Processing. However, an in-depth analysis in this work reveals that they are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in retaining the softmax self-attention during approximations. Specifically, conventional self-attention is computed by normalizing the scaled dot-product between token feature vectors. Preserving the softmax operation challenges any subsequent linearization efforts. With this insight, a SOftmax-Free Transformer (abbreviated as SOFT) is proposed for the first time. To eliminate the softmax operator in self-attention, a Gaussian kernel function is adopted to replace the dot-product similarity. This enables a full self-attention matrix to be approximated via a low-rank matrix decomposition. The robustness of our approximation is achieved by calculating its Moore-Penrose inverse using a Newton-Raphson method. Further, an efficient symmetric normalization is introduced on the low-rank self-attention for enhancing model generalizability and transferability. Extensive experiments on ImageNet, COCO and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in superior trade-off between accuracy and complexity.
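    A minimal sketch of the core computation (assumptions: a Gaussian kernel in place of dot-product softmax, Nyström-style low-rank reconstruction from m sampled landmark tokens, and torch.linalg.pinv standing in for the paper's Newton-Raphson Moore-Penrose iteration; no multi-head projections or the paper's symmetric normalization):

        import torch

        def gauss_kernel(a, b):
            # exp(-||a_i - b_j||^2 / 2), pairwise over tokens
            return torch.exp(-torch.cdist(a, b) ** 2 / 2)

        def soft_attention(x, m=16):
            n = x.shape[0]
            landmarks = x[torch.randperm(n)[:m]]            # sample m landmark tokens
            k_nm = gauss_kernel(x, landmarks)               # (n, m)
            k_mm = gauss_kernel(landmarks, landmarks)       # (m, m)
            # Low-rank reconstruction of the full n x n attention matrix.
            attn = k_nm @ torch.linalg.pinv(k_mm) @ k_nm.T
            return attn @ x                                 # attend over values

        tokens = torch.randn(64, 32)                        # 64 tokens, dim 32
        out = soft_attention(tokens)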
    Equivariant Representation Learning via Class-Pose Decomposition. (arXiv:2207.03116v1 [cs.LG])
    We introduce a general method for learning representations that are equivariant to symmetries of data. The central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is self-supervised and infers these semantics based on relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We empirically investigate the approach via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.
    Multi-Label Learning to Rank through Multi-Objective Optimization. (arXiv:2207.03060v1 [cs.IR])
    The Learning to Rank (LTR) technique is ubiquitous in Information Retrieval systems nowadays, especially in search ranking applications. The query-item relevance labels typically used to train the ranking model are often noisy measurements of human behavior, e.g., product rating for product search. The coarse measurements make the ground truth ranking non-unique with respect to a single relevance criterion. To resolve ambiguity, it is desirable to train a model using many relevance criteria, giving rise to Multi-Label LTR (MLLTR). Moreover, it formulates multiple goals that may be conflicting yet important to optimize for simultaneously, e.g., in product search, a ranking model can be trained based on product quality and purchase likelihood to increase revenue. In this research, we leverage the Multi-Objective Optimization (MOO) aspect of the MLLTR problem and employ recently developed MOO algorithms to solve it. Specifically, we propose a general framework where the information from labels can be combined in a variety of ways to meaningfully characterize the trade-off among the goals. Our framework allows for any gradient-based MOO algorithm to be used for solving the MLLTR problem. We test the proposed framework on two publicly available LTR datasets and one e-commerce dataset to show its efficacy.
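    A minimal sketch of the multi-label setup (assumptions: a linear scorer, two binary relevance labels, random stand-in features, and fixed-weight scalarization standing in for the gradient-based MOO algorithms the framework can plug in):

        import torch

        scorer = torch.nn.Linear(16, 1)
        opt = torch.optim.SGD(scorer.parameters(), lr=0.1)
        x = torch.randn(128, 16)                             # query-item features
        y_quality = torch.randint(0, 2, (128, 1)).float()    # label 1: product quality
        y_purchase = torch.randint(0, 2, (128, 1)).float()   # label 2: purchase likelihood

        for _ in range(100):
            s = scorer(x)
            loss_q = torch.nn.functional.binary_cross_entropy_with_logits(s, y_quality)
            loss_p = torch.nn.functional.binary_cross_entropy_with_logits(s, y_purchase)
            # Fixed trade-off weights; any gradient-based MOO combination rule
            # could replace this scalarization to trace out the trade-off front.
            loss = 0.7 * loss_q + 0.3 * loss_p
            opt.zero_grad(); loss.backward(); opt.step()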
    Adaptive Personalization in Federated Learning for Highly Non-i.i.d. Data. (arXiv:2207.03448v1 [cs.LG])
    Federated learning (FL) is a distributed learning method that offers medical institutes the prospect of collaboration in a global model while preserving the privacy of their patients. Although most medical centers conduct similar medical imaging tasks, their differences, such as specializations, number of patients, and devices, lead to distinctive data distributions. Data heterogeneity poses a challenge for FL and the personalization of the local models. In this work, we investigate an adaptive hierarchical clustering method for FL to produce intermediate semi-global models, so clients with similar data distribution have the chance of forming a more specialized model. Our method forms several clusters consisting of clients with the most similar data distributions; then, each cluster continues to train separately. Inside the cluster, we use meta-learning to improve the personalization of the participants' models. We compare the clustering approach with classical FedAvg and centralized training by evaluating our proposed methods on the HAM10k dataset for skin lesion classification with extreme heterogeneous data distribution. Our experiments demonstrate significant performance gain in heterogeneous distribution compared to standard FL methods in classification accuracy. Moreover, we show that the models converge faster if applied in clusters and outperform centralized training while using only a small subset of data.
    Y-Net: A Spatiospectral Dual-Encoder Network for Medical Image Segmentation. (arXiv:2204.07613v2 [eess.IV] UPDATED)
    Automated segmentation of retinal optical coherence tomography (OCT) images has become an important recent direction in machine learning for medical applications. We hypothesize that the anatomic structure of layers and their high-frequency variation in OCT images make retinal OCT a fitting choice for extracting spectral-domain features and combining them with spatial domain features. In this work, we present $\Upsilon$-Net, an architecture that combines the frequency domain features with the image domain to improve the segmentation performance of OCT images. The results of this work demonstrate that the introduction of two branches, one for spectral and one for spatial domain features, brings a very significant improvement in fluid segmentation performance and allows the model to outperform the well-known U-Net. Our improvement was 13% on the fluid segmentation Dice score and 1.9% on the average Dice score. Finally, removing selected frequency ranges in the spectral domain demonstrates the impact of these features on fluid segmentation performance.
    On the Relationship Between Adversarial Robustness and Decision Region in Deep Neural Network. (arXiv:2207.03400v1 [cs.LG])
    In general, Deep Neural Networks (DNNs) are evaluated by the generalization performance measured on unseen data excluded from the training phase. Along with the development of DNNs, the generalization performance converges to the state-of-the-art and it becomes difficult to evaluate DNNs solely based on this metric. The robustness against adversarial attack has been used as an additional metric to evaluate DNNs by measuring their vulnerability. However, few studies have been performed to analyze the adversarial robustness in terms of the geometry in DNNs. In this work, we perform an empirical study to analyze the internal properties of DNNs that affect model robustness under adversarial attacks. In particular, we propose the novel concept of the Populated Region Set (PRS), where training samples are populated more frequently, to represent the internal properties of DNNs in a practical setting. From systematic experiments with the proposed concept, we provide empirical evidence to validate that a low PRS ratio has a strong relationship with the adversarial robustness of DNNs. We also devise a PRS regularizer leveraging the characteristics of PRS to improve the adversarial robustness without adversarial training.
    Back to the Source: Diffusion-Driven Test-Time Adaptation. (arXiv:2207.03442v1 [cs.LG])
    Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Existing methods update the source model by (re-)training on each target domain. While effective, re-training is sensitive to the amount and order of the data and the hyperparameters for optimization. We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation method, DDA, shares its models for classification and generation across all domains. Both models are trained on the source domain, then fixed during testing. We augment diffusion with image guidance and self-ensembling to automatically decide how much to adapt. Input adaptation by DDA is more robust than prior model adaptation approaches across a variety of corruptions, architectures, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data in small batches, dependent data in non-uniform order, or mixed data with multiple corruptions.
    SC2EGSet: StarCraft II Esport Replay and Game-state Dataset. (arXiv:2207.03428v1 [cs.LG])
    As a relatively new form of sport, esports offers unparalleled data availability. Despite the vast amounts of data that are generated by game engines, it can be challenging to extract them and verify their integrity for the purposes of practical and scientific use. Our work aims to open esports to a broader scientific community by supplying raw and pre-processed files from StarCraft II esports tournaments. These files can be used in statistical and machine learning modeling tasks and related to various laboratory-based measurements (e.g., behavioral tests, brain imaging). We have gathered publicly available game-engine generated "replays" of tournament matches and performed data extraction and cleanup using a low-level application programming interface (API) parser library. Additionally, we open-sourced and published all the custom tools that were developed in the process of creating our dataset. These tools include PyTorch and PyTorch Lightning API abstractions to load and model the data. Our dataset contains replays from major and premiere StarCraft II tournaments since 2016. To prepare the dataset, we processed 55 tournament "replaypacks" that contained 17930 files with game-state information. Based on initial investigation of available StarCraft II datasets, we observed that our dataset is the largest publicly available source of StarCraft II esports data upon its publication. Analysis of the extracted data holds promise for further Artificial Intelligence (AI), Machine Learning (ML), psychological, Human-Computer Interaction (HCI), and sports-related studies in a variety of supervised and self-supervised tasks.
    Learning Optimal Solutions via an LSTM-Optimization Framework. (arXiv:2207.02937v1 [cs.LG])
    In this study, we present a deep learning-optimization framework to tackle dynamic mixed-integer programs. Specifically, we develop a bidirectional Long Short Term Memory (LSTM) framework that can process information forward and backward in time to learn optimal solutions to sequential decision-making problems. We demonstrate our approach in predicting the optimal decisions for the single-item capacitated lot-sizing problem (CLSP), where a binary variable denotes whether to produce in a period or not. Due to the dynamic nature of the problem, the CLSP can be treated as a sequence labeling task where a recurrent neural network can capture the problem's temporal dynamics. Computational results show that our LSTM-Optimization (LSTM-Opt) framework significantly reduces the solution time of benchmark CLSP problems without much loss in feasibility and optimality. For example, the predictions at the 85% level reduce the CPLEX solution time by a factor of 9 on average for over 240,000 test instances with an optimality gap of less than 0.05% and 0.4% infeasibility in the test set. Also, models trained using shorter planning horizons can successfully predict the optimal solution of the instances with longer planning horizons. For the hardest data set, the LSTM predictions at the 25% level reduce the solution time of 70 CPU hours to less than 2 CPU minutes with an optimality gap of 0.8% and without any infeasibility. The LSTM-Opt framework outperforms classical ML algorithms, such as the logistic regression and random forest, in terms of the solution quality, and exact approaches, such as the ($\ell$, S) and dynamic programming-based inequalities, with respect to the solution time improvement. Our machine learning approach could be beneficial in tackling sequential decision-making problems similar to CLSP, which need to be solved repetitively, frequently, and in a fast manner.
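    A minimal sketch of the sequence-labeling formulation (not the paper's exact architecture): a bidirectional LSTM consumes per-period features and emits one produce/don't-produce logit per period; the features and labels below are random stand-ins for CLSP instances and their optimal solutions.

        import torch

        class BiLSTMLabeler(torch.nn.Module):
            def __init__(self, n_feat=3, hidden=64):
                super().__init__()
                self.lstm = torch.nn.LSTM(n_feat, hidden, batch_first=True,
                                          bidirectional=True)
                self.head = torch.nn.Linear(2 * hidden, 1)

            def forward(self, x):                 # x: (batch, periods, n_feat)
                h, _ = self.lstm(x)               # forward+backward hidden states
                return self.head(h).squeeze(-1)   # one logit per period

        model = BiLSTMLabeler()
        x = torch.randn(32, 12, 3)                # e.g. demand, capacity, cost per period
        y = (torch.rand(32, 12) < 0.5).float()    # optimal produce/no-produce labels
        loss = torch.nn.functional.binary_cross_entropy_with_logits(model(x), y)
        loss.backward()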
    Riemannian Diffusion Schrödinger Bridge. (arXiv:2207.03024v1 [stat.ML])
    Score-based generative models exhibit state-of-the-art performance on density estimation and generative modeling tasks. These models typically assume that the data geometry is flat, yet recent extensions have been developed to synthesize data living on Riemannian manifolds. Existing methods to accelerate sampling of diffusion models are typically not applicable in the Riemannian setting and Riemannian score-based methods have not yet been adapted to the important task of interpolation of datasets. To overcome these issues, we introduce the Riemannian Diffusion Schrödinger Bridge. Our proposed method generalizes the Diffusion Schrödinger Bridge introduced by De Bortoli et al. (2021) to the non-Euclidean setting and extends Riemannian score-based models beyond the first time reversal. We validate our proposed method on synthetic data and real Earth and climate data.
    Network Binarization via Contrastive Learning. (arXiv:2207.02970v1 [cs.CV])
    Neural network binarization accelerates deep models by quantizing their weights and activations into 1-bit. However, there is still a huge performance gap between Binary Neural Networks (BNNs) and their full-precision (FP) counterparts. As the quantization error caused by weights binarization has been reduced in earlier works, the activations binarization becomes the major obstacle for further improvement of the accuracy. BNNs are characterised by a unique and interesting structure, where the binary and latent FP activations exist in the same forward pass (i.e., $\text{Binarize}(\mathbf{a}_F) = \mathbf{a}_B$). To mitigate the information degradation caused by the binarization operation from FP to binary activations, we establish a novel contrastive learning framework while training BNNs through the lens of Mutual Information (MI) maximization. MI is introduced as the metric to measure the information shared between binary and FP activations, which assists binarization with contrastive learning. Specifically, the representation ability of the BNNs is greatly strengthened via pulling the positive pairs with binary and FP activations from the same input samples, as well as pushing negative pairs from different samples (the number of negative pairs can be exponentially large). This benefits downstream tasks, not only classification but also segmentation, depth estimation, etc. The experimental results show that our method can be implemented as a pile-up module on existing state-of-the-art binarization methods and can remarkably improve the performance over them on CIFAR-10/100 and ImageNet, in addition to the great generalization ability on NYUD-v2.
    Cross-Scale Vector Quantization for Scalable Neural Speech Coding. (arXiv:2207.03067v1 [cs.SD])
    Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so different models need to be trained for each target bitrate, which increases the memory footprint at the sender and the receiver side, and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and the quality progressively improves as more bits become available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms the classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3 kbps, and it provides a graceful quality boost as the bitrate increases.
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v2 [cs.LG] UPDATED)
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretable fits than existing models.
    Towards the Practical Utility of Federated Learning in the Medical Domain. (arXiv:2207.03075v1 [cs.LG])
    Federated learning (FL) is an active area of research. One of the most suitable areas for adopting FL is the medical domain, where patient privacy must be respected. Previous research, however, does not fully consider who will most likely use FL in the medical domain. It is not the hospitals who are eager to adopt FL, but the service providers such as IT companies who want to develop machine learning models with real patient records. Moreover, service providers would prefer to focus on maximizing the performance of the models at the lowest cost possible. In this work, we propose empirical benchmarks of FL methods considering both performance and monetary cost with three real-world datasets: electronic health records, skin cancer images, and electrocardiogram datasets. We also propose Federated learning with Proximal regularization eXcept local Normalization (FedPxN), which, using a simple combination of FedProx and FedBN, outperforms all other FL algorithms while consuming only slightly more power than the most power efficient method.
    NESC: Robust Neural End-2-End Speech Coding with GANs. (arXiv:2207.03282v1 [eess.AS])
    Neural networks have proven to be a formidable tool to tackle the problem of speech coding at very low bit rates. However, the design of a neural coder that can be operated robustly under real-world conditions remains a major challenge. Therefore, we present the Neural End-2-End Speech Codec (NESC), a robust, scalable end-to-end neural speech codec for high-quality wideband speech coding at 3 kbps. The encoder uses a new architecture configuration, which relies on our proposed Dual-PathConvRNN (DPCRNN) layer, while the decoder architecture is based on our previous work Streamwise-StyleMelGAN. Our subjective listening tests on clean and noisy speech show that NESC is particularly robust to unseen conditions and signal perturbations.
    CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks. (arXiv:2204.10965v3 [cs.CV] UPDATED)
    In this paper, we propose CLIP-Dissect, a new technique to automatically describe the function of individual hidden neurons inside vision networks. CLIP-Dissect leverages recent advances in multimodal vision/language models to label internal neurons with open-ended concepts without the need for any labeled data or human examples, which are required for existing tools to succeed. We show that CLIP-Dissect provides more accurate descriptions than existing methods for last-layer neurons, where the ground truth is available, as well as qualitatively good descriptions for hidden-layer neurons. In addition, our method is very flexible: it is model agnostic, can easily handle new concepts and can be extended to take advantage of better multimodal models in the future. Finally, CLIP-Dissect is computationally efficient and can label all neurons from five layers of ResNet-50 in just four minutes.
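    A minimal sketch of the idea (not the authors' exact scoring, which is more careful about matching activation and similarity profiles): embed a concept list and a set of probe images with CLIP, then label a neuron with the concept whose image-similarity profile best matches the neuron's activation profile over the probes. Requires the openai/CLIP package; the probe images and neuron activations below are random stand-ins.

        import clip
        import torch

        device = "cpu"
        model, preprocess = clip.load("ViT-B/32", device=device)
        concepts = ["dog", "stripes", "sky", "wheel"]
        with torch.no_grad():
            txt = model.encode_text(clip.tokenize(concepts).to(device))
            # Stand-in probe images; in practice, a real probing dataset.
            imgs = model.encode_image(torch.randn(32, 3, 224, 224).to(device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        imgs = imgs / imgs.norm(dim=-1, keepdim=True)
        sim = imgs @ txt.T                          # (n_images, n_concepts)

        neuron_acts = torch.rand(32)                # this neuron's probe activations
        scores = torch.nn.functional.cosine_similarity(
            sim.T.float(), neuron_acts[None, :].float())  # profile match per concept
        print("label:", concepts[int(scores.argmax())])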
    Selectively increasing the diversity of GAN-generated samples. (arXiv:2207.01561v2 [cs.CV] UPDATED)
    Generative Adversarial Networks (GANs) are powerful models able to synthesize data samples closely resembling the distribution of real data, yet the diversity of those generated samples is limited due to the so-called mode collapse phenomenon observed in GANs. Especially prone to mode collapse are conditional GANs, which tend to ignore the input noise vector and focus on the conditional information. Recent methods proposed to mitigate this limitation increase the diversity of generated samples, yet they reduce the performance of the models when similarity of samples is required. To address this shortcoming, we propose a novel method to selectively increase the diversity of GAN-generated samples. By adding a simple, yet effective regularization to the training loss function we encourage the generator to discover new data modes for inputs related to diverse outputs while generating consistent samples for the remaining ones. More precisely, we maximise the ratio of distances between generated images and input latent vectors, scaling the effect according to the diversity of samples for a given conditional input. We show the superiority of our method in a synthetic benchmark as well as a real-life scenario of simulating data from the Zero Degree Calorimeter of the ALICE experiment at the LHC, CERN.
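    The regularizer itself is compact; a minimal unconditional sketch (the paper additionally scales the effect per conditional input according to sample diversity, which is omitted here; the toy generator is illustrative):

        import torch

        G = torch.nn.Sequential(torch.nn.Linear(16, 128), torch.nn.ReLU(),
                                torch.nn.Linear(128, 784))   # toy generator

        z1, z2 = torch.randn(32, 16), torch.randn(32, 16)
        x1, x2 = G(z1), G(z2)
        # Ratio of output distance to latent distance; maximizing it pushes the
        # generator to map distinct latents to distinct images.
        ratio = (x1 - x2).norm(dim=1) / (z1 - z2).norm(dim=1)
        diversity_reg = -ratio.mean()   # add (suitably scaled) to the generator loss
        diversity_reg.backward()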
    Towards Transparency in Dermatology Image Datasets with Skin Tone Annotations by Experts, Crowds, and an Algorithm. (arXiv:2207.02942v1 [cs.CV])
    While artificial intelligence (AI) holds promise for supporting healthcare providers and improving the accuracy of medical diagnoses, a lack of transparency in the composition of datasets exposes AI models to the possibility of unintentional and avoidable mistakes. In particular, public and private image datasets of dermatological conditions rarely include information on skin color. As a start towards increasing transparency, AI researchers have appropriated the use of the Fitzpatrick skin type (FST) from a measure of patient photosensitivity to a measure for estimating skin tone in algorithmic audits of computer vision applications including facial recognition and dermatology diagnosis. In order to understand the variability of estimated FST annotations on images, we compare several FST annotation methods on a diverse set of 460 images of skin conditions from both textbooks and online dermatology atlases. We find the inter-rater reliability between three board-certified dermatologists is comparable to the inter-rater reliability between the board-certified dermatologists and two crowdsourcing methods. In contrast, we find that the Individual Typology Angle converted to FST (ITA-FST) method produces annotations that are significantly less correlated with the experts' annotations than the experts' annotations are correlated with each other. These results demonstrate that algorithms based on ITA-FST are not reliable for annotating large-scale image datasets, but human-centered, crowd-based protocols can reliably add skin type transparency to dermatology datasets. Furthermore, we introduce the concept of dynamic consensus protocols with tunable parameters including expert review that increase the visibility of crowdwork and provide guidance for future crowdsourced annotations of large image datasets.
    Self-Supervised Velocity Estimation for Automotive Radar Object Detection Networks. (arXiv:2207.03146v1 [cs.CV])
    This paper presents a method to learn the Cartesian velocity of objects using an object detection network on automotive radar data. The proposed method is self-supervised in terms of generating its own training signal for the velocities. Labels are only required for single-frame, oriented bounding boxes (OBBs). Labels for the Cartesian velocities or contiguous sequences, which are expensive to obtain, are not required. The general idea is to pre-train an object detection network without velocities using single-frame OBB labels, and then exploit the network's OBB predictions on unlabelled data for velocity training. In detail, the network's OBB predictions of the unlabelled frames are updated to the timestamp of a labelled frame using the predicted velocities and the distances between the updated OBBs of the unlabelled frame and the OBB predictions of the labelled frame are used to generate a self-supervised training signal for the velocities. The detection network architecture is extended by a module to account for the temporal relation of multiple scans and a module to represent the radars' radial velocity measurements explicitly. A two-step approach of first training only OBB detection, followed by training OBB detection and velocities is used. Further, a pre-training with pseudo-labels generated from radar radial velocity measurements bootstraps the self-supervised method of this paper. Experiments on the publicly available nuScenes dataset show that the proposed method almost reaches the velocity estimation performance of a fully supervised training, but does not require expensive velocity labels. Furthermore, we outperform a baseline method which uses only radial velocity measurements as labels.
    Variational Nearest Neighbor Gaussian Process. (arXiv:2202.01694v3 [cs.LG] UPDATED)
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of O($K^3$). Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.
    Multi-objective Optimization of Notifications Using Offline Reinforcement Learning. (arXiv:2207.03029v1 [cs.LG])
    Mobile notification systems play a major role in a variety of applications, sending users alerts and reminders about news, events, or messages. In this paper, we formulate the near-real-time notification decision problem as a Markov Decision Process in which we optimize for multiple objectives in the rewards. We propose an end-to-end offline reinforcement learning framework to optimize sequential notification decisions. We address the challenge of offline learning with a Double Deep Q-network method based on Conservative Q-learning, which mitigates the distributional-shift problem and Q-value overestimation. We illustrate our fully deployed system and demonstrate the performance and benefits of the proposed approach through both offline and online experiments.
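    A rough sketch of the conservative penalty involved (not the authors' implementation; the discrete-action setting and the coefficient alpha are assumptions): a CQL-style loss adds a term that pushes Q-values down on all actions while pushing them up on actions actually observed in the logged data.

```python
import torch
import torch.nn.functional as F

def conservative_q_loss(q_net, states, actions, td_target, alpha=1.0):
    """TD loss plus a CQL(H)-style regularizer for discrete actions.
    q_net maps a batch of states to per-action Q-values, shape (B, A)."""
    q_all = q_net(states)                                        # (B, A)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)    # Q at logged actions
    td_loss = F.mse_loss(q_data, td_target)
    # Conservative term: logsumexp over actions minus Q of the data action
    cql_term = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * cql_term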
    HE-PEx: Efficient Machine Learning under Homomorphic Encryption using Pruning, Permutation and Expansion. (arXiv:2207.03384v1 [cs.CR])
    Privacy-preserving neural network (NN) inference solutions have recently gained significant traction, with several solutions that provide different latency-bandwidth trade-offs. Of these, many rely on homomorphic encryption (HE), a method of performing computations over encrypted data. However, HE operations, even with state-of-the-art schemes, are still considerably slow compared to their plaintext counterparts. Pruning the parameters of a NN model is a well-known approach to improving inference latency. However, pruning methods that are useful in the plaintext context may yield only negligible improvement in the HE case, as recent work has also demonstrated. In this work, we propose a novel set of pruning methods that reduce the latency and memory requirements, thus bringing the effectiveness of plaintext pruning methods to HE. Crucially, our proposal employs two key techniques, namely permutation and expansion of the packed model weights, which enable pruning significantly more ciphertexts and recovering most of the accuracy loss, respectively. We demonstrate the advantage of our method on fully connected layers where the weights are packed using a recently proposed packing technique called tile tensors, which allows executing deep NN inference in a non-interactive mode. We evaluate our methods on various autoencoder architectures and demonstrate that for a small mean-square reconstruction loss of 1.5*10^{-5} on MNIST, we reduce the memory requirement and latency of HE-enabled inference by 60%.
    Boosting the interpretability of clinical risk scores with intervention predictions. (arXiv:2207.02941v1 [cs.LG])
    Machine learning systems show significant promise for forecasting patient adverse events via risk scores. However, these risk scores implicitly encode assumptions about future interventions that the patient is likely to receive, based on the intervention policy present in the training data. Without this important context, predictions from such systems are less interpretable for clinicians. We propose a joint model of intervention policy and adverse event risk as a means to explicitly communicate the model's assumptions about future interventions. We develop such an intervention policy model on MIMIC-III, a real world de-identified ICU dataset, and discuss some use cases that highlight the utility of this approach. We show how combining typical risk scores, such as the likelihood of mortality, with future intervention probability scores leads to more interpretable clinical predictions.
    Harnessing Out-Of-Distribution Examples via Augmenting Content and Style. (arXiv:2207.03162v1 [cs.LG])
    Machine learning models are vulnerable to Out-Of-Distribution (OOD) examples, and this problem has drawn much attention. However, current methods lack a full understanding of the different types of OOD data: there are benign OOD data that can be properly adapted to enhance learning performance, while other, malign OOD data would severely degrade classification results. To harness OOD data, this paper proposes the HOOD method, which can leverage the content and style of each image instance to identify benign and malign OOD data. In particular, we design a variational inference framework to causally disentangle content and style features by constructing a structural causal model. Subsequently, we augment the content and style through an intervention process to produce malign and benign OOD data, respectively. The benign OOD data contain novel styles but hold the contents of interest, and they can be leveraged to help train a style-invariant model. In contrast, the malign OOD data inherit unknown contents but carry familiar styles, and detecting them can improve model robustness against deceiving anomalies. Thanks to the proposed novel disentanglement and data augmentation techniques, HOOD can effectively deal with OOD examples in unknown and open environments, and its effectiveness is empirically validated in three typical OOD applications, including OOD detection, open-set semi-supervised learning, and open-set domain adaptation.
    Shell Language Processing: Unix command parsing for Machine Learning. (arXiv:2107.02438v3 [cs.LG] UPDATED)
    In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding directed at parsing Unix and Linux shell commands. We describe the rationale behind the need for a new approach, with specific examples of when conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and significantly improve the F1 score from 0.392 to 0.874.
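    A toy illustration of why shell commands resist generic NLP tokenizers: quoting and --flag=value syntax must be handled explicitly. This sketch only hints at the kind of handling involved and is not the SLP library's code.

```python
import shlex

def tokenize_shell(cmd: str):
    """Quote-aware shell tokenization: shlex respects quoting, and
    --flag=value tokens (which generic tokenizers mangle) are split
    into flag and value."""
    tokens = []
    for tok in shlex.split(cmd):
        if tok.startswith("-") and "=" in tok:
            flag, _, value = tok.partition("=")
            tokens.extend([flag, value])
        else:
            tokens.append(tok)
    return tokens

print(tokenize_shell('tar -xzf "my archive.tgz" --directory=/tmp'))
# ['tar', '-xzf', 'my archive.tgz', '--directory', '/tmp']
```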
    Automating the Design and Development of Gradient Descent Trained Expert System Networks. (arXiv:2207.02845v1 [cs.LG])
    Prior work introduced a gradient descent trained expert system that conceptually combines the learning capabilities of neural networks with the understandability and defensible logic of an expert system. This system was shown to be able to learn patterns from data and to perform decision-making at levels rivaling those reported by neural network systems. The principal limitation of the approach, though, was the necessity of manually developing a rule-fact network (which is then trained using backpropagation). This paper proposes a technique for overcoming this significant limitation, as compared to neural networks. Specifically, it proposes the use of rule-fact networks that are larger and denser than the application needs, which are trained, pruned, manually reviewed, and then re-trained for use. Multiple types of networks are evaluated under multiple operating conditions, and these results are presented and assessed. Based on these individual experimental condition assessments, the proposed technique is evaluated. The data presented show that mean error rates as low as 3.9% (median 1.2%) can be obtained, demonstrating the efficacy of this technique for many applications.
    Betty: An Automatic Differentiation Library for Multilevel Optimization. (arXiv:2207.02849v1 [cs.LG])
    Multilevel optimization has been widely adopted as a mathematical foundation for a myriad of machine learning problems, such as hyperparameter optimization, meta-learning, and reinforcement learning, to name a few. Nonetheless, implementing multilevel optimization programs oftentimes requires expertise in both mathematics and programming, stunting research in this field. We take an initial step towards closing this gap by introducing Betty, a high-level software library for gradient-based multilevel optimization. To this end, we develop an automatic differentiation procedure based on a novel interpretation of multilevel optimization as a dataflow graph. We further abstract the main components of multilevel optimization as Python classes, to enable easy, modular, and maintainable programming. We empirically demonstrate that Betty can be used as a high-level programming interface for an array of multilevel optimization programs, while also observing up to an 11% increase in test accuracy, a 14% decrease in GPU memory usage, and a 20% decrease in wall time over existing implementations on multiple benchmarks. The code is available at this http URL.
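    For intuition, the hypergradient that such libraries automate can be computed by hand for a single differentiable inner step. This is a generic sketch of gradient-based bilevel optimization, not Betty's actual API.

```python
import torch

# Inner problem: ridge regression weights w; outer problem: tune the
# regularization strength lam against validation loss.
w = torch.randn(5, requires_grad=True)
lam = torch.tensor(0.1, requires_grad=True)
x_tr, y_tr = torch.randn(20, 5), torch.randn(20)
x_va, y_va = torch.randn(20, 5), torch.randn(20)

inner_loss = ((x_tr @ w - y_tr) ** 2).mean() + lam * (w ** 2).sum()
g = torch.autograd.grad(inner_loss, w, create_graph=True)[0]
w_new = w - 0.1 * g                               # one differentiable inner step

outer_loss = ((x_va @ w_new - y_va) ** 2).mean()
hypergrad = torch.autograd.grad(outer_loss, lam)[0]  # d(outer loss)/d(lam)
print(hypergrad)
```

    Libraries such as Betty generalize this pattern to many inner steps, nested levels, and different hypergradient approximations.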
    Cardiomegaly Detection using Deep Convolutional Neural Network with U-Net. (arXiv:2205.11515v2 [eess.IV] UPDATED)
    Cardiomegaly is a medical condition in which the heart is enlarged. It is easier to manage if caught early, so early detection is critical. The chest X-ray, one of the most frequently used radiography examinations, has been used to detect and visualize abnormalities of human organs for decades, and it is also a significant diagnostic tool for cardiomegaly. Even for domain experts, distinguishing the many types of diseases from an X-ray is a difficult and time-consuming task. Deep learning models are most effective when used on huge datasets, yet due to privacy concerns, large datasets are rarely available in the medical industry. A deep learning-based, customized, retrained U-Net model for detecting cardiomegaly is presented in this research. In the training phase, chest X-ray images from the open-source "ChestX-ray8" dataset are used. To reduce computing time, the model performs data preprocessing, image enhancement, image compression, and classification before moving on to the training step. On a chest X-ray image dataset, the work achieved a diagnostic accuracy of 94%, a sensitivity of 96.2%, and a specificity of 92.5%, which beats prior pre-trained model results for identifying cardiomegaly.
    Reward is enough for convex MDPs. (arXiv:2106.00661v3 [cs.AI] UPDATED)
    Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called 'pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) 'players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
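    The Fenchel-duality reformulation can be sketched as follows (our paraphrase of the construction, with assumed notation): writing $\lambda_\pi$ for the stationary state-action distribution induced by policy $\pi$ and $f^{*}$ for the convex conjugate of $f$,

```latex
\min_{\pi} f(\lambda_{\pi})
\;=\;
\min_{\lambda \in \Lambda}\; \max_{c}\;
\big[\, \langle \lambda, c \rangle - f^{*}(c) \,\big]
```

    so the inner player picks a cost (negative reward) adversarially while the outer player optimizes the policy against it, recovering a standard-RL subproblem for each fixed cost.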
    Federated Robustness Propagation: Sharing Robustness in Heterogeneous Federated Learning. (arXiv:2106.10196v2 [cs.LG] UPDATED)
    Federated learning (FL) has emerged as a popular distributed learning schema that learns a model from a set of participating users without sharing raw data. One major challenge of FL comes with heterogeneous users, who may have distributionally different (or non-iid) data and varying computation resources. As federated users would use the model for prediction, they often demand that the trained model be robust against malicious attackers at test time. Whereas adversarial training (AT) provides a sound solution for centralized learning, extending its usage to federated users poses significant challenges, as many users may have very limited training data and tight computational budgets and cannot afford the data-hungry and costly AT. In this paper, we study a novel FL strategy: propagating adversarial robustness from rich-resource users that can afford AT to those with poor resources that cannot, during federated learning. We show that existing FL techniques cannot be effectively integrated with this strategy to propagate robustness among non-iid users, and we propose an efficient propagation approach based on the proper use of batch normalization. We demonstrate the rationality and effectiveness of our method through extensive experiments. In particular, the proposed method is shown to grant federated models remarkable robustness even when only a small portion of users afford AT during learning. Source code will be released.
    The Multivariate Community Hawkes Model for Dependent Relational Events in Continuous-time Networks. (arXiv:2205.00639v2 [stat.ME] UPDATED)
    The stochastic block model (SBM) is one of the most widely used generative models for network data. Many continuous-time dynamic network models are built upon the same assumption as the SBM: edges or events between all pairs of nodes are conditionally independent given the block or community memberships, which prevents them from reproducing higher-order motifs such as triangles that are commonly observed in real networks. We propose the multivariate community Hawkes (MULCH) model, an extremely flexible community-based model for continuous-time networks that introduces dependence between node pairs using structured multivariate Hawkes processes. We fit the model using a spectral clustering and likelihood-based local refinement procedure. We find that our proposed MULCH model is far more accurate than existing models both for predictive and generative tasks.
    Unsupervised Manifold Alignment with Joint Multidimensional Scaling. (arXiv:2207.02968v1 [stat.ML])
    We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment.
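    As a rough point of reference (not the paper's joint optimization), a naive two-stage baseline embeds each dissimilarity matrix separately with MDS and then aligns the embeddings with an orthogonal Procrustes rotation. Note this baseline assumes known row correspondences between the datasets, which the paper's Wasserstein Procrustes formulation is designed to avoid.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 10)), rng.normal(size=(50, 10))
# Intra-dataset pairwise dissimilarities (the only input the method needs)
D1 = np.linalg.norm(X1[:, None] - X1[None], axis=-1)
D2 = np.linalg.norm(X2[:, None] - X2[None], axis=-1)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z1 = mds.fit_transform(D1)
Z2 = mds.fit_transform(D2)

# Align Z2 onto Z1; this step assumes known correspondences, unlike the paper
R, _ = orthogonal_procrustes(Z2, Z1)
Z2_aligned = Z2 @ R
```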
    Virtual staining of defocused autofluorescence images of unlabeled tissue using deep neural networks. (arXiv:2207.02946v1 [eess.IV])
    Deep learning-based virtual staining was developed to introduce image contrast to label-free tissue sections, digitally matching the histological staining, which is time-consuming, labor-intensive, and destructive to tissue. Standard virtual staining requires high autofocusing precision during the whole slide imaging of label-free tissue, which consumes a significant portion of the total imaging time and can lead to tissue photodamage. Here, we introduce a fast virtual staining framework that can stain defocused autofluorescence images of unlabeled tissue, achieving equivalent performance to virtual staining of in-focus label-free images, also saving significant imaging time by lowering the microscope's autofocusing precision. This framework incorporates a virtual-autofocusing neural network to digitally refocus the defocused images and then transforms the refocused images into virtually stained images using a successive network. These cascaded networks form a collaborative inference scheme: the virtual staining model regularizes the virtual-autofocusing network through a style loss during the training. To demonstrate the efficacy of this framework, we trained and blindly tested these networks using human lung tissue. Using 4x fewer focus points with 2x lower focusing precision, we successfully transformed the coarsely-focused autofluorescence images into high-quality virtually stained H&E images, matching the standard virtual staining framework that used finely-focused autofluorescence input images. Without sacrificing the staining quality, this framework decreases the total image acquisition time needed for virtual staining of a label-free whole-slide image (WSI) by ~32%, together with a ~89% decrease in the autofocusing time, and has the potential to eliminate the laborious and costly histochemical staining process in pathology.
    Efficient Self-supervised Vision Transformers for Representation Learning. (arXiv:2106.09785v2 [cs.CV] UPDATED)
    This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models are publicly available: https://github.com/microsoft/esvit
    BioLCNet: Reward-modulated Locally Connected Spiking Neural Networks. (arXiv:2109.05539v5 [cs.NE] UPDATED)
    Brain-inspired computation and information processing alongside compatibility with neuromorphic hardware have made spiking neural networks (SNN) a promising method for solving learning tasks in machine learning (ML). Spiking neurons are only one of the requirements for building a bio-plausible learning model. Network architecture and learning rules are other important factors to consider when developing such artificial agents. In this work, inspired by the human visual pathway and the role of dopamine in learning, we propose a reward-modulated locally connected spiking neural network, BioLCNet, for visual learning tasks. To extract visual features from Poisson-distributed spike trains, we used local filters that are more analogous to the biological visual system compared to convolutional filters with weight sharing. In the decoding layer, we applied a spike population-based voting scheme to determine the decision of the network. We employed Spike-timing-dependent plasticity (STDP) for learning the visual features, and its reward-modulated variant (R-STDP) for training the decoder based on the reward or punishment feedback signal. For evaluation, we first assessed the robustness of our rewarding mechanism to varying target responses in a classical conditioning experiment. Afterwards, we evaluated the performance of our network on image classification tasks of MNIST and XOR MNIST datasets.
    Comprehensive Analysis of Negative Sampling in Knowledge Graph Representation Learning. (arXiv:2206.10140v2 [cs.LG] UPDATED)
    Negative sampling (NS) loss plays an important role in learning knowledge graph embeddings (KGE) for handling a huge number of entities. However, the performance of KGE degrades unless hyperparameters of the NS loss, such as the margin term and the number of negative samples, are appropriately selected. Currently, empirical hyperparameter tuning addresses this problem at the cost of computational time. To solve this problem, we theoretically analyzed the NS loss to assist hyperparameter tuning and to better understand its use in KGE learning. Our theoretical analysis showed that scoring methods with restricted value ranges, such as TransE and RotatE, require adjustments of the margin term or the number of negative samples different from those required by methods without restricted value ranges, such as RESCAL, ComplEx, and DistMult. We also propose subsampling methods specialized for the NS loss in KGE, studied from a theoretical perspective. Our empirical analysis on the FB15k-237, WN18RR, and YAGO3-10 datasets showed that the results of actually trained models agree with our theoretical findings.
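    One common form of the NS loss under discussion, for distance-based scoring functions, is shown below; treat this as a representative form rather than the paper's exact objective. Here $\sigma$ is the sigmoid, $\gamma$ the margin, $\nu$ the number of negatives, $s$ the scoring function, and $(h_i', r, t_i')$ corrupted triples:

```latex
\mathcal{L}_{\mathrm{NS}} =
-\log \sigma\big(\gamma - s(h, r, t)\big)
-\frac{1}{\nu}\sum_{i=1}^{\nu} \log \sigma\big(s(h_i', r, t_i') - \gamma\big)
```

    The margin $\gamma$ and the count $\nu$ are exactly the hyperparameters whose interaction with the scoring function's value range the paper analyzes.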
    Machine Learning to Predict Aerodynamic Stall. (arXiv:2207.03424v1 [physics.flu-dyn])
    A convolutional autoencoder is trained using a database of airfoil aerodynamic simulations and assessed in terms of overall accuracy and interpretability. The goal is to predict the stall and to investigate the ability of the autoencoder to distinguish between the linear and non-linear response of the airfoil pressure distribution to changes in the angle of attack. After a sensitivity analysis on the learning infrastructure, we investigate the latent space identified by the autoencoder targeting extreme compression rates, i.e. very low-dimensional reconstructions. We also propose a strategy to use the decoder to generate new synthetic airfoil geometries and aerodynamic solutions by interpolation and extrapolation in the latent representation learned by the autoencoder.
    Building Machine Translation Systems for the Next Thousand Languages. (arXiv:2205.03983v3 [cs.CL] UPDATED)
    In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
    Exploring Runtime Decision Support for Trauma Resuscitation. (arXiv:2207.02922v1 [cs.AI])
    AI-based recommender systems have been successfully applied in many domains (e.g., e-commerce, feed ranking). Medical experts believe that incorporating such methods into a clinical decision support system may help reduce medical team errors and improve patient outcomes during treatment processes (e.g., trauma resuscitation, surgical processes). Limited research, however, has been done to develop automatic, data-driven treatment decision support. We explored the feasibility of building a treatment recommender system to provide runtime next-minute activity predictions. The system uses patient context (e.g., demographics and vital signs) and process context (e.g., activities) to continuously predict activities that will be performed in the next minute. We evaluated our system on a pre-recorded dataset of trauma resuscitation and conducted an ablation study on different model variants. The best model achieved an average F1-score of 0.67 for 61 activity types. We include medical team feedback and discuss future work.
    Perfusion imaging in deep prostate cancer detection from mp-MRI: can we take advantage of it?. (arXiv:2207.02854v1 [eess.IV])
    To our knowledge, all deep computer-aided detection and diagnosis (CAD) systems for prostate cancer (PCa) detection consider bi-parametric magnetic resonance imaging (bp-MRI) only, including T2w and ADC sequences while excluding the 4D perfusion sequence, which is nevertheless part of standard clinical protocols for this diagnostic task. In this paper, we investigate strategies to integrate information from perfusion imaging into deep neural architectures. To do so, we evaluate several ways to encode the perfusion information in a U-Net-like architecture, also considering early versus mid fusion strategies. We compare the performance of multiparametric MRI (mp-MRI) models with a baseline bp-MRI model on a private dataset of 219 mp-MRI exams. Perfusion maps derived from dynamic contrast-enhanced MR exams are shown to positively impact segmentation and grading performance for PCa lesions, especially the 3D MR volume corresponding to the maximum slope of the wash-in curve as well as Tmax perfusion maps. The mp-MRI models indeed outperform the bp-MRI one whatever the fusion strategy, with a Cohen's kappa score of 0.318 $\pm$ 0.019 for the bp-MRI model and 0.378 $\pm$ 0.033 for the model including the maximum slope with a mid fusion strategy, which is also competitive with the state of the art.
    Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain. (arXiv:2203.17004v2 [eess.AS] UPDATED)
    Score-based generative models (SGMs) have recently shown impressive results for difficult generative tasks such as the unconditional and conditional generation of natural images and audio signals. In this work, we extend these models to the complex short-time Fourier transform (STFT) domain, proposing a novel training task for speech enhancement using a complex-valued deep neural network. We derive this training task within the formalism of stochastic differential equations (SDEs), thereby enabling the use of predictor-corrector samplers. We provide alternative formulations inspired by previous publications on using generative diffusion models for speech enhancement, avoiding the need for any prior assumptions on the noise distribution and making the training task purely generative which, as we show, results in improved enhancement performance.
    Signed Link Representation in Continuous-Time Dynamic Signed Networks. (arXiv:2207.03408v1 [cs.SI])
    Signed networks allow us to model bi-faceted relationships and interactions, such as friend/enemy, support/oppose, etc. These interactions are often temporal in real datasets, where nodes and edges appear over time. Learning the dynamics of signed networks is thus crucial to effectively predict the sign and strength of future links. Existing works model either signed networks or dynamic networks but not both together. In this work, we study dynamic signed networks where links are both signed and evolving with time. Our model learns a Signed link's Evolution using Memory modules and Balanced Aggregation (hence, the name SEMBA). Each node maintains two separate memory encodings for positive and negative interactions. On the arrival of a new edge, each interacting node aggregates this signed information with its memories while exploiting balance theory. Node embeddings are generated using updated memories, which are then used to train for multiple downstream tasks, including link sign prediction and link weight prediction. Our results show that SEMBA outperforms all the baselines on the task of sign prediction by achieving up to an 8% increase in the AUC and up to a 50% reduction in FPR. Results on the task of predicting signed weights show that SEMBA reduces the mean squared error by 9% while achieving up to 69% reduction in the KL-divergence on the distribution of predicted signed weights.
    Efficient fine-grained road segmentation using superpixel-based CNN and CRF models. (arXiv:2207.02844v1 [cs.CV])
    Toward safe and comfortable driving, road scene segmentation is a fundamental problem in camera-based advanced driver assistance systems (ADAS). Despite the great achievements of Convolutional Neural Networks (CNNs) on semantic segmentation tasks, the high computational effort of CNN-based methods remains a challenge. In recent work, we proposed a novel approach to exploit the advantages of CNNs for road segmentation at reasonable computational effort. The runtime benefits from using irregular superpixels, rather than the image grid, as the basis for the CNN input, which tremendously reduces the input size. Although this method achieved remarkably low computational time in both training and testing phases, the lower resolution of the superpixel domain naturally yields lower accuracy compared to expensive state-of-the-art methods. In this work, we focus on refining the road segmentation using a Conditional Random Field (CRF). The refinement procedure is limited to the superpixels touching the predicted road boundary, keeping the additional computational effort low. Reducing the input to the superpixel domain allows the CNN's structure to stay small and efficient to compute while keeping the advantage of convolutional layers, making it eligible for ADAS. Applying the CRF compensates for the trade-off between accuracy and computational efficiency. The proposed system obtains comparable performance to the top-performing algorithms on the KITTI road benchmark, and its fast inference makes it particularly suitable for real-time applications.
    Algebraic and machine learning approach to hierarchical triple-star stability. (arXiv:2207.03151v1 [astro-ph.SR])
    We present two approaches to determine the dynamical stability of a hierarchical triple-star system. The first is an improvement on the semi-analytical stability criterion of Mardling & Aarseth (2001), where we introduce a dependence on inner orbital eccentricity and improve the dependence on mutual orbital inclination. The second involves a machine learning approach, in which we use a multilayer perceptron (MLP) to classify triple-star systems as 'stable' or 'unstable'. To achieve this, we generate a large training dataset of 10^6 hierarchical triples using the N-body code MSTAR. Both approaches perform better than the original Mardling & Aarseth (2001) stability criterion, with the MLP model performing best. The improved stability formula and the machine learning model have overall classification accuracies of 93% and 95%, respectively. Our MLP model, which accurately predicts the stability of any hierarchical triple-star system within the parameter ranges studied with almost no computation required, is publicly available on GitHub in the form of an easy-to-use Python script.
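    A toy illustration of the second approach; the features, labels, architecture, and training setup below are placeholders, not the paper's, and the synthetic labels exist only to make the sketch runnable.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Placeholder orbital features (e.g., mass ratios, eccentricities,
# period ratio, mutual inclination); the paper's inputs differ.
X = rng.uniform(size=(1000, 5))
y = (X[:, 2] > 0.5).astype(int)      # toy "stable/unstable" labels

clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))               # training accuracy on the toy data
```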
    Multi-scale Sinusoidal Embeddings Enable Learning on High Resolution Mass Spectrometry Data. (arXiv:2207.02980v1 [cs.LG])
    Small molecules in biological samples are studied to provide information about disease states, environmental toxins, natural-product drug discovery, and many other applications. The primary window into the composition of small-molecule mixtures is tandem mass spectrometry (MS2), which produces high-sensitivity data with part-per-million resolution. We adopt multi-scale sinusoidal embeddings of the mass data in MS2, designed to meet the challenge of learning from the full resolution of MS2 data. Using these embeddings, we provide a new state-of-the-art model for spectral library search, the standard task for initial evaluation of MS2 data. We also introduce a new task, chemical property prediction from MS2 data, which has natural applications in high-throughput MS2 experiments, and show that an average $R^2$ of 80% can be achieved for novel compounds across 10 chemical properties prioritized by medicinal chemists. We use dimensionality reduction techniques and experiments with different floating-point resolutions to show the essential role multi-scale sinusoidal embeddings play in learning from MS2 data.
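    A hedged sketch of what a multi-scale sinusoidal embedding of a single m/z value can look like; the dimension count and wavelength range are illustrative assumptions, not the paper's settings. Geometrically spaced wavelengths let the embedding represent both integer-mass differences and part-per-million-scale differences.

```python
import numpy as np

def sinusoidal_mass_embedding(mz, n_dims=64, min_wave=1e-3, max_wave=1e4):
    """Embed a continuous m/z value with sines and cosines at
    geometrically spaced wavelengths, in the spirit of transformer
    positional encodings applied to mass values."""
    half = n_dims // 2
    waves = min_wave * (max_wave / min_wave) ** (np.arange(half) / (half - 1))
    angles = 2 * np.pi * mz / waves
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = sinusoidal_mass_embedding(524.2648)   # a fragment m/z value
print(emb.shape)                            # (64,)
```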
    Machine Learning Model Sizes and the Parameter Gap. (arXiv:2207.02852v1 [cs.LG])
    We study trends in model size of notable machine learning systems over time using a curated dataset. From 1950 to 2018, model size in language models increased steadily by seven orders of magnitude. The trend then accelerated, with model size increasing by another five orders of magnitude in just 4 years from 2018 to 2022. Vision models grew at a more constant pace, totaling 7 orders of magnitude of growth between 1950 and 2022. We also identify that, since 2020, there have been many language models below 20B parameters, many models above 70B parameters, but a scarcity of models in the 20-70B parameter range. We refer to that scarcity as the parameter gap. We provide some stylized facts about the parameter gap and propose a few hypotheses to explain it. The explanations we favor are: (a) increasing model size beyond 20B parameters requires adopting different parallelism techniques, which makes mid-sized models less cost-effective, (b) GPT-3 was one order of magnitude larger than previous language models, and researchers afterwards primarily experimented with bigger models to outperform it. While these dynamics likely exist, and we believe they play some role in generating the gap, we don't have high confidence that there are no other, more important dynamics at play.
    Diagnosing and Remedying Shot Sensitivity with Cosine Few-Shot Learners. (arXiv:2207.03398v1 [cs.CV])
    Few-shot recognition involves training an image classifier to distinguish novel concepts at test time using few examples (shot). Existing approaches generally assume that the shot number at test time is known in advance. This is not realistic, and the performance of a popular and foundational method has been shown to suffer when train and test shots do not match. We conduct a systematic empirical study of this phenomenon. In line with prior work, we find that shot sensitivity is broadly present across metric-based few-shot learners, but in contrast to prior work, larger neural architectures provide a degree of built-in robustness to varying test shot. More importantly, a simple, previously known but greatly overlooked class of approaches based on cosine distance consistently and greatly improves robustness to shot variation, by removing sensitivity to sample noise. We derive cosine alternatives to popular and recent few-shot classifiers, broadening their applicability to realistic settings. These cosine models consistently improve shot-robustness, outperform prior shot-robust state of the art, and provide competitive accuracy on a range of benchmarks and architectures, including notable gains in the very-low-shot regime.
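    A rough sketch of the cosine-distance family the paper advocates (function names and the temperature value are our assumptions): class prototypes are mean support embeddings, and logits are scaled cosine similarities, which removes the embedding-norm sensitivity that makes Euclidean metrics shot-sensitive.

```python
import torch
import torch.nn.functional as F

def cosine_prototype_logits(support, support_labels, query, n_way, tau=10.0):
    """support: (N, d) support embeddings; support_labels: (N,) ints in
    [0, n_way); query: (Q, d). Returns (Q, n_way) cosine logits."""
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_way)])    # (n_way, d) prototypes
    q = F.normalize(query, dim=-1)
    p = F.normalize(protos, dim=-1)
    return tau * q @ p.t()                            # scaled cosine similarity

logits = cosine_prototype_logits(
    torch.randn(25, 8), torch.arange(5).repeat(5), torch.randn(10, 8), n_way=5)
```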
    Toward Force Estimation in Robot-Assisted Surgery using Deep Learning with Vision and Robot State. (arXiv:2011.02112v4 [cs.RO] UPDATED)
    Knowledge of interaction forces during teleoperated robot-assisted surgery could be used to enable force feedback to human operators and evaluate tissue handling skill. However, direct force sensing at the end-effector is challenging because it requires biocompatible, sterilizable, and cost-effective sensors. Vision-based deep learning using convolutional neural networks is a promising approach for providing useful force estimates, though questions remain about generalization to new scenarios and real-time inference. We present a force estimation neural network that uses RGB images and robot state as inputs. Using a self-collected dataset, we compared the network to variants that included only a single input type, and evaluated how they generalized to new viewpoints, workspace positions, materials, and tools. We found that vision-based networks were sensitive to shifts in viewpoints, while state-only networks were robust to changes in workspace. The network with both state and vision inputs had the highest accuracy for an unseen tool, and was moderately robust to changes in viewpoints. Through feature removal studies, we found that using only position features produced better accuracy than using only force features as input. The network with both state and vision inputs outperformed a physics-based baseline model in accuracy. It showed comparable accuracy but faster computation times than a baseline recurrent neural network, making it better suited for real-time applications.
    Human-Robot Commensality: Bite Timing Prediction for Robot-Assisted Feeding in Groups. (arXiv:2207.03348v1 [cs.RO])
    We develop data-driven models to predict when a robot should feed during social dining scenarios. Being able to eat independently with friends and family is considered one of the most memorable and important activities for people with mobility limitations. Robots can potentially help with this activity but robot-assisted feeding is a multi-faceted problem with challenges in bite acquisition, bite timing, and bite transfer. Bite timing in particular becomes uniquely challenging in social dining scenarios due to the possibility of interrupting a social human-robot group interaction during commensality. Our key insight is that bite timing strategies that take into account the delicate balance of social cues can lead to seamless interactions during robot-assisted feeding in a social dining scenario. We approach this problem by collecting a multimodal Human-Human Commensality Dataset (HHCD) containing 30 groups of three people eating together. We use this dataset to analyze human-human commensality behaviors and develop bite timing prediction models in social dining scenarios. We also transfer these models to human-robot commensality scenarios. Our user studies show that prediction improves when our algorithm uses multimodal social signaling cues between diners to model bite timing. The HHCD dataset, videos of user studies, and code will be publicly released after acceptance.
    Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation. (arXiv:2207.03334v1 [eess.AS])
    Estimating dimensional emotions, such as activation, valence and dominance, from acoustic speech signals has been widely explored over the past few years. While accurate estimation of activation and dominance from speech seems to be possible, the same for valence remains challenging. Previous research has shown that the use of lexical information can improve valence estimation performance. Lexical information can be obtained from pre-trained acoustic models, where the learned representations can improve valence estimation from speech. We investigate the use of pre-trained model representations to improve valence estimation from the acoustic speech signal. We also explore fusion of representations to improve emotion estimation across all three emotion dimensions: activation, valence and dominance. Additionally, we investigate whether representations from pre-trained models can be distilled into models trained with low-level features, resulting in models with fewer parameters. We show that fusion of pre-trained model embeddings results in a 79% relative improvement in concordance correlation coefficient (CCC) on valence estimation compared to a standard acoustic feature baseline (mel-filterbank energies), while distillation from pre-trained model embeddings to lower-dimensional representations yields a relative 12% improvement. Such performance gains were observed over two evaluation sets, indicating that our proposed architecture generalizes across those evaluation sets. We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation CCC values on two MSP-Podcast evaluation sets.
    DLME: Deep Local-flatness Manifold Embedding. (arXiv:2207.03160v1 [cs.LG])
    Manifold learning (ML) aims to find low-dimensional embeddings of high-dimensional data. Previous works focus on handcrafted or simple datasets with ideal scenarios; however, we find they perform poorly on real-world datasets with under-sampled data. Generally, ML methods first model the data structure and then compute a low-dimensional embedding, where the poor local connectivity of under-sampled data in the former step and inappropriate optimization objectives in the latter step lead to structural distortion and underconstrained embedding. To solve this problem, we propose Deep Local-flatness Manifold Embedding (DLME), a novel ML framework that obtains reliable manifold embeddings by reducing distortion. DLME constructs semantic manifolds via data augmentation and overcomes the structural distortion problem with the help of its smooth framework. To overcome underconstrained embedding, we design a specific loss for DLME and mathematically demonstrate that it leads to a more suitable embedding based on our proposed Local Flatness Assumption. In experiments on downstream classification, clustering, and visualization tasks with three types of datasets (toy, biological, and image), our results show that DLME outperforms state-of-the-art ML and contrastive learning (CL) methods.
    Backpropagation on Dynamical Networks. (arXiv:2207.03093v1 [math.DS])
    Dynamical networks are versatile models that can describe a variety of behaviours such as synchronisation and feedback. However, applying these models in real-world contexts is difficult, as prior information pertaining to the connectivity structure or local dynamics is often unknown and must be inferred from time series observations of network states. Additionally, the influence of coupling interactions between nodes further complicates the isolation of local node dynamics. Given the architectural similarities between dynamical networks and recurrent neural networks (RNNs), we propose a network inference method based on the backpropagation through time (BPTT) algorithm commonly used to train recurrent neural networks. This method aims to simultaneously infer both the connectivity structure and local node dynamics purely from observation of node states. An approximation of local node dynamics is first constructed using a neural network. This is alternated with an adapted BPTT algorithm to regress corresponding network weights by minimising prediction errors of the dynamical network based on the previously constructed local models until convergence is achieved. This method was found to be successful in identifying the connectivity structure for coupled networks of Lorenz, Chua, and FitzHugh-Nagumo oscillators. Free-run prediction performance with the resulting local models and weights was found to be comparable to that of the true system with noisy initial conditions. The method is also extended to non-conventional network couplings such as asymmetric negative coupling.
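    A minimal sketch of the weight-regression half of the idea, under simplifying assumptions: the local dynamics here is a fixed tanh and the unrolling is teacher-forced one-step prediction, whereas the paper learns the local dynamics with a neural network and alternates the two fits.

```python
import torch

N, T = 10, 200
x_obs = torch.randn(T, N)                       # stand-in observed node states
W = (0.1 * torch.randn(N, N)).requires_grad_(True)  # coupling matrix to infer
f = torch.tanh                                  # assumed (known) local dynamics

opt = torch.optim.Adam([W], lr=1e-2)
for epoch in range(100):
    pred = f(x_obs[:-1]) + x_obs[:-1] @ W.t()   # one-step-ahead predictions
    loss = ((pred - x_obs[1:]) ** 2).mean()     # prediction error to minimise
    opt.zero_grad()
    loss.backward()                             # gradients flow back through time
    opt.step()
```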
    DRL-ISP: Multi-Objective Camera ISP with Deep Reinforcement Learning. (arXiv:2207.03081v1 [cs.CV])
    In this paper, we propose a multi-objective camera ISP framework that utilizes Deep Reinforcement Learning (DRL) and a camera ISP toolbox consisting of network-based and conventional ISP tools. The proposed DRL-based camera ISP framework iteratively selects a proper tool from the toolbox and applies it to the image to maximize a given vision-task-specific reward function. For this purpose, we implement a total of 51 ISP tools that include exposure correction, color-and-tone correction, white balance, sharpening, denoising, and others. We also propose an efficient DRL network architecture that can extract various aspects of an image and establish a rigid mapping relationship between images and a large number of actions. Our proposed DRL-based ISP framework effectively improves image quality according to each vision task, such as RAW-to-RGB image restoration, 2D object detection, and monocular depth estimation.
    Group Fairness in Adaptive Submodular Maximization. (arXiv:2207.03364v1 [cs.LG])
    In this paper, we study the classic submodular maximization problem subject to a group fairness constraint under both non-adaptive and adaptive settings. It has been shown that the utility function of many machine learning applications, including data summarization, influence maximization in social networks, and personalized recommendation, satisfies the property of submodularity. Hence, maximizing a submodular function subject to various constraints lies at the heart of many of those applications. At a high level, submodular maximization aims to select a group of most representative items (e.g., data points). However, the design of most existing algorithms does not incorporate the fairness constraint, leading to the under- or over-representation of particular groups. This motivates us to study the fair submodular maximization problem, in which we aim to select a group of items to maximize a (possibly non-monotone) submodular utility function subject to a group fairness constraint. To this end, we develop the first constant-factor approximation algorithm for this problem. The design of our algorithm is robust enough to be extended to solving the submodular maximization problem under a more complicated adaptive setting. Moreover, we further extend our study to incorporate a global cardinality constraint.
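    For intuition, a plain greedy selection under a simple per-group cap looks as follows; this illustrates the flavor of the constraint only and does not reproduce the paper's constant-factor algorithm (which handles non-monotone objectives and adaptivity).

```python
import numpy as np

def fair_greedy(ground_set, groups, f, per_group_cap, budget):
    """Greedy submodular selection with a per-group cap.
    ground_set: list of items; groups: dict item -> group id;
    f: set function taking a list of items; per_group_cap, budget: ints."""
    S = []
    counts = {g: 0 for g in set(groups.values())}
    while len(S) < budget:
        best, best_gain = None, -np.inf
        for e in ground_set:
            if e in S or counts[groups[e]] >= per_group_cap:
                continue                      # respect the fairness cap
            gain = f(S + [e]) - f(S)          # marginal gain of adding e
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break                             # no feasible item remains
        S.append(best)
        counts[groups[best]] += 1
    return S
```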
    Causality-based Neural Network Repair. (arXiv:2204.09274v2 [cs.SE] UPDATED)
    Neural networks have achieved discernible success in a wide range of applications. Their widespread adoption also raises concerns about their dependability and reliability. Similar to traditional decision-making programs, neural networks can have defects that need to be repaired. The defects may cause unsafe behaviors, raise security concerns, or produce unjust societal impacts. In this work, we address the problem of repairing a neural network for desirable properties such as fairness and the absence of backdoors. The goal is to construct a neural network that satisfies the property by (minimally) adjusting the given neural network's parameters (i.e., weights). Specifically, we propose CARE (CAusality-based REpair), a causality-based neural network repair technique that 1) performs causality-based fault localization to identify the 'guilty' neurons and 2) optimizes the parameters of the identified neurons to reduce the misbehavior. We have empirically evaluated CARE on various tasks such as backdoor removal and neural network repair for fairness and safety properties. Our experimental results show that CARE is able to repair all neural networks efficiently and effectively. For fairness repair tasks, CARE successfully improves fairness by 61.91% on average. For backdoor removal tasks, CARE reduces the attack success rate from over 98% to less than 1%. For safety property repair tasks, CARE reduces the property violation rate to less than 1%. Results also show that, thanks to the causality-based fault localization, CARE's repair focuses on the misbehavior and preserves the accuracy of the neural networks.
    DecisioNet -- A Binary-Tree Structured Neural Network. (arXiv:2207.01127v2 [cs.CV] UPDATED)
    Deep neural networks (DNNs) and decision trees (DTs) are both state-of-the-art classifiers. DNNs perform well due to their representational learning capabilities, while DTs are computationally efficient as they perform inference along one route (root-to-leaf) that is dependent on the input data. In this paper, we present DecisioNet (DN), a binary-tree structured neural network. We propose a systematic way to convert an existing DNN into a DN to create a lightweight version of the original model. DecisioNet takes the best of both worlds - it uses neural modules to perform representational learning and utilizes its tree structure to perform only a portion of the computations. We evaluate various DN architectures, along with their corresponding baseline models on the FashionMNIST, CIFAR10, and CIFAR100 datasets. We show that the DN variants achieve similar accuracy while significantly reducing the computational cost of the original network.
    FewSOL: A Dataset for Few-Shot Object Learning in Robotic Environments. (arXiv:2207.03333v1 [cs.CV])
    We introduce the Few-Shot Object Learning (FewSOL) dataset for object recognition with a few images per object. We captured 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. We investigated (i) few-shot object classification and (ii) joint object segmentation and few-shot classification with the state-of-the-art methods for few-shot learning and meta-learning using our dataset. The evaluation results show that there is still a large margin to be improved for few-shot object classification in robotic environments. Our dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition. The dataset and code are available at https://irvlutd.github.io/FewSOL.
    Red PANDA: Disambiguating Anomaly Detection by Removing Nuisance Factors. (arXiv:2207.03478v1 [cs.CV])
    Anomaly detection methods strive to discover patterns that differ from the norm in a semantic way. This goal is ambiguous, as a data point differing from the norm by an attribute, e.g., age, race or gender, may be considered anomalous by some operators while others may consider the attribute irrelevant. Breaking from previous research, we present a new anomaly detection method that allows operators to exclude an attribute from being considered relevant for anomaly detection. Our approach then learns representations which do not contain information about the nuisance attributes. Anomaly scoring is performed using a density-based approach. Importantly, our approach does not require specifying the attributes that are relevant for detecting anomalies, which is typically impossible in anomaly detection, but only the attributes to ignore. An empirical investigation is presented verifying the effectiveness of our approach.
    SSLGuard: A Watermarking Scheme for Self-supervised Learning Pre-trained Encoders. (arXiv:2201.11692v3 [cs.CR] UPDATED)
    Self-supervised learning is an emerging machine learning (ML) paradigm. Compared to supervised learning which leverages high-quality labeled datasets to achieve good performance, self-supervised learning relies on unlabeled datasets to pre-train powerful encoders which can then be treated as feature extractors for various downstream tasks. The huge amount of data and computational resources consumption makes the encoders themselves become valuable intellectual property of the model owner. Recent research has shown that the ML model's copyright is threatened by model stealing attacks, which aim to train a surrogate model to mimic the behavior of a given model. We empirically show that pre-trained encoders are highly vulnerable to model stealing attacks. However, most of the current efforts of copyright protection algorithms such as watermarking concentrate on classifiers. Meanwhile, the intrinsic challenges of pre-trained encoder's copyright protection remain largely unstudied. We fill the gap by proposing SSLGuard, the first watermarking algorithm for pre-trained encoders. Given a clean pre-trained encoder, SSLGuard injects a watermark into it and outputs a watermarked version. The shadow training technique is also applied to preserve the watermark under potential model stealing attacks. Our extensive evaluation shows that SSLGuard is effective in watermark injection and verification, and is robust against model stealing and other watermark removal attacks such as input noising, output perturbing, overwriting, model pruning, and fine-tuning.
    Brainish: Formalizing A Multimodal Language for Intelligence and Consciousness. (arXiv:2205.00001v3 [cs.AI] UPDATED)
    Having a rich multimodal inner language is an important component of human intelligence that enables several necessary core cognitive functions such as multimodal prediction, translation, and generation. Building upon the Conscious Turing Machine (CTM), a machine model for consciousness proposed by Blum and Blum (2021), we describe the desiderata of a multimodal language called Brainish, comprising words, images, audio, and sensations combined in representations that the CTM's processors use to communicate with each other. We define the syntax and semantics of Brainish before operationalizing this language through the lens of multimodal artificial intelligence, a vibrant research area studying the computational tools necessary for processing and relating information from heterogeneous signals. Our general framework for learning Brainish involves designing (1) unimodal encoders to segment and represent unimodal data, (2) a coordinated representation space that relates and composes unimodal features to derive holistic meaning across multimodal inputs, and (3) decoders to map multimodal representations into predictions (for fusion) or raw data (for translation or generation). Through discussing how Brainish is crucial for communication and coordination in order to achieve consciousness in the CTM, and by implementing a simple version of Brainish and evaluating its capability of demonstrating intelligence on multimodal prediction and retrieval tasks on several real-world image, text, and audio datasets, we argue that such an inner language will be important for advances in machine models of intelligence and consciousness.
    Tensor networks in machine learning. (arXiv:2207.02851v1 [quant-ph])
    A tensor network is a type of decomposition used to express and approximate large arrays of data. A given dataset, quantum state, or higher-dimensional multi-linear map is factored into, and approximated by, a composition of smaller multi-linear maps. This is reminiscent of how a Boolean function might be decomposed into a gate array: that represents a special case of tensor decomposition in which the tensor entries are replaced by 0 and 1 and the factorisation becomes exact. The collection of associated techniques is called tensor network methods; the subject developed independently in several distinct fields of study, which have more recently become interrelated through the language of tensor networks. The central questions in the field relate to the expressibility of tensor networks and the reduction of computational overheads. A merger of tensor networks with machine learning is natural. On the one hand, machine learning can aid in determining a factorization of a tensor network approximating a dataset. On the other hand, a given tensor network structure can be viewed as a machine learning model, in which the tensor network parameters are adjusted to learn or classify a dataset. In this survey, we recover the basics of tensor networks and explain the ongoing effort to develop the theory of tensor networks in machine learning.
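    As a concrete toy example of the decomposition idea, the sketch below builds a tensor-train (matrix-product) factorization via sequential truncated SVDs; this is a standard construction rather than anything specific to the survey, and the rank cap is an illustrative parameter.

```python
import numpy as np

def tensor_train(T, max_rank):
    """Decompose a d-way numpy array T into tensor-train cores of shape
    (r_left, mode_dim, r_right) via sequential truncated SVDs, with every
    bond dimension capped at max_rank."""
    shape, cores, r = T.shape, [], 1
    M = T.reshape(r * shape[0], -1)
    for n in range(len(shape) - 1):
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        rk = min(max_rank, len(S))
        cores.append(U[:, :rk].reshape(r, shape[n], rk))        # core for mode n
        M = (S[:rk, None] * Vt[:rk]).reshape(rk * shape[n + 1], -1)
        r = rk
    cores.append(M.reshape(r, shape[-1], 1))                    # final core
    return cores

cores = tensor_train(np.random.rand(4, 5, 6, 7), max_rank=3)
print([c.shape for c in cores])   # bond dimensions capped at 3
```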
    A Survey on Hyperlink Prediction. (arXiv:2207.02911v1 [cs.LG])
    As a natural extension of link prediction on graphs, hyperlink prediction aims for the inference of missing hyperlinks in hypergraphs, where a hyperlink can connect more than two nodes. Hyperlink prediction has applications in a wide range of systems, from chemical reaction networks, social communication networks, to protein-protein interaction networks. In this paper, we provide a systematic and comprehensive survey on hyperlink prediction. We propose a new taxonomy to classify existing hyperlink prediction methods into four categories: similarity-based, probability-based, matrix optimization-based, and deep learning-based methods. To compare the performance of methods from different categories, we perform a benchmark study on various hypergraph applications using representative methods from each category. Notably, deep learning-based methods prevail over other methods in hyperlink prediction.
    Provable Domain Generalization via Invariant-Feature Subspace Recovery. (arXiv:2201.12919v2 [cs.LG] UPDATED)
    Domain generalization asks for models trained over a set of training environments to perform well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) has been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this paper, we propose to achieve domain generalization with Invariant-feature Subspace Recovery (ISR). Our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments under the data model of Rosenfeld et al. (2021). Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Empirically, our ISRs can obtain superior performance compared with IRM on synthetic benchmarks. In addition, on three real-world image and text datasets, we show that both ISRs can be used as simple yet effective post-processing methods to improve the worst-case accuracy of (pre-)trained models against spurious correlations and group shifts.
    Robot Learning of Mobile Manipulation with Reachability Behavior Priors. (arXiv:2203.04051v3 [cs.RO] UPDATED)
    Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence, maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation in reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we zero-transfer our learned 6D fetching policy with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp  ( 3 min )
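    The Gumbel-Softmax reparameterization mentioned above can be sketched in a few lines; the toy logits for a discrete {use arm, base only} choice are our own illustration (PyTorch also ships F.gumbel_softmax):

        import torch
        import torch.nn.functional as F

        def gumbel_softmax_sample(logits, tau=1.0):
            g = -torch.log(-torch.log(torch.rand_like(logits)))  # Gumbel(0,1) noise
            return F.softmax((logits + g) / tau, dim=-1)  # soft, differentiable one-hot

        logits = torch.tensor([[1.2, -0.3]], requires_grad=True)
        probs = gumbel_softmax_sample(logits, tau=0.5)
        probs[0, 0].backward()   # gradients flow through the discrete choice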
    Characterizing player's playing styles based on Player Vectors for each playing position in the Chinese Football Super League. (arXiv:2205.02731v2 [cs.LG] UPDATED)
    Characterizing playing style is important for football clubs in scouting, monitoring and match preparation. Previous studies considered a player's style as a combination of technical performances, failing to consider the spatial information. Therefore, this study aimed to characterize the playing styles of each playing position in the Chinese Football Super League (CSL) matches, integrating a recently adopted Player Vectors framework. Data from 960 matches of the 2016-2019 CSL were used. Match ratings and ten types of match events with the corresponding coordinates for all the lineup players whose on-pitch time exceeded 45 minutes were extracted. Players were first clustered into 8 positions. A player vector was constructed for each player in each match based on the Player Vectors framework using Nonnegative Matrix Factorization (NMF). Another NMF process was run on the player vectors to extract different types of playing styles. The resulting player vectors revealed 18 different playing styles in the CSL. Six performance indicators of each style were investigated to observe their contributions. In general, the playing styles of forwards and midfielders are in line with football performance evolution trends, while the styles of defenders should be reconsidered. Multifunctional playing styles were also found in highly rated CSL players.  ( 3 min )
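    A hedged sketch of the two-stage NMF pipeline, with stand-in random event counts; the paper's actual feature construction (events plus coordinates per position) is richer:

        import numpy as np
        from sklearn.decomposition import NMF

        match_features = np.random.rand(500, 40)     # player-matches x event features
        vecs = NMF(n_components=30).fit_transform(match_features)  # player vectors
        styles = NMF(n_components=18).fit(vecs)      # 18 playing styles
        print(styles.components_.shape)              # style loadings over vector dims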
    A domain-specific language for describing machine learning datasets. (arXiv:2207.02848v1 [cs.LG])
    Datasets play a central role in the training and evaluation of machine learning (ML) models. But they are also the root cause of many undesired model behaviors, such as biased predictions. To overcome this situation, the ML community is proposing a data-centric cultural shift where data issues are given the attention they deserve, and more standard practices around the gathering and processing of datasets start to be discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, they are difficult to formalize and apply to particular datasets. In this sense, and inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. We believe this DSL will facilitate any ML initiative to leverage and benefit from this data-centric shift in ML (e.g., selecting the most appropriate dataset for a new project or better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin, and it has been published under an open source license.
    DeepAdversaries: Examining the Robustness of Deep Learning Models for Galaxy Morphology Classification. (arXiv:2112.14299v3 [cs.LG] UPDATED)
    With increased adoption of supervised deep learning methods for processing and analysis of cosmological survey data, the assessment of data perturbation effects (that can naturally occur in the data processing and analysis pipelines) and the development of methods that increase model robustness are increasingly important. In the context of morphological classification of galaxies, we study the effects of perturbations in imaging data. In particular, we examine the consequences of using neural networks when training on baseline data and testing on perturbed data. We consider perturbations associated with two primary sources: 1) increased observational noise as represented by higher levels of Poisson noise and 2) data processing noise incurred by steps such as image compression or telescope errors as represented by one-pixel adversarial attacks. We also test the efficacy of domain adaptation techniques in mitigating the perturbation-driven errors. We use classification accuracy, latent space visualizations, and latent space distance to assess model robustness. Without domain adaptation, we find that processing pixel-level errors easily flip the classification into an incorrect class and that higher observational noise makes the model trained on low-noise data unable to classify galaxy morphologies. On the other hand, we show that training with domain adaptation improves model robustness and mitigates the effects of these perturbations, improving the classification accuracy by 23% on data with higher observational noise. Domain adaptation also increases by a factor of ~2.3 the latent space distance between the baseline and the incorrectly classified one-pixel perturbed image, making the model more robust to inadvertent perturbations.  ( 3 min )
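    The two perturbation types studied above are easy to mimic; these minimal versions (our own, for a single-channel image in [0, 1]) illustrate them:

        import numpy as np

        def add_poisson_noise(img, exposure=50.0):
            # lower exposure = higher observational noise
            return np.random.poisson(img * exposure) / exposure

        def one_pixel_attack(img, x, y, value):
            out = img.copy()
            out[y, x] = value    # single-pixel "data processing" error
            return out

        img = np.random.rand(64, 64)
        noisy, flipped = add_poisson_noise(img), one_pixel_attack(img, 10, 20, 1.0)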
    Interpretable Deep Causal Learning for Moderation Effects. (arXiv:2206.10261v2 [cs.LG] UPDATED)
    In this extended abstract paper, we address the problem of interpretability and targeted regularization in causal machine learning models. In particular, we focus on the problem of estimating individual causal/treatment effects under observed confounders, which can be controlled for and moderate the effect of the treatment on the outcome of interest. Black-box ML models adjusted for the causal setting perform generally well in this task, but they lack interpretable output identifying the main drivers of treatment heterogeneity and their functional relationship. We propose a novel deep counterfactual learning architecture for estimating individual treatment effects that can simultaneously: i) convey targeted regularization on, and quantify uncertainty around, the quantity of interest (i.e., the Conditional Average Treatment Effect); ii) disentangle baseline prognostic and moderating effects of the covariates and output interpretable score functions describing their relationship with the outcome. Finally, we demonstrate the use of the method via a simple simulated experiment.  ( 2 min )
    Building separable approximations for quantum states via neural networks. (arXiv:2112.08055v5 [quant-ph] UPDATED)
    Finding the closest separable state to a given target state is a notoriously difficult task, even more difficult than deciding whether a state is entangled or separable. To tackle this task, we parametrize separable states with a neural network and train it to minimize the distance to a given target state, with respect to a differentiable distance, such as the trace distance or Hilbert--Schmidt distance. By examining the output of the algorithm, we obtain an upper bound on the entanglement of the target state, and construct an approximation for its closest separable state. We benchmark the method on a variety of well-known classes of bipartite states and find excellent agreement, even up to local dimension of $d=10$, while providing conjectures and analytic insight for isotropic and Werner states. Moreover, we show our method to be efficient in the multipartite case, considering different notions of separability. Examining three and four-party GHZ and W states we recover known bounds and obtain additional ones, for instance for triseparability.  ( 3 min )
    Patient-specific modelling, simulation and real time processing for constrictive respiratory diseases. (arXiv:2207.01082v2 [eess.IV] UPDATED)
    Asthma is a common chronic disease of the respiratory system causing significant disability and societal burden. It affects over 500 million people worldwide and generated costs exceeding USD 56 billion in the United States in 2011. Managing asthma involves controlling symptoms, preventing exacerbations, and maintaining lung function. Improving asthma control affects the daily life of patients, is associated with a reduced risk of exacerbations and lung function impairment, and reduces both the cost of asthma care and the indirect costs associated with reduced productivity. Understanding the complex dynamics of the pulmonary system and the lung's response to disease, injury, and treatment is fundamental to the advancement of asthma treatment. Computational models of the respiratory system seek to provide a theoretical framework to understand the interaction between structure and function. Their application can improve pulmonary medicine through a patient-specific approach to medicinal methodologies, optimizing drug delivery given the personalized geometry and ventilation patterns of each patient. A three-fold objective addressed within this dissertation becomes prominent at this point. The first part refers to the comprehension of pulmonary pathophysiology and the mechanics of asthma, and subsequently of constrictive pulmonary conditions in general. The second part refers to the design and implementation of tools that facilitate personalized medicine to improve delivery and effectiveness. Finally, the third part refers to the self-management of the condition, meaning that medical personnel and patients have access to tools and methods that allow the former to easily track the course of the condition and the latter, i.e. the patient, to easily self-manage it, alleviating a significant burden from the health system.  ( 3 min )
    An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects. (arXiv:2207.03207v1 [cs.LG])
    When training a machine learning classifier on data where one of the classes is intrinsically rare, the classifier will often assign too few sources to the rare class. To address this, it is common to up-weight the examples of the rare class to ensure it isn't ignored. It is also a frequent practice to train on restricted data where the balance of source types is closer to equal for the same reason. Here we show that these practices can bias the model toward over-assigning sources to the rare class. We also explore how to detect when training data bias has had a statistically significant impact on the trained model's predictions, and how to reduce the bias's impact. While the magnitude of the impact of the techniques developed here will vary with the details of the application, for most cases it should be modest. They are, however, universally applicable whenever a machine learning classification model is used, making them analogous to Bessel's correction to the sample variance.
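    The up-weighting practice in question is a single flag in most libraries; a sketch of the comparison on synthetic data (the rare-class rate here is our own stand-in):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        X = np.random.randn(1000, 5)
        y = (np.random.rand(1000) < 0.02).astype(int)    # intrinsically rare class

        plain = LogisticRegression().fit(X, y)
        weighted = LogisticRegression(class_weight="balanced").fit(X, y)
        # fraction of sources assigned to the rare class under each scheme
        print(plain.predict(X).mean(), weighted.predict(X).mean())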
    Quantum Advantage in Variational Bayes Inference. (arXiv:2207.03104v1 [stat.ML])
    The variational Bayes (VB) inference algorithm is widely used to estimate both the parameters and the unobserved hidden variables in generative statistical models. The algorithm -- inspired by variational methods used in computational physics -- is iterative and can easily get stuck in local minima, even when classical techniques, such as deterministic annealing (DA), are used. We study a VB inference algorithm based on a non-traditional quantum annealing approach -- referred to as quantum annealing variational Bayes (QAVB) inference -- and show that there is indeed a quantum advantage to QAVB over its classical counterparts. In particular, we show that such better performance is rooted in key concepts from quantum mechanics: (i) the ground state of the Hamiltonian of a quantum system -- defined from the given VB problem -- corresponds to an optimal solution for the minimization problem of the variational free energy at very low temperatures; (ii) such a ground state can be achieved by a technique paralleling the quantum annealing process; and (iii) starting from this ground state, the optimal solution to the VB problem can be achieved by increasing the heat bath temperature to unity, and thereby avoiding local minima introduced by spontaneous symmetry breaking observed in classical physics based VB algorithms. We also show that the update equations of QAVB can be potentially implemented using $\lceil \log K \rceil$ qubits and $\mathcal{O} (K)$ operations per step. Thus, QAVB can match the time complexity of existing VB algorithms, while delivering higher performance.
    Lower Bounds on the Generalization Error of Nonlinear Learning Models. (arXiv:2103.14723v3 [stat.ML] UPDATED)
    We study in this paper lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of samples in the training data. We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime. We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks. In the linear case the bound is asymptotically tight. In the nonlinear case, we provide a comparison of our bounds with an empirical study of the stochastic gradient descent algorithm. The analysis uses elements from the theory of large random matrices.  ( 2 min )
    On the Equivalence between Neural Network and Support Vector Machine. (arXiv:2111.06063v2 [stat.ML] UPDATED)
    Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by Neural Tangent Kernel (NTK) \citep{jacot2018neural}. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK \citep{arora2019exact}. However, the equivalence is only known for ridge regression currently \citep{arora2019harnessing}, while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalences between NNs and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate our theory can enable three practical applications, including (i) \textit{non-vacuous} generalization bound of NN via the corresponding KM; (ii) \textit{non-trivial} robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at \url{https://github.com/leslie-CH/equiv-nn-svm}.  ( 3 min )
    Adaptive Resonance Theory-based Topological Clustering with a Divisive Hierarchical Structure Capable of Continual Learning. (arXiv:2201.10713v4 [cs.LG] UPDATED)
    Adaptive Resonance Theory (ART) is considered an effective approach for realizing continual learning thanks to its ability to handle the plasticity-stability dilemma. In general, however, the clustering performance of ART-based algorithms strongly depends on the specification of a similarity threshold, i.e., a vigilance parameter, which is data-dependent and specified by hand. This paper proposes an ART-based topological clustering algorithm with a mechanism that automatically estimates a similarity threshold from the distribution of data points. In addition, for improving information extraction performance, a divisive hierarchical clustering algorithm capable of continual learning is proposed by introducing a hierarchical structure to the proposed algorithm. Experimental results demonstrate that the proposed algorithm has high clustering performance comparable with recently-proposed state-of-the-art hierarchical clustering algorithms.  ( 2 min )
    Inferring Structural Parameters of Low-Surface-Brightness-Galaxies with Uncertainty Quantification using Bayesian Neural Networks. (arXiv:2207.03471v1 [astro-ph.IM])
    Measuring the structural parameters (size, total brightness, light concentration, etc.) of galaxies is a significant first step towards a quantitative description of different galaxy populations. In this work, we demonstrate that a Bayesian Neural Network (BNN) can be used for the inference, with uncertainty quantification, of such morphological parameters from simulated low-surface-brightness galaxy images. Compared to traditional profile-fitting methods, we show that the uncertainties obtained using BNNs are comparable in magnitude, well-calibrated, and the point estimates of the parameters are closer to the true values. Our method is also significantly faster, which is very important with the advent of the era of large galaxy surveys and big data in astrophysics.  ( 2 min )
    Don't overfit the history -- Recursive time series data augmentation. (arXiv:2207.02891v1 [cs.LG])
    Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we need to understand that we fit our model on available data, which is a unique realized history. Training on a single realization often induces severe overfitting and a lack of generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method, denoted as RIM. New samples are generated using a recursive interpolation function of all previous values in such a way that the enhanced samples preserve the original inherent time series dynamics. We perform theoretical analysis to characterize the proposed RIM and to guarantee its test performance. We apply RIM to diverse real-world time series cases to achieve strong performance over non-augmented data on regression, classification, and reinforcement learning tasks.  ( 2 min )
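    A minimal sketch of recursive interpolation augmentation, assuming each augmented point mixes the current value with a running interpolation of all previous values; the mixing weight alpha is our own knob, not the paper's parameterization:

        import numpy as np

        def rim_augment(x, alpha):
            aug, running = np.empty_like(x, dtype=float), float(x[0])
            for t, v in enumerate(x):
                running = alpha * running + (1 - alpha) * v  # recurses over history
                aug[t] = running
            return aug

        series = np.sin(np.linspace(0, 10, 200)) + 0.1 * np.random.randn(200)
        new_sample = rim_augment(series, alpha=0.3)   # one extra realization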
    Training Transformers Together. (arXiv:2207.03481v1 [cs.LG])
    The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.  ( 2 min )
    Back to the Basics: Revisiting Out-of-Distribution Detection Baselines. (arXiv:2207.03061v1 [cs.LG])
    We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).  ( 2 min )
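    The advocated baseline fits in a few lines; a sketch assuming features have already been extracted by the in-distribution classifier (the random arrays below are stand-ins):

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        train_feats = np.random.randn(5000, 512)   # stand-in for learned representations
        test_feats = np.random.randn(100, 512)

        nn = NearestNeighbors(n_neighbors=10).fit(train_feats)
        dists, _ = nn.kneighbors(test_feats)
        scores = dists.mean(axis=1)                # larger = more likely OOD
        is_ood = scores > np.quantile(scores, 0.95)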
    Finding Fallen Objects Via Asynchronous Audio-Visual Integration. (arXiv:2207.03483v1 [cs.CV])
    The way an object looks and sounds provide complementary reflections of its physical properties. In many settings cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset -- the Fallen Objects dataset -- that includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting. As a first step toward addressing this challenge, we develop a set of embodied agent baselines, based on imitation learning, reinforcement learning, and modular planning, and perform an in-depth analysis of the challenge of this new task.  ( 3 min )
    Low-resource Low-footprint Wake-word Detection using Knowledge Distillation. (arXiv:2207.03331v1 [eess.AS])
    As virtual assistants have become more diverse and specialized, so has the demand for application or brand-specific wake words. However, the wake-word-specific datasets typically used to train wake-word detectors are costly to create. In this paper, we explore two techniques to leverage acoustic modeling data for large-vocabulary speech recognition to improve a purpose-built wake-word detector: transfer learning and knowledge distillation. We also explore how these techniques interact with time-synchronous training targets to improve detection latency. Experiments are presented on the open-source "Hey Snips" dataset and a more challenging in-house far-field dataset. Using phone-synchronous targets and knowledge distillation from a large acoustic model, we are able to improve accuracy across dataset sizes for both datasets while reducing latency.  ( 2 min )
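    A standard distillation loss of the kind referred to above, sketched in PyTorch; the temperature and mixing weight are illustrative defaults, not the paper's settings:

        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
            soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                            F.softmax(teacher_logits / T, dim=-1),
                            reduction="batchmean") * T * T   # match soft teacher targets
            hard = F.cross_entropy(student_logits, labels)   # match ground-truth labels
            return alpha * soft + (1 - alpha) * hard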
    Comparing the Utility and Disclosure Risk of Synthetic Data with Samples of Microdata. (arXiv:2207.03339v1 [cs.CR])
    Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the associated utility and disclosure risk of the data. The ability to produce synthetic Census microdata, where the utility and associated risks are clearly understood, could mean that more timely and wider-ranging access to microdata would be possible. This paper follows on from previous work by the authors which mapped synthetic Census data on a risk-utility (R-U) map. The paper presents a framework to measure the utility and disclosure risk of synthetic data by comparing it to samples of the original data of varying sample fractions, thereby identifying the sample fraction which has equivalent utility and risk to the synthetic data. Three commonly used data synthesis packages are compared with some interesting results. Further work is needed in several directions but the methodology looks very promising.  ( 2 min )
    Machine learning of percolation models using graph convolutional neural networks. (arXiv:2207.03368v1 [cond-mat.stat-mech])
    Percolation is an important topic in climate science, physics, materials science, epidemiology, finance, and beyond. Prediction of percolation thresholds with machine learning methods remains challenging. In this paper, we build a powerful graph convolutional neural network to study percolation in both supervised and unsupervised ways. From a supervised learning perspective, the graph convolutional neural network simultaneously and correctly trains on data of different lattice types, such as the square and triangular lattices. From the unsupervised perspective, combining the graph convolutional neural network with the confusion method, the percolation threshold can be obtained from the "W"-shaped performance curve. The findings of this work open up the possibility of building a more general framework that can probe percolation-related phenomena.  ( 2 min )
    For Learning in Symmetric Teams, Local Optima are Global Nash Equilibria. (arXiv:2207.03470v1 [cs.GT])
    Although it has been known since the 1970s that a globally optimal strategy profile in a common-payoff game is a Nash equilibrium, global optimality is a strict requirement that limits the result's applicability. In this work, we show that any locally optimal symmetric strategy profile is also a (global) Nash equilibrium. Furthermore, we show that this result is robust to perturbations to the common payoff and to the local optimum. Applied to machine learning, our result provides a global guarantee for any gradient method that finds a local optimum in symmetric strategy space. While this result indicates stability to unilateral deviation, we nevertheless identify broad classes of games where mixed local optima are unstable under joint, asymmetric deviations. We analyze the prevalence of instability by running learning algorithms in a suite of symmetric games, and we conclude by discussing the applicability of our results to multi-agent RL, cooperative inverse RL, and decentralized POMDPs.  ( 2 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v4 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable, e.g., a decision tree of depth 5 is easier to interpret than one of depth 50. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. We propose a model-agnostic technique to minimize this trade-off. Our strategy is to first learn an oracle, a highly accurate probabilistic model on the training data. The uncertainty in the oracle's predictions is used to learn a sampling distribution for the training data. The interpretable model is then trained on a data sample obtained using this distribution, often leading to significantly greater accuracy. We formulate the sampling strategy as an optimization problem. Our solution possesses the following key favorable properties: (1) it uses a fixed number of seven optimization variables, irrespective of the dimensionality of the data; (2) it is model-agnostic, in that both the interpretable model and the oracle may belong to arbitrary model families; (3) it has a flexible notion of model size, and can accommodate vector sizes; (4) it is a framework, enabling it to benefit from progress in the area of optimization. We also present the following interesting observations: (a) In general, the optimal training distribution at small model sizes is different from the test distribution; (b) This effect exists even when the interpretable model and the oracle are from highly disparate model families: we show this on a text classification task, by using a Gated Recurrent Unit network as an oracle to improve the sequence classification accuracy of a Decision Tree that uses character n-grams; (c) Our technique may be used to identify an optimal training sample of a given sample size, for a model.  ( 3 min )
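    A hedged sketch of the overall loop (oracle, uncertainty-weighted sample, small model); the entropy weighting below stands in for the paper's optimized sampling distribution:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.tree import DecisionTreeClassifier

        X = np.random.randn(2000, 10)
        y = (X[:, 0] * X[:, 1] > 0).astype(int)

        oracle = RandomForestClassifier(n_estimators=200).fit(X, y)
        p = oracle.predict_proba(X)
        entropy = -(p * np.log(p + 1e-12)).sum(axis=1)   # oracle uncertainty per point
        w = entropy / entropy.sum()

        idx = np.random.choice(len(X), size=500, p=w)    # uncertainty-weighted sample
        small = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])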
    Distilling Ensemble of Explanations for Weakly-Supervised Pre-Training of Image Segmentation Models. (arXiv:2207.03335v1 [cs.CV])
    While fine-tuning pre-trained networks has become a popular way to train image segmentation models, such backbone networks for image segmentation are frequently pre-trained using image classification source datasets, e.g., ImageNet. Though image classification datasets could provide the backbone networks with rich visual features and discriminative ability, they are incapable of fully pre-training the target model (i.e., backbone+segmentation modules) in an end-to-end manner. The segmentation modules are left to random initialization in the fine-tuning process due to the lack of segmentation labels in classification datasets. In our work, we propose a method that leverages Pseudo Semantic Segmentation Labels (PSSL), to enable the end-to-end pre-training for image segmentation models based on classification datasets. PSSL was inspired by the observation that the explanation results of classification models, obtained through explanation algorithms such as CAM, SmoothGrad and LIME, would be close to the pixel clusters of visual objects. Specifically, PSSL is obtained for each image by interpreting the classification results and aggregating an ensemble of explanations queried from multiple classifiers to lower the bias caused by single models. With PSSL for every image of ImageNet, the proposed method leverages a weighted segmentation learning procedure to pre-train the segmentation network en masse. Experimental results show that, with ImageNet accompanied by PSSL as the source dataset, the proposed end-to-end pre-training strategy successfully boosts the performance of various segmentation models, i.e., PSPNet-ResNet50, DeepLabV3-ResNet50, and OCRNet-HRNetW18, on a number of segmentation tasks, such as CamVid, VOC-A, VOC-C, ADE20K, and CityScapes, with significant improvements. The source code is available at https://github.com/PaddlePaddle/PaddleSeg.  ( 3 min )
    Market Making with Scaled Beta Policies. (arXiv:2207.03352v1 [q-fin.TR])
    This paper introduces a new representation for the actions of a market maker in an order-driven market. This representation uses scaled beta distributions, and generalises three approaches taken in the artificial intelligence for market making literature: single price-level selection, ladder strategies and "market making at the touch". Ladder strategies place uniform volume across an interval of contiguous prices. Scaled beta distribution based policies generalise these, allowing volume to be skewed across the price interval. We demonstrate that this flexibility is useful for inventory management, one of the key challenges faced by a market maker. In this paper, we conduct three main experiments: first, we compare our more flexible beta-based actions with the special case of ladder strategies; then, we investigate the performance of simple fixed distributions; and finally, we devise and evaluate a simple and intuitive dynamic control policy that adjusts actions in a continuous manner depending on the signed inventory that the market maker has acquired. All empirical evaluations use a high-fidelity limit order book simulator based on historical data with 50 levels on each side.  ( 2 min )
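    A sketch of a scaled beta action, allocating quote volume across contiguous price levels from a beta distribution; the parameter names are ours:

        import numpy as np
        from scipy.stats import beta

        def scaled_beta_volumes(total_volume, n_levels, a, b):
            edges = np.linspace(0.0, 1.0, n_levels + 1)
            mass = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
            return total_volume * mass            # volume per price level

        print(scaled_beta_volumes(100, 5, a=1, b=1))  # a=b=1 recovers a uniform ladder
        print(scaled_beta_volumes(100, 5, a=1, b=3))  # skewed, e.g. for inventory control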
    VecGAN: Image-to-Image Translation with Interpretable Latent Directions. (arXiv:2207.03411v1 [cs.CV])
    We propose VecGAN, an image-to-image translation framework for facial attribute editing with interpretable latent directions. The facial attribute editing task faces the challenges of precise attribute editing with controllable strength and preservation of the other attributes of an image. For this goal, we design the attribute editing by latent space factorization, and for each attribute, we learn a linear direction that is orthogonal to the others. The other component is the controllable strength of the change, a scalar value. In our framework, this scalar can be either sampled or encoded from a reference image by projection. Our work is inspired by the latent space factorization works of fixed pretrained GANs. However, while those models cannot be trained end-to-end and struggle to edit encoded images precisely, VecGAN is end-to-end trained for the image translation task and is successful at editing an attribute while preserving the others. Our extensive experiments show that VecGAN achieves significant improvements over the state of the art for both local and global edits.  ( 2 min )
    Calibrate to Interpret. (arXiv:2207.03324v1 [cs.LG])
    Trustworthy machine learning is driving a large number of ML community works in order to improve ML acceptance and adoption. The main aspects of trustworthy machine learning are the following: fairness, uncertainty, robustness, explainability and formal guarantees. Each of these individual domains has gained the ML community's interest, as is visible from the number of related publications. However, few works tackle the interconnection between these fields. In this paper we show a first link between uncertainty and explainability, by studying the relation between calibration and interpretation. As the calibration of a given model changes the way it scores samples, and interpretation approaches often rely on these scores, it seems safe to assume that the confidence calibration of a model interacts with our ability to interpret such a model. In this paper, we show, in the context of networks trained on image classification tasks, to what extent interpretations are sensitive to confidence calibration. This leads us to suggest a simple practice to improve the interpretation outcomes: Calibrate to Interpret.  ( 2 min )
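    Confidence calibration of the kind studied above is often done post hoc with temperature scaling; a minimal sketch (our own, assuming held-out validation logits and labels):

        import torch
        import torch.nn.functional as F

        def fit_temperature(logits, labels, steps=200, lr=0.01):
            T = torch.ones(1, requires_grad=True)
            opt = torch.optim.Adam([T], lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                F.cross_entropy(logits / T, labels).backward()  # NLL of rescaled logits
                opt.step()
            return T.detach()

        # calibrated = F.softmax(test_logits / fit_temperature(val_logits, val_labels), dim=-1)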
    Learning the Quality of Machine Permutations in Job Shop Scheduling. (arXiv:2207.03244v1 [cs.LG])
    In recent years, the power demonstrated by Machine Learning (ML) has increasingly attracted the interest of the optimization community, which is starting to leverage ML for enhancing and automating the design of optimal and approximate algorithms. One combinatorial optimization problem that has been tackled with ML is the Job Shop scheduling Problem (JSP). Most of the recent works focusing on the JSP and ML are based on Deep Reinforcement Learning (DRL), and only a few of them leverage supervised learning techniques. The recurrent reasons for avoiding supervised learning seem to be the difficulty in casting the right learning task, i.e., what is meaningful to predict, and how to obtain labels. Therefore, we first propose a novel supervised learning task that aims at predicting the quality of machine permutations. Then, we design an original methodology to estimate this quality that allows us to create an accurate sequential deep learning model (binary accuracy above 95%). Finally, we empirically demonstrate the value of predicting the quality of machine permutations by enhancing the performance of a simple Tabu Search algorithm inspired by the works in the literature.  ( 2 min )
    A Solver + Gradient Descent Training Algorithm for Deep Neural Networks. (arXiv:2207.03264v1 [cs.LG])
    We present a novel hybrid algorithm for training Deep Neural Networks that combines the state-of-the-art Gradient Descent (GD) method with a Mixed Integer Linear Programming (MILP) solver, outperforming GD and variants in terms of accuracy, as well as resource and data efficiency for both regression and classification tasks. Our GD+Solver hybrid algorithm, called GDSolver, works as follows: given a DNN $D$ as input, GDSolver invokes GD to partially train $D$ until it gets stuck in a local minimum, at which point GDSolver invokes an MILP solver to exhaustively search a region of the loss landscape around the weight assignments of $D$'s final layer parameters with the goal of tunnelling through and escaping the local minimum. The process is repeated until desired accuracy is achieved. In our experiments, we find that GDSolver not only scales well to additional data and very large model sizes, but also outperforms all other competing methods in terms of rates of convergence and data efficiency. For regression tasks, GDSolver produced models that, on average, had 31.5% lower MSE in 48% less time, and for classification tasks on MNIST and CIFAR10, GDSolver was able to achieve the highest accuracy over all competing methods, using only 50% of the training data that GD baselines required.  ( 2 min )
    Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space. (arXiv:2207.03036v1 [cs.LG])
    This paper addresses an important problem of ranking the pre-trained deep neural networks and screening the most transferable ones for downstream tasks. It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive. Recent advanced methods proposed several lightweight transferability metrics to predict the fine-tuning results. However, these approaches only capture static representations but neglect the fine-tuning dynamics. To this end, this paper proposes a new transferability metric, called \textbf{S}elf-challenging \textbf{F}isher \textbf{D}iscriminant \textbf{A}nalysis (\textbf{SFDA}), which has many appealing benefits that existing works do not have. First, SFDA can embed the static features into a Fisher space and refine them for better separability between classes. Second, SFDA uses a self-challenging mechanism to encourage different pre-trained models to differentiate on hard examples. Third, SFDA can easily select multiple pre-trained models for the model ensemble. Extensive experiments on $33$ pre-trained models of $11$ downstream tasks show that SFDA is efficient, effective, and robust when measuring the transferability of pre-trained models. For instance, compared with the state-of-the-art method NLEEP, SFDA demonstrates an average of $59.1$\% gain while bringing $22.5$x speedup in wall-clock time. The code will be available at \url{https://github.com/TencentARC/SFDA}.  ( 3 min )
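    A loose sketch of Fisher-space transferability scoring (omitting the paper's self-challenging mechanism): fit a linear discriminant on each candidate model's target-task features and use its separability as a ranking proxy. The helper names here are hypothetical:

        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def fisher_score(features, labels):
            lda = LinearDiscriminantAnalysis().fit(features, labels)
            return lda.score(features, labels)   # higher = likely more transferable

        # rank candidates: {name: fisher_score(extract(model, X_tgt), y_tgt) for name, model in zoo}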
    Robust optimal well control using an adaptive multi-grid reinforcement learning framework. (arXiv:2207.03253v1 [cs.LG])
    Reinforcement learning (RL) is a promising tool to solve robust optimal well control problems where the model parameters are highly uncertain, and the system is partially observable in practice. However, RL of robust control policies often relies on performing a large number of simulations. This could easily become computationally intractable for cases with computationally intensive simulations. To address this bottleneck, an adaptive multi-grid RL framework is introduced which is inspired by principles of geometric multi-grid methods used in iterative numerical algorithms. RL control policies are initially learned using computationally efficient low-fidelity simulations using coarse grid discretization of the underlying partial differential equations (PDEs). Subsequently, the simulation fidelity is increased in an adaptive manner towards the highest fidelity simulation that corresponds to the finest discretization of the model domain. The proposed framework is demonstrated using a state-of-the-art, model-free policy-based RL algorithm, namely the Proximal Policy Optimisation (PPO) algorithm. Results are shown for two case studies of robust optimal well control problems, which are inspired by the SPE-10 model 2 benchmark case studies. Prominent gains in computational efficiency are observed using the proposed framework, saving around 60-70% of the computational cost of its single fine-grid counterpart.  ( 2 min )
    Revisiting Pretraining Objectives for Tabular Deep Learning. (arXiv:2207.03208v1 [cs.LG])
    Recent deep learning models for tabular data currently compete with the traditional ML models based on decision trees (GBDT). Unlike GBDT, deep models can additionally benefit from pretraining, which is a workhorse of DL for vision and NLP. For tabular problems, several pretraining methods were proposed, but it is not entirely clear if pretraining provides consistent noticeable improvements and what method should be used, since the methods are often not compared to each other or comparison is limited to the simplest MLP architectures. In this work, we aim to identify the best practices to pretrain tabular DL models that can be universally applied to different datasets and architectures. Among our findings, we show that using the object target labels during the pretraining stage is beneficial for the downstream performance and advocate several target-aware pretraining objectives. Overall, our experiments demonstrate that properly performed pretraining significantly increases the performance of tabular DL models, which often leads to their superiority over GBDTs.  ( 2 min )
    Factorizing Knowledge in Neural Networks. (arXiv:2207.03337v1 [cs.CV])
    In this paper, we explore a novel and ambitious knowledge-transfer task, termed Knowledge Factorization~(KF). The core idea of KF lies in the modularization and assemblability of knowledge: given a pretrained network model as input, KF aims to decompose it into several factor networks, each of which handles only a dedicated task and maintains task-specific knowledge factorized from the source network. Such factor networks are task-wise disentangled and can be directly assembled, without any fine-tuning, to produce the more competent combined-task networks. In other words, the factor networks serve as Lego-brick-like building blocks, allowing us to construct customized networks in a plug-and-play manner. Specifically, each factor network comprises two modules: a common-knowledge module that is task-agnostic and shared by all factor networks, along with a task-specific module dedicated to the factor network itself. We introduce an information-theoretic objective, InfoMax-Bottleneck~(IMB), to carry out KF by optimizing the mutual information between the learned representations and input. Experiments across various benchmarks demonstrate that the derived factor networks yield gratifying performances on not only the dedicated tasks but also disentanglement, while enjoying much better interpretability and modularity. Moreover, the learned common-knowledge representations give rise to impressive results on transfer learning.  ( 2 min )
    Vessel-following model for inland waterways based on deep reinforcement learning. (arXiv:2207.03257v1 [cs.CE])
    While deep reinforcement learning (RL) has been increasingly applied to designing car-following models in recent years, this study aims at investigating the feasibility of RL-based vehicle-following for complex vehicle dynamics and strong environmental disturbances. As a use case, we developed an inland waterways vessel-following model based on realistic vessel dynamics, which considers environmental influences, such as varying stream velocity and river profile. We extracted natural vessel behavior from anonymized AIS data to formulate a reward function that reflects a realistic driving style alongside comfortable and safe navigation. Aiming at high generalization capabilities, we propose an RL training environment that uses stochastic processes to model the leading trajectory and river dynamics. To validate the trained model, we defined different scenarios that have not been seen in training, including realistic vessel-following on the Middle Rhine. Our model demonstrated safe and comfortable driving in all scenarios, proving excellent generalization abilities. Furthermore, traffic oscillations could effectively be dampened by deploying the trained model on a sequence of following vessels.  ( 2 min )
    Pre-training helps Bayesian optimization too. (arXiv:2207.03084v1 [cs.LG])
    Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 3 min )
    Attention Round for Post-Training Quantization. (arXiv:2207.03088v1 [cs.LG])
    At present, quantization methods for neural network models are mainly divided into post-training quantization (PTQ) and quantization-aware training (QAT). Post-training quantization needs only a small amount of data to complete the quantization process, but its quantized models do not perform as well as those from quantization-aware training. This paper presents a novel quantization method called Attention Round. This method gives a parameter w the opportunity to be mapped to any of the possible quantized values, rather than just the two quantized values nearest to w. The probability of being mapped to a given quantized value is negatively correlated with the distance between that value and w, decaying with a Gaussian function. In addition, this paper uses the lossy coding length as a measure to assign bit widths to the different layers of the model, solving the mixed-precision quantization problem while effectively avoiding a combinatorial optimization problem. This paper also performs quantization experiments on different models, and the results confirm the effectiveness of the proposed method. For ResNet18 and MobileNetV2, the post-training quantization proposed in this paper requires only 1,024 training samples and 10 minutes to complete the quantization process, achieving quantization performance on par with quantization-aware training.  ( 3 min )
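    A minimal sketch of the stochastic mapping described above, assuming a fixed uniform grid of quantized values; sigma (our own knob) controls how fast the Gaussian probability decays with distance:

        import numpy as np

        def attention_round(w, levels, sigma=0.05):
            d = np.abs(levels[None, :] - w[:, None])   # distance to every level
            p = np.exp(-(d / sigma) ** 2)              # Gaussian decay
            p /= p.sum(axis=1, keepdims=True)
            idx = [np.random.choice(len(levels), p=pi) for pi in p]
            return levels[idx]

        levels = np.linspace(-1, 1, 16)                # 4-bit uniform grid
        print(attention_round(np.random.uniform(-1, 1, 8), levels))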
    A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. (arXiv:2207.03106v1 [cs.LG])
    We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server. We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents' communication. We propose a simple algorithm named \texttt{FedLinUCB} based on the principle of optimism. We prove that the regret of \texttt{FedLinUCB} is bounded by $\tilde{O}(d\sqrt{\sum_{m=1}^M T_m})$ and the communication complexity is $\tilde{O}(dM^2)$, where $d$ is the dimension of the contextual vector and $T_m$ is the total number of interactions with the environment by $m$-th agent. To the best of our knowledge, this is the first provably efficient algorithm that allows fully asynchronous communication for federated contextual linear bandits, while achieving the same regret guarantee as in the single-agent setting.  ( 2 min )
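    A sketch of the per-agent LinUCB machinery that FedLinUCB builds on; the asynchronous exchange of the sufficient statistics (A, b) with the server is omitted here:

        import numpy as np

        class LinUCB:
            def __init__(self, d, alpha=1.0, lam=1.0):
                self.A = lam * np.eye(d)   # regularized Gram matrix of contexts
                self.b = np.zeros(d)
                self.alpha = alpha

            def choose(self, contexts):    # contexts: (n_arms, d)
                theta = np.linalg.solve(self.A, self.b)
                Ainv = np.linalg.inv(self.A)
                bonus = np.sqrt(np.einsum("ij,jk,ik->i", contexts, Ainv, contexts))
                return int(np.argmax(contexts @ theta + self.alpha * bonus))

            def update(self, x, reward):
                self.A += np.outer(x, x)
                self.b += reward * x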
    Learning Invariant World State Representations with Predictive Coding. (arXiv:2207.02972v1 [cs.LG])
    Self-supervised learning methods overcome the key bottleneck for building more capable AI: limited availability of labeled data. However, one of the drawbacks of self-supervised architectures is that the representations that they learn are implicit and it is hard to extract meaningful information about the encoded world states, such as the 3D structure of the visual scene encoded in a depth map. Moreover, in the visual domain such representations only rarely undergo evaluations that may be critical for downstream tasks, such as vision for autonomous cars. Herein, we propose a framework for evaluating visual representations for illumination invariance in the context of depth perception. We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method. We propose a novel architecture that extends the predictive coding approach: PRedictive Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), which explicitly learns to infer and predict depth from video frames. In PreludeNet, the encoder's stack of predictive coding layers is trained in a self-supervised manner, while the predictive decoder is trained in a supervised manner to infer or predict the depth. We evaluate the robustness of our model on a new synthetic dataset, in which lighting conditions (such as overall illumination and the effect of shadows) can be parametrically adjusted while keeping all other aspects of the world constant. PreludeNet achieves both competitive depth inference performance and next frame prediction accuracy. We also show how this new network architecture, coupled with the hybrid fully-supervised/self-supervised learning method, achieves a balance between said performance and invariance to changes in lighting. The proposed framework for evaluating visual representations can be extended to diverse task domains and invariance tests.  ( 3 min )
    Context-aware Self-supervised Learning for Medical Images Using Graph Neural Network. (arXiv:2207.02957v1 [eess.IV])
    Although self-supervised learning enables us to bootstrap the training by exploiting unlabeled data, the generic self-supervised methods for natural images do not sufficiently incorporate the context. For medical images, a desirable method should be sensitive enough to detect deviation from normal-appearing tissue of each anatomical region; here, anatomy is the context. We introduce a novel approach with two levels of self-supervised representation learning objectives: one on the regional anatomical level and another on the patient-level. We use graph neural networks to incorporate the relationship between different anatomical regions. The structure of the graph is informed by anatomical correspondences between each patient and an anatomical atlas. In addition, the graph representation has the advantage of handling any arbitrarily sized image in full resolution. Experiments on large-scale Computer Tomography (CT) datasets of lung images show that our approach compares favorably to baseline methods that do not account for the context. We use the learned embedding for staging lung tissue abnormalities related to COVID-19.  ( 3 min )
    Model Agnostic Conformal Hyperparameter Optimization. (arXiv:2207.03017v1 [cs.LG])
    Several novel frameworks for hyperparameter search have emerged in the last decade, but most rely on strict, often normal, distributional assumptions, limiting search model flexibility. This paper proposes a novel optimization framework based on conformal prediction, assuming only exchangeability, and allowing for a larger choice of search model architectures and variance estimators. Several such models are explored and benchmarked against random hyperparameter search on both dense and convolutional neural networks, consistently outperforming it in both the final loss achieved and the time to achieve it.  ( 2 min )
    A State Transition Model for Mobile Notifications via Survival Analysis. (arXiv:2207.03099v1 [stat.ML])
    Mobile notifications have become a major communication channel for social networking services to keep users informed and engaged. As more mobile applications push notifications to users, these applications constantly face decisions about what to send, when, and how. A lack of research and methodology commonly leads to heuristic decision making. Many notifications arrive at an inappropriate moment or introduce too many interruptions, failing to provide value to users and spurring users' complaints. In this paper we explore unique features of interactions between mobile notifications and user engagement. We propose a state transition framework to quantitatively evaluate the effectiveness of notifications. Within this framework, we develop a survival model for badging notifications assuming a log-linear structure and a Weibull distribution. Our results show that this model offers applications more flexibility and superior prediction accuracy compared with a logistic regression model. In particular, we provide an online use case on notification delivery time optimization to show how we make better decisions, drive more user engagement, and provide more value to users.  ( 2 min )
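    A hedged sketch of the log-linear Weibull piece of such a model: the scale of the time-to-response distribution depends log-linearly on features, and the survival function gives the probability that no response has occurred by time t. The feature values below are stand-ins:

        import numpy as np

        def weibull_survival(t, x, beta_coef, k):
            lam = np.exp(x @ beta_coef)        # log-linear scale parameter
            return np.exp(-(t / lam) ** k)     # P(no response by time t)

        x = np.array([1.0, 0.3, 2.0])
        beta_coef = np.array([0.5, -0.2, 0.1])
        print(weibull_survival(t=24.0, x=x, beta_coef=beta_coef, k=1.5))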
    Interactive Combinatorial Bandits: Balancing Competitivity and Complementarity. (arXiv:2207.03091v1 [cs.LG])
    We study non-modular function maximization in the online interactive bandit setting. We are motivated by applications where there is a natural complementarity between certain elements: e.g., in a movie recommendation system, watching the first movie in a series complements the experience of watching a second (and a third, etc.). This is not expressible using only submodular functions which can represent only competitiveness between elements. We extend the purely submodular approach in two ways. First, we assume that the objective can be decomposed into the sum of monotone suBmodular and suPermodular function, known as a BP objective. Here, complementarity is naturally modeled by the supermodular component. We develop a UCB-style algorithm, where at each round a noisy gain is revealed after an action is taken that balances refining beliefs about the unknown objectives (exploration) and choosing actions that appear promising (exploitation). Defining regret in terms of submodular and supermodular curvature with respect to a full-knowledge greedy baseline, we show that this algorithm achieves at most $O(\sqrt{T})$ regret after $T$ rounds of play. Second, for those functions that do not admit a BP structure, we provide analogous regret guarantees in terms of their submodularity ratio; this is applicable for functions that are almost, but not quite, submodular. We numerically study the tasks of movie recommendation on the MovieLens dataset, and selection of training subsets for classification. Through these examples, we demonstrate the algorithm's performance as well as the shortcomings of viewing these problems as being solely submodular.  ( 3 min )
    A conditional gradient homotopy method with applications to Semidefinite Programming. (arXiv:2207.03101v1 [math.OC])
    We propose a new homotopy-based conditional gradient method for solving convex optimization problems with a large number of simple conic constraints. Instances of this template naturally appear in semidefinite programming problems arising as convex relaxations of combinatorial optimization problems. Our method is a double-loop algorithm in which the conic constraint is treated via a self-concordant barrier, and the inner loop employs a conditional gradient algorithm to approximate the analytic central path, while the outer loop updates the accuracy imposed on the temporal solution and the homotopy parameter. Our theoretical iteration complexity is competitive when compared with state-of-the-art SDP solvers, with the decisive advantage of cheap projection-free subroutines. Preliminary numerical experiments are provided to illustrate the practical performance of the method.  ( 2 min )
    Quantum compression with classically simulatable circuits. (arXiv:2207.02961v1 [quant-ph])
    As we continue to find applications where the currently available noisy devices exhibit an advantage over their classical counterparts, the efficient use of quantum resources is highly desirable. The notion of quantum autoencoders was proposed as a way to compress quantum information and reduce resource requirements. Here, we present a strategy to design quantum autoencoders using evolutionary algorithms for transforming quantum information into lower-dimensional representations. We successfully demonstrate the initial applications of the algorithm for compressing different families of quantum states. In particular, we point out that using a restricted gate set in the algorithm allows for efficient simulation of the generated circuits. This approach opens the possibility of using classical logic to find low-dimensional representations of quantum data, using fewer computational resources.  ( 2 min )
    The "Collections as ML Data" Checklist for Machine Learning & Cultural Heritage. (arXiv:2207.02960v1 [cs.LG])
    Within the cultural heritage sector, there has been a growing and concerted effort to consider a critical sociotechnical lens when applying machine learning techniques to digital collections. Though the cultural heritage community has collectively developed an emerging body of work detailing responsible operations for machine learning in libraries and other cultural heritage institutions at the organizational level, there remains a paucity of guidelines created specifically for practitioners embarking on machine learning projects. The manifold stakes and sensitivities involved in applying machine learning to cultural heritage underscore the importance of developing such guidelines. This paper addresses this need by formulating a detailed checklist with guiding questions and practices that can be employed while developing a machine learning project that utilizes cultural heritage data. I call the resulting checklist the "Collections as ML Data" checklist, which, when completed, can be published with the deliverables of the project. By surveying existing projects, including my own project, Newspaper Navigator, I justify the "Collections as ML Data" checklist and demonstrate how the formulated guiding questions can be employed and operationalized.  ( 2 min )
    Self-Supervised RF Signal Representation Learning for NextG Signal Classification with Deep Learning. (arXiv:2207.03046v1 [cs.NI])
    Deep learning (DL) finds rich applications in the wireless domain to improve spectrum awareness. Typically, the DL models are either randomly initialized following a statistical distribution or pretrained on tasks from other data domains such as computer vision (in the form of transfer learning) without accounting for the unique characteristics of wireless signals. Self-supervised learning enables the learning of useful representations from Radio Frequency (RF) signals themselves even when only limited training data samples with labels are available. We present the first self-supervised RF signal representation learning model and apply it to the automatic modulation recognition (AMR) task by specifically formulating a set of transformations to capture the wireless signal characteristics. We show that the sample efficiency (the number of labeled samples required to achieve a certain accuracy performance) of AMR can be significantly increased (almost an order of magnitude) by learning signal representations with self-supervised learning. This translates to substantial time and cost savings. Furthermore, self-supervised learning increases the model accuracy compared to the state-of-the-art DL methods and maintains high accuracy even when a small set of training data samples is used.  ( 2 min )
    The Union of Manifolds Hypothesis and its Implications for Deep Generative Modelling. (arXiv:2207.02862v1 [stat.ML])
    Deep learning has had tremendous success at learning low-dimensional representations of high-dimensional data. This success would be impossible if there were no hidden low-dimensional structure in data of interest; this existence is posited by the manifold hypothesis, which states that the data lies on an unknown manifold of low intrinsic dimension. In this paper, we argue that this hypothesis does not properly capture the low-dimensional structure typically present in data. Assuming the data lies on a single manifold implies intrinsic dimension is identical across the entire data space, and does not allow for subregions of this space to have a different number of factors of variation. To address this deficiency, we put forth the union of manifolds hypothesis, which accommodates the existence of non-constant intrinsic dimensions. We empirically verify this hypothesis on commonly-used image datasets, finding that indeed, intrinsic dimension should be allowed to vary. We also show that classes with higher intrinsic dimensions are harder to classify, and how this insight can be used to improve classification accuracy. We then turn our attention to the impact of this hypothesis in the context of deep generative models (DGMs). Most current DGMs struggle to model datasets with several connected components and/or varying intrinsic dimensions. To tackle these shortcomings, we propose clustered DGMs, where we first cluster the data and then train a DGM on each cluster. We show that clustered DGMs can model multiple connected components with different intrinsic dimensions, and empirically outperform their non-clustered counterparts without increasing computational requirements.  ( 3 min )
    Humans Social Relationship Classification during Accompaniment. (arXiv:2207.02890v1 [cs.LG])
    This paper presents the design of deep learning architectures that classify the social relationship between two people walking in a side-by-side formation into four possible categories -- colleagues, couple, family, or friendship. The models are developed using Neural Networks or Recurrent Neural Networks to achieve the classification, and are trained and evaluated using a database of readings obtained from humans performing an accompaniment process in an urban environment. The best model achieves relatively good accuracy on the classification problem, and its results partially improve upon the outcomes of a previous study [1]. Furthermore, the proposed model shows potential for future efficiency improvements and for deployment on a real robot.  ( 2 min )
    Towards Substantive conceptions of Algorithmic Fairness: Normative guidance from Equal Opportunity doctrines. (arXiv:2207.02912v1 [cs.CY])
    In this work we use Equal Opportunity (EO) doctrines from political philosophy to make explicit the normative judgements embedded in different conceptions of algorithmic fairness. We contrast formal EO approaches that narrowly focus on fair contests at discrete decision points, with substantive EO doctrines that look at people's fair life chances more holistically over the course of a lifetime. We use this taxonomy to provide a moral interpretation of the impossibility results as the incompatibility between different conceptions of a fair contest -- forward-looking versus backward-looking -- when people do not have fair life chances. We use this result to motivate substantive conceptions of algorithmic fairness and outline two plausible procedures based on the luck-egalitarian doctrine of EO, and Rawls's principle of fair equality of opportunity.  ( 2 min )
    Scoring Rules for Performative Binary Prediction. (arXiv:2207.02847v1 [cs.LG])
    We construct a model of expert prediction where predictions can influence the state of the world. Under this model, we show through theoretical and numerical results that proper scoring rules can incentivize experts to manipulate the world with their predictions. We also construct a simple class of scoring rules that avoids this problem.  ( 2 min )
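    A tiny numerical illustration of this effect: suppose the announced forecast p shifts the true event probability to a hypothetical response q(p) = 0.3 + 0.5p. Under the (proper) Brier score, the expert's expected loss is then minimized by an exaggerated report rather than by the induced truth.

```python
import numpy as np

# Toy performative-prediction example (q is a hypothetical response function):
# the announced forecast p shifts the event probability to q(p).
def q(p):
    return 0.3 + 0.5 * p

def expected_brier(p):       # E[(p - Y)^2] with Y ~ Bernoulli(q(p))
    return q(p) * (p - 1) ** 2 + (1 - q(p)) * p ** 2

grid = np.linspace(0, 1, 1001)
p_star = grid[np.argmin(expected_brier(grid))]
print(p_star, q(p_star))     # here p* = 1.0 while q(p*) = 0.8: the proper
                             # score rewards manipulating the world
```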
    Local Sample-weighted Multiple Kernel Clustering with Consensus Discriminative Graph. (arXiv:2207.02846v1 [cs.LG])
    Multiple kernel clustering (MKC) is committed to achieving optimal information fusion from a set of base kernels. Constructing precise and local kernel matrices is proved to be of vital significance in applications since the unreliable distant-distance similarity estimation would degrade clustering performance. Although existing localized MKC algorithms exhibit improved performance compared to globally-designed competitors, most of them widely adopt the KNN mechanism to localize the kernel matrix by accounting for $\tau$-nearest neighbors. However, such a coarse manner follows an unreasonable strategy that the ranking importance of different neighbors is equal, which is impractical in applications. To alleviate such problems, this paper proposes a novel local sample-weighted multiple kernel clustering (LSWMKC) model. We first construct a consensus discriminative affinity graph in kernel space, revealing the latent local structures. Further, an optimal neighborhood kernel for the learned affinity graph is output with naturally sparse property and clear block diagonal structure. Moreover, LSWMKC implicitly optimizes adaptive weights on different neighbors with corresponding samples. Experimental results demonstrate that our LSWMKC possesses better local manifold representation and outperforms existing kernel or graph-based clustering algorithms. The source code of LSWMKC can be publicly accessed from https://github.com/liliangnudt/LSWMKC.  ( 2 min )
  • Open

    Some performance considerations when using multi-armed bandit algorithms in the presence of missing data. (arXiv:2205.03820v2 [stat.ML] UPDATED)
    When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, it also affects their implementation, where the simplest approach to overcome this is to continue to sample according to the original bandit algorithm, ignoring missing outcomes. We investigate the impact on performance of this approach to deal with missing data for several bandit algorithms through an extensive simulation study assuming the rewards are missing at random. We focus on two-armed bandit algorithms with binary outcomes in the context of patient allocation for clinical trials with relatively small sample sizes. However, our results apply to other applications of bandit algorithms where missing data is expected to occur. We assess the resulting operating characteristics, including the expected reward. Different probabilities of missingness in both arms are considered. The key finding of our work is that when using the simplest strategy of ignoring missing data, the impact on the expected performance of multi-armed bandit strategies varies according to the way these strategies balance the exploration-exploitation trade-off. Algorithms that are geared towards exploration continue to assign samples to the arm with more missing responses (which, being perceived as the arm with less observed information, is deemed more appealing by the algorithm than it would otherwise be). In contrast, algorithms that are geared towards exploitation would rapidly assign a high value to samples from the arms with a current high mean irrespective of the level of observations per arm. Furthermore, for algorithms focusing more on exploration, we illustrate that the problem of missing responses can be alleviated using a simple mean imputation approach.
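    To make the imputation remedy concrete, here is a minimal simulation sketch of a two-armed Bernoulli bandit with rewards missing at random, where a missing reward is replaced by the arm's current observed mean; epsilon-greedy and all constants are illustrative stand-ins for the designs studied in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p_reward, p_miss = [0.3, 0.5], [0.4, 0.1]   # per-arm reward and missingness probabilities
T, eps = 1000, 0.1
succ, trials = np.zeros(2), np.zeros(2)

for t in range(T):
    means = succ / np.maximum(trials, 1)
    arm = rng.integers(2) if rng.random() < eps else int(np.argmax(means))
    reward = float(rng.random() < p_reward[arm])
    if rng.random() < p_miss[arm]:
        reward = means[arm]   # mean imputation; "ignoring" would skip the update entirely
    succ[arm] += reward
    trials[arm] += 1

print(trials, succ / np.maximum(trials, 1))  # allocations and estimated means
```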
    Neural Stein critics with staged $L^2$-regularization. (arXiv:2207.03406v1 [stat.ML])
    Learning to differentiate model distributions from observed data is a fundamental problem in statistics and machine learning, and high-dimensional data remains a challenging setting for such problems. Metrics that quantify the disparity in probability distributions, such as the Stein discrepancy, play an important role in statistical testing in high dimensions. In this paper, we consider the setting where one wishes to distinguish between data sampled from an unknown probability distribution and a nominal model distribution. While recent studies revealed that the optimal $L^2$-regularized Stein critic equals the difference of the score functions of two probability distributions up to a multiplicative constant, we investigate the role of $L^2$ regularization when training a neural network Stein discrepancy critic function. Motivated by the Neural Tangent Kernel theory of training neural networks, we develop a novel staging procedure for the weight of regularization over training time. This leverages the advantages of highly-regularized training at early times while also empirically delaying overfitting. Theoretically, we relate the training dynamics with large regularization weight to the kernel regression optimization of the "lazy training" regime at early training times. The benefit of the staged $L^2$ regularization is demonstrated on simulated high dimensional distribution drift data and an application to evaluating generative models of image data.
    Reward is enough for convex MDPs. (arXiv:2106.00661v3 [cs.AI] UPDATED)
    Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called 'pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) 'players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
    Federated Robustness Propagation: Sharing Robustness in Heterogeneous Federated Learning. (arXiv:2106.10196v2 [cs.LG] UPDATED)
    Federated learning (FL) emerges as a popular distributed learning schema that learns a model from a set of participating users without sharing raw data. One major challenge of FL comes with heterogeneous users, who may have distributionally different (or non-iid) data and varying computation resources. As federated users would use the model for prediction, they often demand the trained model to be robust against malicious attackers at test time. Whereas adversarial training (AT) provides a sound solution for centralized learning, extending its usage to federated users imposes significant challenges, as many users have very limited training data and tight computational budgets, making the data-hungry and costly AT unaffordable. In this paper, we study a novel FL strategy: propagating adversarial robustness from rich-resource users that can afford AT, to those with poor resources that cannot afford it, during federated learning. We show that existing FL techniques cannot be effectively integrated with the strategy to propagate robustness among non-iid users and propose an efficient propagation approach by the proper use of batch-normalization. We demonstrate the rationality and effectiveness of our method through extensive experiments. Especially, the proposed method is shown to grant federated models remarkable robustness even when only a small portion of users afford AT during learning. Source code will be released.
    Interpretable Deep Causal Learning for Moderation Effects. (arXiv:2206.10261v2 [cs.LG] UPDATED)
    In this extended abstract paper, we address the problem of interpretability and targeted regularization in causal machine learning models. In particular, we focus on the problem of estimating individual causal/treatment effects under observed confounders, which can be controlled for and moderate the effect of the treatment on the outcome of interest. Black-box ML models adjusted for the causal setting perform generally well in this task, but they lack interpretable output identifying the main drivers of treatment heterogeneity and their functional relationship. We propose a novel deep counterfactual learning architecture for estimating individual treatment effects that can simultaneously: i) convey targeted regularization on, and quantify uncertainty around, the quantity of interest (i.e., the Conditional Average Treatment Effect); ii) disentangle baseline prognostic and moderating effects of the covariates and output interpretable score functions describing their relationship with the outcome. Finally, we demonstrate the use of the method via a simple simulated experiment.
    A Mutually Exciting Latent Space Hawkes Process Model for Continuous-time Networks. (arXiv:2205.09263v2 [cs.LG] UPDATED)
    Networks and temporal point processes serve as fundamental building blocks for modeling complex dynamic relational data in various domains. We propose the latent space Hawkes (LSH) model, a novel generative model for continuous-time networks of relational events, using a latent space representation for nodes. We model relational events between nodes using mutually exciting Hawkes processes with baseline intensities dependent upon the distances between the nodes in the latent space and sender and receiver specific effects. We demonstrate that our proposed LSH model can replicate many features observed in real temporal networks including reciprocity and transitivity, while also achieving superior prediction accuracy and providing more interpretable fits than existing models.
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v4 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable, e.g., a decision tree of depth 5 is easier to interpret than one of depth 50. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. We propose a model agnostic technique to minimize this trade-off. Our strategy is to first learn an oracle, a highly accurate probabilistic model on the training data. The uncertainty in the oracle's predictions is used to learn a sampling distribution for the training data. The interpretable model is then trained on a data sample obtained using this distribution, leading often to significantly greater accuracy. We formulate the sampling strategy as an optimization problem. Our solution possesses the following key favorable properties: (1) it uses a fixed number of seven optimization variables, irrespective of the dimensionality of the data (2) it is model agnostic - in that both the interpretable model and the oracle may belong to arbitrary model families (3) it has a flexible notion of model size, and can accommodate vector sizes (4) it is a framework, enabling it to benefit from progress in the area of optimization. We also present the following interesting observations: (a) In general, the optimal training distribution at small model sizes is different from the test distribution; (b) This effect exists even when the interpretable model and the oracle are from highly disparate model families: we show this on a text classification task, by using a Gated Recurrent Unit network as an oracle to improve the sequence classification accuracy of a Decision Tree that uses character n-grams; (c) Our technique may be used to identify an optimal training sample of a given sample size, for a model.  ( 2 min )
    Variational Nearest Neighbor Gaussian Process. (arXiv:2202.01694v3 [cs.LG] UPDATED)
    Variational approximations to Gaussian processes (GPs) typically use a small set of inducing points to form a low-rank approximation to the covariance matrix. In this work, we instead exploit a sparse approximation of the precision matrix. We propose variational nearest neighbor Gaussian process (VNNGP), which introduces a prior that only retains correlations within K nearest-neighboring observations, thereby inducing sparse precision structure. Using the variational framework, VNNGP's objective can be factorized over both observations and inducing points, enabling stochastic optimization with a time complexity of $O(K^3)$. Hence, we can arbitrarily scale the inducing point size, even to the point of putting inducing points at every observed location. We compare VNNGP to other scalable GPs through various experiments, and demonstrate that VNNGP (1) can dramatically outperform low-rank methods, and (2) is less prone to overfitting than other nearest neighbor methods.  ( 2 min )
    The Multivariate Community Hawkes Model for Dependent Relational Events in Continuous-time Networks. (arXiv:2205.00639v2 [stat.ME] UPDATED)
    The stochastic block model (SBM) is one of the most widely used generative models for network data. Many continuous-time dynamic network models are built upon the same assumption as the SBM: edges or events between all pairs of nodes are conditionally independent given the block or community memberships, which prevents them from reproducing higher-order motifs such as triangles that are commonly observed in real networks. We propose the multivariate community Hawkes (MULCH) model, an extremely flexible community-based model for continuous-time networks that introduces dependence between node pairs using structured multivariate Hawkes processes. We fit the model using a spectral clustering and likelihood-based local refinement procedure. We find that our proposed MULCH model is far more accurate than existing models both for predictive and generative tasks.  ( 2 min )
    Exact Matching of Random Graphs with Constant Correlation. (arXiv:2110.05000v2 [math.ST] UPDATED)
    This paper deals with the problem of graph matching or network alignment for Erdős–Rényi graphs, which can be viewed as a noisy average-case version of the graph isomorphism problem. Let $G$ and $G'$ be $G(n, p)$ Erdős–Rényi graphs marginally, identified with their adjacency matrices. Assume that $G$ and $G'$ are correlated such that $\mathbb{E}[G_{ij} G'_{ij}] = p(1-\alpha)$. For a permutation $\pi$ representing a latent matching between the vertices of $G$ and $G'$, denote by $G^\pi$ the graph obtained from permuting the vertices of $G$ by $\pi$. Observing $G^\pi$ and $G'$, we aim to recover the matching $\pi$. In this work, we show that for every $\varepsilon \in (0,1]$, there is $n_0>0$ depending on $\varepsilon$ and absolute constants $\alpha_0, R > 0$ with the following property. Let $n \ge n_0$, $(1+\varepsilon) \log n \le np \le n^{\frac{1}{R \log \log n}}$, and $0 < \alpha < \min(\alpha_0,\varepsilon/4)$. There is a polynomial-time algorithm $F$ such that $\mathbb{P}\{F(G^\pi,G')=\pi\}=1-o(1)$. This is the first polynomial-time algorithm that recovers the exact matching between vertices of correlated Erdős–Rényi graphs with constant correlation with high probability. The algorithm is based on comparison of partition trees associated with the graph vertices.
    Binary Iterative Hard Thresholding Converges with Optimal Number of Measurements for 1-Bit Compressed Sensing. (arXiv:2207.03427v1 [cs.IT])
    Compressed sensing has been a very successful high-dimensional signal acquisition and recovery technique that relies on linear operations. However, the actual measurements of signals have to be quantized before storing or processing. One-bit (1-bit) compressed sensing is a heavily quantized version of compressed sensing, where each linear measurement of a signal is reduced to just one bit: the sign of the measurement. Once enough such measurements are collected, the recovery problem in 1-bit compressed sensing aims to find the original signal with as much accuracy as possible. The recovery problem is related to the traditional "halfspace-learning" problem in learning theory. For recovery of sparse vectors, a popular reconstruction method from 1-bit measurements is the binary iterative hard thresholding (BIHT) algorithm. The algorithm is a simple projected sub-gradient descent method, and is known to converge well empirically, despite the nonconvexity of the problem. The convergence property of BIHT was not theoretically justified, except with an exorbitantly large number of measurements (i.e., a number of measurements greater than $\max\{k^{10}, 24^{48}, k^{3.5}/\epsilon\}$, where $k$ is the sparsity, $\epsilon$ denotes the approximation error, and even this expression hides other factors). In this paper we show that the BIHT algorithm converges with only $\tilde{O}(\frac{k}{\epsilon})$ measurements. Note that this dependence on $k$ and $\epsilon$ is optimal for any recovery method in 1-bit compressed sensing. With this result, to the best of our knowledge, BIHT is the only practical and efficient (polynomial time) algorithm that requires the optimal number of measurements in all parameters (both $k$ and $\epsilon$). This is also an example of a gradient descent algorithm converging to the correct solution for a nonconvex problem, under suitable structural conditions.  ( 3 min )
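    Since the algorithm is a simple projected subgradient method, a minimal numpy sketch of BIHT follows; the step size and iteration count are illustrative choices, and because sign measurements carry no scale information the output is normalized.

```python
import numpy as np

# BIHT sketch: recover a k-sparse direction x from one-bit measurements y = sign(Ax).
def biht(A, y, k, n_iter=100, step=None):
    m, n = A.shape
    step = step or 1.0 / m
    x = np.zeros(n)
    for _ in range(n_iter):
        x = x + step * A.T @ (y - np.sign(A @ x))   # subgradient step
        x[np.argsort(np.abs(x))[:-k]] = 0.0         # keep only the k largest entries
    return x / (np.linalg.norm(x) + 1e-12)          # scale is unrecoverable from signs

rng = np.random.default_rng(0)
n, m, k = 100, 500, 5
x_true = np.zeros(n); x_true[:k] = rng.normal(size=k)
x_true /= np.linalg.norm(x_true)
A = rng.normal(size=(m, n))
y = np.sign(A @ x_true)
print(np.linalg.norm(biht(A, y, k) - x_true))       # small recovery error
```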
    Quantum Advantage in Variational Bayes Inference. (arXiv:2207.03104v1 [stat.ML])
    Variational Bayes (VB) inference algorithm is used widely to estimate both the parameters and the unobserved hidden variables in generative statistical models. The algorithm -- inspired by variational methods used in computational physics -- is iterative and can get easily stuck in local minima, even when classical techniques, such as deterministic annealing (DA), are used. We study a variational Bayes (VB) inference algorithm based on a non-traditional quantum annealing approach -- referred to as quantum annealing variational Bayes (QAVB) inference -- and show that there is indeed a quantum advantage to QAVB over its classical counterparts. In particular, we show that such better performance is rooted in key concepts from quantum mechanics: (i) the ground state of the Hamiltonian of a quantum system -- defined from the given variational Bayes (VB) problem -- corresponds to an optimal solution for the minimization problem of the variational free energy at very low temperatures; (ii) such a ground state can be achieved by a technique paralleling the quantum annealing process; and (iii) starting from this ground state, the optimal solution to the VB problem can be achieved by increasing the heat bath temperature to unity, and thereby avoiding local minima introduced by spontaneous symmetry breaking observed in classical physics based VB algorithms. We also show that the update equations of QAVB can be potentially implemented using $\lceil \log K \rceil$ qubits and $\mathcal{O} (K)$ operations per step. Thus, QAVB can match the time complexity of existing VB algorithms, while delivering higher performance.  ( 3 min )
    Back to the Basics: Revisiting Out-of-Distribution Detection Baselines. (arXiv:2207.03061v1 [cs.LG])
    We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).  ( 2 min )
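    The advocated baseline fits in a few lines. The sketch below scores test points by their average distance to the K nearest training representations, with Gaussian toy features standing in for classifier embeddings; the brute-force distance computation is for illustration only.

```python
import numpy as np

def knn_ood_scores(train_feats, test_feats, k=10):
    # brute-force pairwise Euclidean distances (fine at toy scale)
    d = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)   # higher score = more OOD

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))                 # in-distribution "features"
test_in = rng.normal(size=(10, 64))
test_ood = rng.normal(loc=3.0, size=(10, 64))       # shifted distribution
scores = knn_ood_scores(train, np.vstack([test_in, test_ood]))
print(scores[:10].mean(), scores[10:].mean())       # OOD mean should be larger
```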
    Learning towards Robustness in Causally-Invariant Predictors. (arXiv:2107.01876v2 [stat.ML] UPDATED)
    We propose to learn an invariant causal predictor that is robust to distributional shifts, in the supervised regression scenario. Based on a disentangled causal factorization that describes the underlying data generating process, we attribute the distributional shifts to mutation of generating factors, which covers a wide range of cases of distributional shifts as we do not make prior specifications on the causal structure or the source of mutation. Under this causal framework, we identify a set of invariant predictors based on the do-operator. We provide a sufficient and necessary condition for a predictor to be min-max optimal, i.e., minimizes the worst-case quadratic loss among all domains. This condition is justifiable under the Markovian and faithfulness assumptions, thus inspiring a practical algorithm to identify the optimal predictor. For empirical estimation, we propose a permutation-regeneration scheme guided by a local causal discovery procedure. The utility and effectiveness of our method are demonstrated in simulation data and two real-world applications: Alzheimer's disease diagnosis and gene function prediction.  ( 2 min )
    SC2EGSet: StarCraft II Esport Replay and Game-state Dataset. (arXiv:2207.03428v1 [cs.LG])
    As a relatively new form of sport, esports offers unparalleled data availability. Despite the vast amounts of data that are generated by game engines, it can be challenging to extract them and verify their integrity for the purposes of practical and scientific use. Our work aims to open esports to a broader scientific community by supplying raw and pre-processed files from StarCraft II esports tournaments. These files can be used in statistical and machine learning modeling tasks and related to various laboratory-based measurements (e.g., behavioral tests, brain imaging). We have gathered publicly available game-engine generated "replays" of tournament matches and performed data extraction and cleanup using a low-level application programming interface (API) parser library. Additionally, we open-sourced and published all the custom tools that were developed in the process of creating our dataset. These tools include PyTorch and PyTorch Lightning API abstractions to load and model the data. Our dataset contains replays from major and premiere StarCraft II tournaments since 2016. To prepare the dataset, we processed 55 tournament "replaypacks" that contained 17930 files with game-state information. Based on an initial investigation of available StarCraft II datasets, we observed that our dataset is the largest publicly available source of StarCraft II esports data upon its publication. Analysis of the extracted data holds promise for further Artificial Intelligence (AI), Machine Learning (ML), psychological, Human-Computer Interaction (HCI), and sports-related studies in a variety of supervised and self-supervised tasks.  ( 3 min )
    Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. (arXiv:1906.09712v5 [math.ST] UPDATED)
    We propose confidence sequences -- sequences of confidence intervals which are valid uniformly over time -- for quantiles of any distribution over a complete, fully-ordered set, based on a stream of i.i.d. observations. We give methods both for tracking a fixed quantile and for tracking all quantiles simultaneously. Specifically, we provide explicit expressions with small constants for intervals whose widths shrink at the fastest possible $\sqrt{t^{-1} \log\log t}$ rate, along with a non-asymptotic concentration inequality for the empirical distribution function which holds uniformly over time with the same rate. The latter strengthens Smirnov's empirical process law of the iterated logarithm and extends the Dvoretzky-Kiefer-Wolfowitz inequality to hold uniformly over time. We give a new algorithm and sample complexity bound for selecting an arm with an approximately best quantile in a multi-armed bandit framework. In simulations, our method requires fewer samples than existing methods by a factor of five to fifty.  ( 3 min )
    Pre-training helps Bayesian optimization too. (arXiv:2207.03084v1 [cs.LG])
    Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 3 min )
    Multi-objective Optimization of Notifications Using Offline Reinforcement Learning. (arXiv:2207.03029v1 [cs.LG])
    Mobile notification systems play a major role in a variety of applications to communicate, send alerts and reminders to the users to inform them about news, events or messages. In this paper, we formulate the near-real-time notification decision problem as a Markov Decision Process where we optimize for multiple objectives in the rewards. We propose an end-to-end offline reinforcement learning framework to optimize sequential notification decisions. We address the challenge of offline learning using a Double Deep Q-network method based on Conservative Q-learning that mitigates the distributional shift problem and Q-value overestimation. We illustrate our fully-deployed system and demonstrate the performance and benefits of the proposed approach through both offline and online experiments.  ( 2 min )
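    As a hedged sketch of the conservative component such a system could use, the function below adds the standard CQL penalty to a TD loss: it pushes down a log-sum-exp over all actions' Q-values while pushing up the Q-values of actions in the logged data. The function name, network interface, and alpha weight are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Conservative Q-learning penalty on top of a TD loss. `q_net` maps states to
# per-action Q-values; `td_target` would come from a Double DQN target network.
def cql_loss(q_net, states, actions, td_target, alpha=1.0):
    q_all = q_net(states)                                  # (batch, n_actions)
    q_data = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    td_loss = F.mse_loss(q_data, td_target)                # standard TD regression
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative                  # alpha trades off conservatism
```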
    Lower Bounds on the Generalization Error of Nonlinear Learning Models. (arXiv:2103.14723v3 [stat.ML] UPDATED)
    We study in this paper lower bounds for the generalization error of models derived from multi-layer neural networks, in the regime where the size of the layers is commensurate with the number of samples in the training data. We show that unbiased estimators have unacceptable performance for such nonlinear networks in this regime. We derive explicit generalization lower bounds for general biased estimators, in the cases of linear regression and of two-layered networks. In the linear case the bound is asymptotically tight. In the nonlinear case, we provide a comparison of our bounds with an empirical study of the stochastic gradient descent algorithm. The analysis uses elements from the theory of large random matrices.  ( 2 min )
    Challenges and Pitfalls of Bayesian Unlearning. (arXiv:2207.03227v1 [cs.LG])
    Machine unlearning refers to the task of removing a subset of training data, thereby removing its contributions to a trained model. Approximate unlearning methods are one class of approaches to this task that avoid the need to retrain the model from scratch on the retained data. Bayes' rule can be used to cast approximate unlearning as an inference problem where the objective is to obtain the updated posterior by dividing out the likelihood of deleted data. However, this has its own set of challenges, as one often does not have access to the exact posterior of the model parameters. In this work we examine the use of the Laplace approximation and Variational Inference to obtain the updated posterior. With a neural network trained for a regression task as the guiding example, we draw insights on the applicability of Bayesian unlearning in practical scenarios.  ( 2 min )
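    As a deliberately simple instance of "dividing out the likelihood", the sketch below performs Bayesian unlearning for a linear-Gaussian model by subtracting the deleted batch's contribution from the posterior precision and linear term; this is exact in the conjugate case and becomes the approximate update when the precision comes from a Laplace approximation of a neural network posterior. All names here are illustrative.

```python
import numpy as np

# Unlearn a deleted batch (X_del, y_del) from a Gaussian posterior over
# linear-regression weights, stored as precision H_post and linear term
# g_post (so the posterior mean is solve(H_post, g_post)).
def bayes_unlearn(H_post, g_post, X_del, y_del, noise_var=1.0):
    H_new = H_post - X_del.T @ X_del / noise_var   # remove deleted likelihood Hessian
    g_new = g_post - X_del.T @ y_del / noise_var   # remove deleted linear term
    return np.linalg.solve(H_new, g_new), H_new    # updated mean and precision
```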
    On the Equivalence between Neural Network and Support Vector Machine. (arXiv:2111.06063v2 [stat.ML] UPDATED)
    Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) (Jacot et al., 2018). Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with the NTK (Arora et al., 2019). However, the equivalence is currently only known for ridge regression (Arora et al., 2019), while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalences between NNs and a broad family of $\ell_2$ regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate our theory can enable three practical applications, including (i) non-vacuous generalization bound of NN via the corresponding KM; (ii) non-trivial robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at https://github.com/leslie-CH/equiv-nn-svm.  ( 3 min )
    Pre-trained Gaussian processes for Bayesian optimization. (arXiv:2109.08215v4 [cs.LG] UPDATED)
    Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. Theoretically, we show a bounded regret of BO with pre-trained priors. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.  ( 3 min )
    A single $T$-gate makes distribution learning hard. (arXiv:2207.03140v1 [quant-ph])
    The task of learning a probability distribution from samples is ubiquitous across the natural sciences. The output distributions of local quantum circuits form a particularly interesting class of distributions, of key importance both to quantum advantage proposals and a variety of quantum machine learning algorithms. In this work, we provide an extensive characterization of the learnability of the output distributions of local quantum circuits. Our first result yields insight into the relationship between the efficient learnability and the efficient simulatability of these distributions. Specifically, we prove that the density modelling problem associated with Clifford circuits can be efficiently solved, while for depth $d=n^{\Omega(1)}$ circuits the injection of a single $T$-gate into the circuit renders this problem hard. This result shows that efficient simulatability does not imply efficient learnability. Our second set of results provides insight into the potential and limitations of quantum generative modelling algorithms. We first show that the generative modelling problem associated with depth $d=n^{\Omega(1)}$ local quantum circuits is hard for any learning algorithm, classical or quantum. As a consequence, one cannot use a quantum algorithm to gain a practical advantage for this task. We then show that, for a wide variety of the most practically relevant learning algorithms -- including hybrid quantum-classical algorithms -- even the generative modelling problem associated with depth $d=\omega(\log(n))$ Clifford circuits is hard. This result places limitations on the applicability of near-term hybrid quantum-classical generative modelling algorithms.  ( 3 min )
    On the instrumental variable estimation with many weak and invalid instruments. (arXiv:2207.03035v1 [stat.ME])
    We discuss the fundamental issue of identification in linear instrumental variable (IV) models with unknown IV validity. We revisit the popular majority and plurality rules and show that no identification condition can be "if and only if" in general. With the assumption of the "sparsest rule", which is equivalent to the plurality rule but becomes operational in computation algorithms, we investigate and prove the advantages of non-convex penalized approaches over other IV estimators based on two-step selections, in terms of selection consistency and accommodation for individually weak IVs. Furthermore, we propose a surrogate sparsest penalty that aligns with the identification condition and provides oracle sparse structure simultaneously. Desirable theoretical properties are derived for the proposed estimator with weaker IV strength conditions compared to the previous literature. Finite sample properties are demonstrated using simulations and the selection and estimation method is applied to an empirical study concerning the effect of trade on economic growth.  ( 2 min )
    A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits. (arXiv:2207.03106v1 [cs.LG])
    We study federated contextual linear bandits, where $M$ agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server. We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents' communication. We propose a simple algorithm named FedLinUCB based on the principle of optimism. We prove that the regret of FedLinUCB is bounded by $\tilde{O}(d\sqrt{\sum_{m=1}^M T_m})$ and the communication complexity is $\tilde{O}(dM^2)$, where $d$ is the dimension of the contextual vector and $T_m$ is the total number of interactions with the environment by the $m$-th agent. To the best of our knowledge, this is the first provably efficient algorithm that allows fully asynchronous communication for federated contextual linear bandits, while achieving the same regret guarantee as in the single-agent setting.  ( 2 min )
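    For orientation, the single-agent LinUCB core that an algorithm like FedLinUCB builds on looks roughly like the sketch below; in a federated variant each agent would keep local statistics of this form and occasionally merge them at the server. The bonus constant and dimensions are illustrative assumptions.

```python
import numpy as np

class LinUCB:
    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)    # regularized Gram matrix of seen contexts
        self.b = np.zeros(d)        # running sum of reward-weighted contexts
        self.beta = beta            # exploration bonus scale (illustrative)

    def choose(self, contexts):     # contexts: (n_arms, d)
        V_inv = np.linalg.inv(self.V)
        theta = V_inv @ self.b      # ridge estimate of the unknown parameter
        bonus = self.beta * np.sqrt(
            np.einsum('ad,dk,ak->a', contexts, V_inv, contexts))
        return int(np.argmax(contexts @ theta + bonus))

    def update(self, x, r):         # observed context x and reward r
        self.V += np.outer(x, x)
        self.b += r * x
```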
    Functional additive models on manifolds of planar shapes and forms. (arXiv:2109.02624v4 [stat.ME] UPDATED)
    The "shape" of a planar curve and/or landmark configuration is considered its equivalence class under translation, rotation and scaling, its "form" its equivalence class under translation and rotation while scale is preserved. We extend generalized additive regression to models for such shapes/forms as responses respecting the resulting quotient geometry by employing the squared geodesic distance as loss function and a geodesic response function to map the additive predictor to the shape/form space. For fitting the model, we propose a Riemannian $L_2$-Boosting algorithm well suited for a potentially large number of possibly parameter-intensive model terms, which also yields automated model selection. We provide novel intuitively interpretable visualizations for (even non-linear) covariate effects in the shape/form space via suitable tensor-product factorization. The usefulness of the proposed framework is illustrated in an analysis of 1) astragalus shapes of wild and domesticated sheep and 2) cell forms generated in a biophysical model, as well as 3) in a realistic simulation study with response shapes and forms motivated from a dataset on bottle outlines.  ( 2 min )
    Riemannian Diffusion Schrödinger Bridge. (arXiv:2207.03024v1 [stat.ML])
    Score-based generative models exhibit state of the art performance on density estimation and generative modeling tasks. These models typically assume that the data geometry is flat, yet recent extensions have been developed to synthesize data living on Riemannian manifolds. Existing methods to accelerate sampling of diffusion models are typically not applicable in the Riemannian setting, and Riemannian score-based methods have not yet been adapted to the important task of interpolation of datasets. To overcome these issues, we introduce Riemannian Diffusion Schrödinger Bridge. Our proposed method generalizes the Diffusion Schrödinger Bridge introduced in De Bortoli et al. (2021) to the non-Euclidean setting and extends Riemannian score-based models beyond the first time reversal. We validate our proposed method on synthetic data and real Earth and climate data.  ( 2 min )
    Provable Domain Generalization via Invariant-Feature Subspace Recovery. (arXiv:2201.12919v2 [cs.LG] UPDATED)
    Domain generalization asks for models trained over a set of training environments to perform well in unseen test environments. Recently, a series of algorithms such as Invariant Risk Minimization (IRM) has been proposed for domain generalization. However, Rosenfeld et al. (2021) shows that in a simple linear data model, even if non-convexity issues are ignored, IRM and its extensions cannot generalize to unseen environments with less than $d_s+1$ training environments, where $d_s$ is the dimension of the spurious-feature subspace. In this paper, we propose to achieve domain generalization with Invariant-feature Subspace Recovery (ISR). Our first algorithm, ISR-Mean, can identify the subspace spanned by invariant features from the first-order moments of the class-conditional distributions, and achieve provable domain generalization with $d_s+1$ training environments under the data model of Rosenfeld et al. (2021). Our second algorithm, ISR-Cov, further reduces the required number of training environments to $O(1)$ using the information of second-order moments. Notably, unlike IRM, our algorithms bypass non-convexity issues and enjoy global convergence guarantees. Empirically, our ISRs can obtain superior performance compared with IRM on synthetic benchmarks. In addition, on three real-world image and text datasets, we show that both ISRs can be used as simple yet effective post-processing methods to improve the worst-case accuracy of (pre-)trained models against spurious correlations and group shifts.  ( 3 min )
    Unsupervised Manifold Alignment with Joint Multidimensional Scaling. (arXiv:2207.02968v1 [stat.ML])
    We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment.  ( 2 min )
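    The MDS building block at the heart of the method can be sketched in a few lines: classical MDS recovers an embedding from a pairwise-dissimilarity matrix via double centering and an eigendecomposition. The joint coupling of two datasets through Wasserstein Procrustes is the paper's contribution and is omitted here.

```python
import numpy as np

# Classical MDS: embed n points in `dim` dimensions from an (n, n)
# dissimilarity matrix D, using double centering of squared dissimilarities.
def classical_mds(D, dim=2):
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]           # top `dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```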
    Model Selection in Reinforcement Learning with General Function Approximations. (arXiv:2207.02992v1 [stat.ML])
    We consider model selection for classic Reinforcement Learning (RL) environments -- Multi-Armed Bandits (MABs) and Markov Decision Processes (MDPs) -- under general function approximations. In the model selection framework, we do not know the function classes, denoted by $\mathcal{F}$ and $\mathcal{M}$, where the true models -- the reward generating function for MABs and the transition kernel for MDPs -- lie, respectively. Instead, we are given $M$ nested function (hypothesis) classes such that the true models are contained in at least one such class. In this paper, we propose and analyze efficient model selection algorithms for MABs and MDPs that adapt to the smallest function class (among the nested $M$ classes) containing the true underlying model. Under a separability assumption on the nested hypothesis classes, we show that the cumulative regret of our adaptive algorithms matches that of an oracle which knows the correct function classes (i.e., $\mathcal{F}$ and $\mathcal{M}$) a priori. Furthermore, for both settings, we show that the cost of model selection is an additive term in the regret having weak (logarithmic) dependence on the learning horizon $T$.  ( 2 min )

  • Open

    [R] Self-Modeling Programs: A Direct Approach to Program Likelihood
    PDF Link Abstract: In algorithmic information theory, the length of a program is used as a measure of its probability. This paper presents a category of programs that directly compute the combined probability of their own code symbols and input data. The probability of each symbol is computed from past symbols by requiring that execution of the program formed by the first n symbols returns a probability distribution over symbols for position n + 1. The program of this type with the highest likelihood ending in the input data sequence intuitively represents the most likely sequence of events that could have generated the data. Advantages of programs of this form and the relationship to the Kolmogorov complexity are discussed. I'd appreciate any criticisms or comments. submitted by /u/ml6189 [link] [comments]  ( 86 min )
    [R] Collecting survey responses for Machine Learning Report in Australia and New Zealand
    Hi everyone. I hope this post doesn't go against community or this subreddit's rules. Please remove if so (or point me in the right direction). The organisation I work for, DiUS, is conducting some research into Machine Learning. We're looking for people at all stages of their ML journey, from experimenting to applying, to complete a quick five minute survey. The results will inform our 2022 National Pulse Report, due to be published later this year. Please note: We are only looking for responses from those in Australia and New Zealand. For your time, we'll send you a copy of the published report and make a donation to our charity partner OzHarvest. Thanks! https://www.surveymonkey.com/r/dius_ml_survey?&utm_source=ml-reddit&utm_medium=display&utm_campaign=mlsurvey&utm_content=homepage submitted by /u/cj_td [link] [comments]  ( 86 min )
    [D] How to deal with badly labelled data?
    The labeling team at my organization is very bad. They take forever to understand the labeling objective, and produce datasets that are not very reliable. They take months to annotate a small dataset of roughly 2000 images. Now, I have a few questions: How do I spot these anomalies? (Classification Dataset) How do I generate pseudo labels or use similar techniques to generate data for training? Should I complain about them to my manager or ask them to label the datasets again? Because this situation is getting out of hand submitted by /u/FnSK4R17s [link] [comments]  ( 86 min )
    [D] Paper Explained - JEPA: A Path Towards Autonomous Machine Intelligence (Video Walkthrough)
    https://youtu.be/jSdHmImyUjk Yann LeCun's position paper on a path towards machine intelligence combines Self-Supervised Learning, Energy-Based Models, and hierarchical predictive embedding models to arrive at a system that can teach itself to learn useful abstractions at multiple levels and use that as a world model to plan ahead in time. OUTLINE: 0:00 - Introduction; 2:00 - Main Contributions; 5:45 - Mode 1 and Mode 2 actors; 15:40 - Self-Supervised Learning and Energy-Based Models; 20:15 - Introducing latent variables; 25:00 - The problem of collapse; 29:50 - Contrastive vs regularized methods; 36:00 - The JEPA architecture; 47:00 - Hierarchical JEPA (H-JEPA); 53:00 - Broader relevance; 56:00 - Summary & Comments. Paper: https://openreview.net/forum?id=BZ5a1r-kVsf submitted by /u/ykilcher [link] [comments]  ( 86 min )
    [D] Current state of modeling uncertainty for Bayesian optimization?
    Hello, recently I've gotten into Bayesian Optimization and was looking to use it for a NAS use case with a Gaussian process, but it seems to me that there are far better-scaling options than a GP, such as MC dropout on a regular NN, BNNs, NN ensembles, and SWAG (which I don't really understand). I would appreciate any advice on the advantages/disadvantages of these methods, or a pointer to a survey/review of modeling uncertainty. Thanks in advance. submitted by /u/Nearby-Vehicle6622 [link] [comments]  ( 87 min )
    [D] An accusation of academic misconduct by Prof. Yisen Wang (Peking University) in ICML2021 and NeurIPS2021
I recently noticed a Weibo (Chinese Twitter) thread alleging potential academic misconduct - Prof. Yisen Wang's girlfriend accused him of cheating and collusion in recent top-tier machine learning conferences, including but possibly not limited to NeurIPS 2021 and ICML 2021. Yisen Wang (homepage: https://yisenwang.github.io/) obtained his Ph.D. degree at Tsinghua University (China) and is now an assistant professor at Peking University (China). Yisen is interested in adversarial attacks, etc. Here are some facts from Yisen's girlfriend's post: [Cheating in best paper nomination in ICML 2021] In ICML 2021, Yisen asked an area chair of ICML 2021 to recommend his first PhD student Jingyi Cui's paper to be a best paper candidate (I am not sure if it is termed a "best paper candidate", …  ( 93 min )
    [Discussion] About model serving for production
Hi! I hope I'm not breaking any rules with this question. I'm studying some frameworks used in production for model serving, namely Seldon Core, Kubeflow, and an academic artifact named Clipper. Some can manage the entire ML life cycle, but I have a question about serving in production. In particular, how would one go about actually batching multiple requests on the cloud? There doesn't seem to be a gold standard for it, so I'm assuming it depends on the size of the data and on the scope of the model, right? For example, if the goal is image classification, it could be useful to have a cloud queue, right? If so, do you know of some solutions that are actually used in production? submitted by /u/Mediocre-Piccolo7474 [link] [comments]  ( 86 min )
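For intuition, here is a bare-bones dynamic-batching sketch of the kind serving frameworks implement internally: buffer incoming requests until either max_batch items have arrived or max_wait seconds have elapsed, then make one model call for the whole batch. predict_batch is an assumed stand-in for your model.

import asyncio

queue: asyncio.Queue = asyncio.Queue()

async def handle_request(x):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut  # resolves once the batch containing x has been scored

async def batcher(predict_batch, max_batch=32, max_wait=0.01):
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]            # block until the first request
        deadline = loop.time() + max_wait
        while len(items) < max_batch:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        xs, futs = zip(*items)
        for fut, y in zip(futs, predict_batch(list(xs))):  # one call per batch
            fut.set_result(y)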
    [D] LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015
Saw Schmidhuber tweeting again: 🔥 “Lecun’s 2022 paper on Autonomous Machine Intelligence rehashes but doesn’t cite essential work of 1990-2015. We’ve already published his “main original contributions:” learning subgoals, predictable abstract representations, multiple time scales…” Jürgen Schmidhuber’s response to Yann LeCun’s recent technical report / position paper “Autonomous Machine Intelligence” is in this latest blog post: https://people.idsia.ch/~juergen/lecun-rehash-1990-2022.html An excerpt: On 14 June 2022, a science tabloid that published this article (24 June) on LeCun's report “A Path Towards Autonomous Machine Intelligence” (27 June) sent me a draft of the report (back then still under embargo) and asked for comments. I wrote a review (see below), telling them that this is essentially a rehash of our previous work that LeCun did not mention. My comments, however, fell on deaf ears. Now I am posting my not so enthusiastic remarks here such that the history of our field does not become further corrupted. The images below link to relevant blog posts from the AI Blog. I would like to start this by acknowledging that I am not without a conflict of interest here; my seeking to correct the record will naturally seem self-interested. The truth of the matter is that it is. Much of the closely related work pointed to below was done in my lab, and I naturally wish that it be acknowledged, and recognized. Setting my conflict aside, I ask the reader to study the original papers and judge for themselves the scientific content of these remarks, as I seek to set emotions aside and minimize bias so much as I am capable. For reference, previous discussion on r/MachineLearning about Yann LeCun’s paper: https://www.reddit.com/r/MachineLearning/comments/vm39oe/a_path_towards_autonomous_machine_intelligence/ submitted by /u/hardmaru [link] [comments]  ( 90 min )
    [D] Why do first layer filters in CNNs converge to edge-detector-like filters?
I believe it's well known that first-layer filters in CNNs generally converge to "edge-detector-like" shapes like this: shorturl.at/ANS78. This phenomenon is independent of the task from what I've seen - every large CNN backbone I've trained converges to this given enough data. There is also research showing this type of edge detection happens in the visual cortex. Thus this edge-detector phenomenon appears to be some fundamentally emergent property of the real world (+ maybe CNN-type processors). Is there any compelling technical explanation for how SGD and its variants can reliably produce this convergence? I don't mean why edge detectors are "good" first-stage filters - that intuitively makes sense to me. But rather, how is it that SGD can reliably produce this type of convergence on any dataset? I've been looking for a while for an explanation but couldn't find anything great. I was thinking that maybe there is some explanation using an assumption that edges are naturally "higher information" in raw images from the real world, and thus stepped towards more directly by the gradient? But I can't get the explanation to a satisfying state. submitted by /u/AeronByHermanMiller [link] [comments]  ( 90 min )
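To reproduce the observation, a short sketch that plots the 7x7 first-layer filters of an ImageNet-trained ResNet (the torchvision weight-loading API differs slightly across versions; pretrained=True is the older form):

import matplotlib.pyplot as plt
import torchvision

model = torchvision.models.resnet18(pretrained=True)
filters = model.conv1.weight.detach()                  # shape (64, 3, 7, 7)
filters = (filters - filters.min()) / (filters.max() - filters.min())

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))                      # channels-last for imshow
    ax.axis("off")
plt.show()

Oriented, Gabor-like edge filters and color blobs are typically visible in the resulting grid.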
  • Open

    Using artificial intelligence, scientists recreate the smells of the past
    submitted by /u/ezikler [link] [comments]  ( 84 min )
AI2’s PRIOR Team Introduces Unified-IO: The First Neural Model To Execute Various AI Tasks Spanning Classical Computer Vision, Image Synthesis, Vision-and-Language, and Natural Language Processing (NLP)
Almost all industries now use machine learning systems to improve the efficiency and dependability of their work. With the increasing use of ML, companies have seen a boom in investment in the resources needed to support ML systems. Additionally, a single ML process often necessitates the execution of numerous distinct models, further complicating the process and increasing costs. The idea of “Unified Models” was established in recent years, where a single model is constructed to power a process or product rather than a collection of connected but independent models. Combining all of the necessary data into one array and passing it to the model makes it possible to create a unified model that delivers all of the findings at once rather than calling individual models one at a time. Continue reading | Check out the demo submitted by /u/ai-lover [link] [comments]  ( 85 min )
    AI Dream 60 - EPIC Cosmic Midjourney Expedition by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 84 min )
    Nvidia Omniverse AI Predicts Alternate Future of The World | FIFA Uses Full Body Tracking AI | New Meta AI Translates 200 Languages With Highest Degree of Accuracy
    submitted by /u/tohelpyou88 [link] [comments]  ( 84 min )
    Midjourney Invites
    If anyone wants to get a midjourney invite, feel free to DM me. submitted by /u/xAnunnakix [link] [comments]  ( 84 min )
    It’s so hard to find motivation to finish your art sometimes especially when you work full time 😫
    submitted by /u/Legitimate_Run_6350 [link] [comments]  ( 84 min )
    AI: Diagnosis and Forecasting Spread of Infectious Diseases
AI is taking great strides in facilitating the way organizations are handling the pandemic. Moreover, the scope for AI professionals in healthcare sectors could be bountiful. submitted by /u/Emily-joe [link] [comments]  ( 84 min )
    AI Referee Will Track Players' Individual Limbs at World Cup
    submitted by /u/estasfuera [link] [comments]  ( 85 min )
    Quick analysis of the most in-demand jobs in AI/ML in 2022
    In short: Data Engineers are still the most sought-after professionals in the field (more engineering, less "modeling"?), demand for analysts and leadership (!) roles is on the rise. Full insights here: https://insights.ai-jobs.net/the-10-most-in-demand-jobs-in-ai-ml-and-big-data-in-2022/ submitted by /u/ai_jobs [link] [comments]  ( 84 min )
    Customizable Writing AI?
    This is totally a shot in the dark but I'm going for it anyway. Long story short, for kicks and giggles, I am trying to find a writing AI that allows you to input example writing of your choice to pull from rather than just give it a sentence prompt. I've spent an hour or so trying to google one up to no avail. Huge thanks in advance! submitted by /u/MustangLegends [link] [comments]  ( 85 min )
    Who needs a midjourney invite? Bc I got some left
    I dunno who to give these invites to so if anyone needs one I got you all!! submitted by /u/CombinationMammoth50 [link] [comments]  ( 84 min )
    Fairy's Pure Beauty | Raw Unscaled (FILM) | PYTTI 3D AI Art Animation
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
    Is there an AI out there that takes an image and makes it more "realistic?"
    I know AI face generators can make pretty impressive images from scratch but I'm wondering if there is something that takes an image as input (like a video game screenshot or character) and spits out a more realistic version of it. I could imagine this would be a fun tool for 3D artists. Thanks! submitted by /u/ImPlento [link] [comments]  ( 85 min )
  • Open

    Drive efficiencies with CI/CD best practices on Amazon Lex
    Let’s say you have identified a use case in your organization that you would like to handle via a chatbot. You familiarized yourself with Amazon Lex, built a prototype, and did a few trial interactions with the bot. You liked the overall experience and now want to deploy the bot in your production environment, but […]  ( 7 min )
    Feature engineering at scale for healthcare and life sciences with Amazon SageMaker Data Wrangler
    Machine learning (ML) is disrupting a lot of industries at an unprecedented pace. The healthcare and life sciences (HCLS) industry has been going through a rapid evolution in recent years embracing ML across a multitude of use cases for delivering quality care and improving patient outcomes. In a typical ML lifecycle, data engineers and scientists […]  ( 17 min )
  • Open

    Mission-Driven: Takeaways From Our Corporate Responsibility Report
    NVIDIA’s latest corporate responsibility report shares our efforts in empowering employees and putting to work our technologies for the benefit of humanity. Amid ongoing global economic concerns and pandemic challenges, this year’s report highlights our ability to attract and retain talent that come here to do their life’s work while tackling some of the world’s Read article > The post Mission-Driven: Takeaways From Our Corporate Responsibility Report appeared first on NVIDIA Blog.  ( 7 min )
    GFN Thursday Brings New Games to GeForce NOW for the Perfect Summer Playlist
    Nothing beats the summer heat like GFN Thursday. Get ready for four new titles streaming at GeForce quality across nearly any device. Buckle up for some great gaming, whether poolside, in the car for a long road trip, or in the air-conditioned comfort of home. Speaking of summer, it’s also last call for this year’s Read article > The post GFN Thursday Brings New Games to GeForce NOW for the Perfect Summer Playlist appeared first on NVIDIA Blog.  ( 5 min )
    Wordle for AI: Santiago Valderrama on Getting Smarter on Machine Learning
    Want to learn about AI and machine learning? There are plenty of resources out there to help — blogs, podcasts, YouTube tutorials — perhaps too many. Machine learning engineer Santiago Valdarrama has taken a far more focused approach to helping us all get smarter about the field. He’s created a following by posing one machine Read article > The post Wordle for AI: Santiago Valderrama on Getting Smarter on Machine Learning appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Memory allocation problems in Stable Baselines3
I'm trying to make an AI that finds the exit in a 50x50 maze using Stable Baselines3. The maze is represented by a 2D list where -1 means unexplored, 0 means empty space, 1 means wall and 2 means exit. There's another list on top of this one with the player's coordinates (so it's a 3D list). It begins like this: self.pmp = [[-1]*50 for _ in range(50)]. This is the AI's personal map; there's also an objective map which is fully explored, and it's gradually added to the personal map depending on the agent's coordinates. But every time I try to train the AI with model = DQN('MlpPolicy', env, verbose=1) I get this error: numpy.core._exceptions._ArrayMemoryError: Unable to allocate 18.6 GiB for an array with shape (1000000, 1, 2, 50, 50) and data type int32. Not sure where the 1000000 came from. I tried saving memory by replacing that first bit of code with this: self.pmp = np.empty((50,50)). But it didn't do anything. Is there a way to reduce the memory this process takes up? submitted by /u/AnonCaptain0022 [link] [comments]  ( 85 min )
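The 1,000,000 is DQN's default replay-buffer size in Stable Baselines3: the buffer pre-allocates buffer_size copies of the observation, hence the shape (1000000, 1, 2, 50, 50) in int32. A sketch of the usual fix:

from stable_baselines3 import DQN

# cap the replay buffer; 50k transitions is 20x less memory than the 1e6 default
model = DQN("MlpPolicy", env, buffer_size=50_000, verbose=1)

Declaring the observation space with a 1-byte dtype such as np.int8 (the maze values fit in -1..2) shrinks it by another 4x; see also the replay buffer's optimize_memory_usage option, whose behavior is version-dependent.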
ELI5: Baird's counterexample.
Hi, I am really confused about Baird's counterexample (my online searches for "braids" either made me more confused or gave me braided haircuts): where does the state start, and how does it affect function approximation, etc.? ELI5 is maybe asking a bit much, but could someone help explain Baird's counterexample? My best guess right now is that it's something like a cyclic import in Python. submitted by /u/100M-900 [link] [comments]  ( 84 min )
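For concreteness, a numpy sketch of Baird's counterexample (the 7-state version from Sutton & Barto): all rewards are zero, the behavior policy takes the "solid" action to state 7 with probability 1/7, the target policy always takes it, and off-policy semi-gradient TD(0) with importance-sampling ratios makes the linear weights grow without bound.

import numpy as np

X = np.zeros((7, 8))            # one feature vector per state
for i in range(6):
    X[i, i], X[i, 7] = 2.0, 1.0
X[6, 6], X[6, 7] = 1.0, 2.0

w = np.ones(8); w[6] = 10.0
gamma, alpha, rng = 0.99, 0.01, np.random.default_rng(0)
s = 0
for t in range(2000):
    solid = rng.random() < 1 / 7          # behavior policy
    s2 = 6 if solid else rng.integers(6)  # solid -> state 7; dashed -> an upper state
    rho = 7.0 if solid else 0.0           # target policy always takes "solid"
    td_error = gamma * X[s2] @ w - X[s] @ w   # all rewards are 0
    w += alpha * rho * td_error * X[s]
    s = s2
print(w)   # the weights diverge: off-policy TD + linear FA can be unstable

The "cyclic import" intuition is roughly right: each state's value estimate bootstraps off another estimate that shares weights with it, and under the off-policy state distribution the updates reinforce each other instead of contracting.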
I have been reading about POMDPs but am still confused about the differences between state, observation, and belief. Can someone please explain, preferably with an example?
    submitted by /u/aabra__ka__daabra [link] [comments]  ( 86 min )
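A worked example with the classic tiger problem: the state (tiger-left or tiger-right) is hidden; the observation is the noisy growl heard after the "listen" action, correct 85% of the time; and the belief is the probability the agent assigns to each state, updated by Bayes' rule, b'(s') ∝ O(o|s',a) Σ_s T(s'|s,a) b(s). A minimal sketch:

import numpy as np

# P(observation | state=[tiger-left, tiger-right]) after the "listen" action
O = {"growl-left":  np.array([0.85, 0.15]),
     "growl-right": np.array([0.15, 0.85])}

def update_belief(b, obs):
    # "listen" leaves the state unchanged, so the transition step is the identity
    unnorm = O[obs] * b
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])             # initial belief: no idea where the tiger is
b = update_belief(b, "growl-left")   # -> [0.85, 0.15]
b = update_belief(b, "growl-left")   # -> ~[0.97, 0.03]
print(b)

The agent never observes the state itself; it acts on the belief, which is why POMDP policies map beliefs (not states) to actions.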
Where can I get pre-trained machine learning models?
    submitted by /u/PopOk539 [link] [comments]  ( 84 min )
    RecurrentPPO (SB3-contrib) learning for autonomous driving
Hi everyone! I'm a complete newbie to DRL, so please forgive my lack of understanding of some things here. I'm training a RecurrentPPO from SB3-contrib on E.Leurent's highway env [https://github.com/eleurent/highway-env] (I customized the action space to be more high-level). During training I get the desired behavioural outcome from the agent, but I noticed that some of the model's training metrics seem quite off with respect to the trends found online (especially the explained variance). I just wanted an opinion from some more experienced folks here! Can I somehow fix this trend by hyperparameter tuning, or do I have to e.g. modify the reward function somehow? How can I improve the training? For any details I'm always available. [Tensorboard plots: fixed-LR RecurrentPPO; linearly decreasing LR RecurrentPPO] P.S. with a fixed LR the model performs way better on the env it trained on and is very poor in exploitation on more complex envs (but that's OK, there are scenarios it couldn't have seen), while the one with decreasing LR performs poorly on the training env (crashes a lot) and does better in exploitation (but it has a weird way of navigating). Thank you in advance for the help! submitted by /u/pigopigu [link] [comments]  ( 85 min )
    Question about the old policy and new policy in TRPO code
The code in question is a TRPO implementation. In this code, in "get_kl", I can't understand the difference between "mean0, log_std0, std0" and "mean1, log_std1, std1" - aren't they equal in the code? Likewise for the difference between the log_probs of the old policy and the new policy in "get_loss" - aren't they equal in the code? Thanks for the help! submitted by /u/Snoopy9797 [link] [comments]  ( 86 min )
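They are numerically equal at the current parameters - the "0" quantities are (very likely, in implementations of this pattern) detached copies of the "1" quantities - but they are not equal as functions: detaching makes the old policy a constant, so gradients flow only through the new side. TRPO only needs this KL's curvature (grad-of-grad) to build Fisher-vector products, and likewise the loss ratio exp(logp_new - logp_old) equals 1 at the first step while still having a nonzero gradient. A sketch for a diagonal-Gaussian policy:

import torch

def get_kl(mean1, log_std1):
    # "old" policy: same numbers, but treated as constants
    mean0, log_std0 = mean1.detach(), log_std1.detach()
    std0, std1 = log_std0.exp(), log_std1.exp()
    # KL(old || new) per dimension, summed over action dimensions
    kl = (log_std1 - log_std0
          + (std0.pow(2) + (mean0 - mean1).pow(2)) / (2.0 * std1.pow(2)) - 0.5)
    return kl.sum(dim=-1, keepdim=True)

At evaluation the returned KL is exactly zero, but differentiating it twice with respect to (mean1, log_std1) yields the Fisher information used for the natural-gradient step.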
  • Open

    Enabling Creative Expression with Concept Activation Vectors
    Posted by Been Kim, Research Scientist, Google Research, Brain Team, and Alison Lentz, Senior Staff Strategist, Google Research, Mural Team Advances in computer vision and natural language processing continue to unlock new ways of exploring billions of images available on public and searchable websites. Today’s visual search tools make it possible to search with your camera, voice, text, images, or multiple modalities at the same time. However, it remains difficult to input subjective concepts, such as visual tones or moods, into current systems. For this reason, we have been working collaboratively with artists, photographers, and image researchers to explore how machine learning (ML) might enable people to use expressive queries as a way of visually exploring datasets. Today, we are i…  ( 22 min )
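The core of a concept activation vector, sketched under simple assumptions: collect intermediate-layer activations for images that exhibit a subjective concept (say, a visual mood) and for random images, fit a linear classifier in activation space, and use its normal vector as the concept direction. get_activations is an assumed stand-in for a forward hook on the chosen layer.

import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts, random_acts):
    X = np.vstack([concept_acts, random_acts])
    y = np.array([1] * len(concept_acts) + [0] * len(random_acts))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)   # unit "concept direction" in activation space

# a new image can then be ranked by how far its activation points along the
# direction: score = get_activations(img) @ cav

This is the mechanism that lets expressive, subjective queries rank a dataset with only a small concept set and no per-image labels.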
  • Open

    Sentient AI And The Turing Test — Did Google Engineer Prove Computers Can Have Feelings?
One of the biggest stories of the year in the AI community is about a Google engineer’s claim of sentient AI. This was part of Google’s LaMDA…  ( 21 min )
    Artificial Intelligences
    Artificial intelligences as our allies  ( 8 min )
  • Open

    AI4Science to empower the fifth paradigm of scientific discovery
    Over the coming decade, deep learning looks set to have a transformational impact on the natural sciences. The consequences are potentially far-reaching and could dramatically improve our ability to model and predict natural phenomena over widely varying scales of space and time. Could this capability represent the dawn of a new paradigm of scientific discovery? […] The post AI4Science to empower the fifth paradigm of scientific discovery appeared first on Microsoft Research.  ( 10 min )
  • Open

    Smart textiles sense how their users are moving
    Researchers develop a comfortable, form-fitting fabric that recognizes its wearer’s activities, like walking, running, and jumping.  ( 8 min )
  • Open

    AI-enhanced iterative solvers for accelerating the solution of large scale parametrized linear systems of equations. (arXiv:2207.02543v1 [math.NA])
Recent advances in the field of machine learning open a new era in high performance computing. Applications of machine learning algorithms for the development of accurate and cost-efficient surrogates of complex problems have already attracted major attention from scientists. Despite their powerful approximation capabilities, however, surrogates cannot produce the `exact' solution to the problem. To address this issue, this paper exploits up-to-date ML tools and delivers customized iterative solvers of linear equation systems, capable of solving large-scale parametrized problems at any desired level of accuracy. Specifically, the proposed approach consists of the following two steps. First, a reduced set of model evaluations is performed and the corresponding solutions are used to establish an approximate mapping from the problem's parametric space to its solution space using deep feedforward neural networks and convolutional autoencoders. This mapping serves as a means to obtain very accurate initial predictions of the system's response to new query points at negligible computational cost. Subsequently, an iterative solver inspired by the Algebraic Multigrid method in combination with Proper Orthogonal Decomposition, termed POD-2G, is developed that successively refines the initial predictions towards the exact system solutions. The application of POD-2G as a standalone solver or as a preconditioner in the context of preconditioned conjugate gradient methods is demonstrated on several numerical examples of large scale systems, with the results indicating its superiority over conventional iterative solution schemes.  ( 3 min )
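The two-step pattern in miniature (a generic sketch, not the paper's POD-2G solver): a trained surrogate predict_x0 maps the problem data to an approximate solution, which a classical Krylov solver then refines to the requested tolerance. A good initial guess translates directly into fewer iterations.

import numpy as np
from scipy.sparse.linalg import cg

def solve(A, b, predict_x0, tol=1e-8):
    x0 = predict_x0(b)                   # cheap, approximate NN prediction
    x, info = cg(A, b, x0=x0, tol=tol)   # refine to the exact solution (to tol)
    return x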
    DIWIFT: Discovering Instance-wise Influential Features for Tabular Data. (arXiv:2207.02773v1 [cs.LG])
Tabular data is one of the most common data storage formats in business applications, ranging from retail and banking to e-commerce. These applications rely heavily on machine learning models to achieve business success. One of the critical problems in learning tabular data is to distinguish influential features from all the predetermined features. Global feature selection has been well studied for quite some time, assuming that all instances share the same influential feature subset. However, different instances rely on different feature subsets in practice, which is why instance-wise feature selection has received increasing attention in recent studies. In this paper, we first propose a novel method for discovering instance-wise influential features for tabular data (DIWIFT), the core of which is to introduce the influence function to measure the importance of an instance-wise feature. DIWIFT is capable of automatically discovering influential feature subsets of different sizes in different instances, which differs from global feature selection that considers all instances with the same influential feature subset. On the other hand, different from previous instance-wise feature selection, DIWIFT minimizes the validation loss on the validation set and is thus more robust to the distribution shift existing between the training dataset and test dataset, which is important in tabular data. Finally, we conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our DIWIFT, comparing it with baseline methods. Moreover, we also demonstrate the robustness of our method via some ablation experiments.  ( 3 min )
    Clustering with Semidefinite Programming and Fixed Point Iteration. (arXiv:2012.09202v3 [math.OC] UPDATED)
    We introduce a novel method for clustering using a semidefinite programming (SDP) relaxation of the Max k-Cut problem. The approach is based on a new methodology for rounding the solution of an SDP relaxation using iterated linear optimization. We show the vertices of the Max k-Cut relaxation correspond to partitions of the data into at most k sets. We also show the vertices are attractive fixed points of iterated linear optimization. Each step of this iterative process solves a relaxation of the closest vertex problem and leads to a new clustering problem where the underlying clusters are more clearly defined. Our experiments show that using fixed point iteration for rounding the Max k-Cut SDP relaxation leads to significantly better results when compared to randomized rounding.  ( 2 min )
    When does Bias Transfer in Transfer Learning?. (arXiv:2207.02842v1 [cs.LG])
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task" can dramatically increase performance with seemingly no downside. In this work, we demonstrate that there can exist a downside after all: bias transfer, or the tendency for biases of the source model to persist even after adapting the model to the target task. Through a combination of synthetic and natural experiments, we show that bias transfer both (a) arises in realistic settings (such as when pre-training on ImageNet or other standard datasets) and (b) can occur even when the target dataset is explicitly de-biased. As transfer-learned models are increasingly deployed in the real world, our work highlights the importance of understanding the limitations of pre-trained source models. Code is available at https://github.com/MadryLab/bias-transfer  ( 2 min )
    A Tutorial on the Spectral Theory of Markov Chains. (arXiv:2207.02296v1 [cs.LG])
    Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, but is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains, and explores their connection to graphs and random walks. We utilize tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding, and only assumes basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.  ( 2 min )
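One of the spectral facts such a tutorial covers, in code: the stationary distribution of an irreducible, aperiodic chain is the eigenvector of the transition matrix's transpose with eigenvalue 1, and the second-largest eigenvalue modulus governs how quickly the chain mixes.

import numpy as np

P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])      # row-stochastic transition matrix

eigvals, eigvecs = np.linalg.eig(P.T)
i = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, i])
pi /= pi.sum()
print(pi)                             # stationary distribution: pi = pi @ P
print(sorted(np.abs(eigvals))[-2])    # second-largest modulus: mixing speed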
    A Deep Model for Partial Multi-Label Image Classification with Curriculum Based Disambiguation. (arXiv:2207.02410v1 [cs.CV])
In this paper, we study the partial multi-label (PML) image classification problem, where each image is annotated with a candidate label set that consists of multiple relevant labels and other noisy labels. Existing PML methods typically design a disambiguation strategy to filter out noisy labels by utilizing prior knowledge with extra assumptions, which unfortunately is unavailable in many real tasks. Furthermore, because the objective function for disambiguation is usually elaborately designed on the whole training set, it can hardly be optimized in a deep model with SGD on mini-batches. In this paper, for the first time we propose a deep model for PML to enhance the representation and discrimination ability. On one hand, we propose a novel curriculum-based disambiguation strategy to progressively identify ground-truth labels by incorporating the varied difficulties of different classes. On the other hand, a consistency regularization is introduced for model retraining to balance fitting identified easy labels and exploiting potential relevant labels. Extensive experimental results on the commonly used benchmark datasets show the proposed method significantly outperforms the SOTA methods.  ( 2 min )
    Scaling Private Deep Learning with Low-Rank and Sparse Gradients. (arXiv:2207.02699v1 [cs.LG])
    Applying Differentially Private Stochastic Gradient Descent (DPSGD) to training modern, large-scale neural networks such as transformer-based models is a challenging task, as the magnitude of noise added to the gradients at each iteration scales with model dimension, hindering the learning capability significantly. We propose a unified framework, $\textsf{LSG}$, that fully exploits the low-rank and sparse structure of neural networks to reduce the dimension of gradient updates, and hence alleviate the negative impacts of DPSGD. The gradient updates are first approximated with a pair of low-rank matrices. Then, a novel strategy is utilized to sparsify the gradients, resulting in low-dimensional, less noisy updates that are yet capable of retaining the performance of neural networks. Empirical evaluation on natural language processing and computer vision tasks shows that our method outperforms other state-of-the-art baselines.  ( 2 min )
    Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users. (arXiv:2207.02726v1 [cs.LG])
    When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users -- stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.  ( 3 min )
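For context on the explanation side, a vanilla gradient saliency map is the simplest method of this kind, sketched in PyTorch below: backpropagate the "low quality" score to the input image and take the absolute per-pixel gradient. model is assumed to be the image-quality classifier, with class 1 meaning low quality.

import torch

def saliency_map(model, image):            # image: (1, C, H, W)
    image = image.clone().requires_grad_(True)
    score = model(image)[0, 1]             # logit of the "low quality" class
    score.backward()
    return image.grad.abs().max(dim=1)[0]  # (1, H, W) per-pixel importance

Highlighted regions can then be shown to technicians as the parts of the exam image driving the low-quality flag.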
    Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction. (arXiv:2207.02724v1 [cs.LG])
Molecular property prediction is essential in chemistry, especially for drug discovery applications. However, available molecular property data is often limited, encouraging the transfer of information from related data. Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing, signaling its potential in molecular property prediction. We present a pre-training procedure for molecular representation learning using reaction data and use it to pre-train a SMILES Transformer. We fine-tune and evaluate the pre-trained model on 12 molecular property prediction tasks from MoleculeNet within physical chemistry, biophysics, and physiology and show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.  ( 2 min )
    Careful seeding for the k-medoids algorithm with incremental k++ cluster construction. (arXiv:2207.02404v1 [cs.LG])
The k-medoids algorithm is a popular variant of the k-means algorithm and is widely used in pattern recognition and machine learning. A main drawback of the k-medoids algorithm is that it can be trapped in local optima. An improved k-medoids algorithm (INCKM) was recently proposed to overcome this drawback, based on constructing a candidate medoid subset with a parameter-choosing procedure, but it may fail when dealing with imbalanced datasets. In this paper, we propose a novel incremental k-medoids algorithm (INCKPP) which dynamically increases the number of clusters from 2 to k through a nonparametric and stochastic k-means++ search procedure. Our algorithm can overcome the parameter selection problem of the improved k-medoids algorithm, improve the clustering performance, and deal with imbalanced datasets very well. However, our algorithm has a weakness in computational efficiency. To address this issue, we propose a fast INCKPP algorithm (called INCKPP$_{sample}$) which preserves the computational efficiency of the simple and fast k-medoids algorithm while improving clustering performance. The proposed algorithm is compared with three state-of-the-art algorithms: the improved k-medoids algorithm (INCKM), the simple and fast k-medoids algorithm (FKM), and the k-means++ algorithm (KPP). Extensive experiments on both synthetic and real-world datasets, including imbalanced datasets, illustrate the effectiveness of the proposed algorithm.  ( 2 min )
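The k-means++ seeding rule that the incremental procedure builds on, sketched: each new seed is a data point drawn with probability proportional to its squared distance from the nearest seed chosen so far (D² sampling), which spreads the initial seeds across the data.

import numpy as np

def kmeanspp_seeds(X, k, rng=np.random.default_rng(0)):
    seeds = [X[rng.integers(len(X))]]                 # first seed: uniform
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

Because the seeds are actual data points, the same routine serves directly as a medoid initializer.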
    Nonparametric Factor Trajectory Learning for Dynamic Tensor Decomposition. (arXiv:2207.02446v1 [cs.LG])
    Tensor decomposition is a fundamental framework to analyze data that can be represented by multi-dimensional arrays. In practice, tensor data is often accompanied by temporal information, namely the time points when the entry values were generated. This information implies abundant, complex temporal variation patterns. However, current methods always assume the factor representations of the entities in each tensor mode are static, and never consider their temporal evolution. To fill this gap, we propose NONparametric FActor Trajectory learning for dynamic tensor decomposition (NONFAT). We place Gaussian process (GP) priors in the frequency domain and conduct inverse Fourier transform via Gauss-Laguerre quadrature to sample the trajectory functions. In this way, we can overcome data sparsity and obtain robust trajectory estimates across long time horizons. Given the trajectory values at specific time points, we use a second-level GP to sample the entry values and to capture the temporal relationship between the entities. For efficient and scalable inference, we leverage the matrix Gaussian structure in the model, introduce a matrix Gaussian posterior, and develop a nested sparse variational learning algorithm. We have shown the advantage of our method in several real-world applications.  ( 2 min )
    Robust Counterfactual Explanations for Tree-Based Ensembles. (arXiv:2207.02739v1 [cs.LG])
    Counterfactual explanations inform ways to achieve a desired outcome from a machine learning model. However, such explanations are not robust to certain real-world changes in the underlying model (e.g., retraining the model, changing hyperparameters, etc.), questioning their reliability in several applications, e.g., credit lending. In this work, we propose a novel strategy -- that we call RobX -- to generate robust counterfactuals for tree-based ensembles, e.g., XGBoost. Tree-based ensembles pose additional challenges in robust counterfactual generation, e.g., they have a non-smooth and non-differentiable objective function, and they can change a lot in the parameter space under retraining on very similar data. We first introduce a novel metric -- that we call Counterfactual Stability -- that attempts to quantify how robust a counterfactual is going to be to model changes under retraining, and comes with desirable theoretical properties. Our proposed strategy RobX works with any counterfactual generation method (base method) and searches for robust counterfactuals by iteratively refining the counterfactual generated by the base method using our metric Counterfactual Stability. We compare the performance of RobX with popular counterfactual generation methods (for tree-based ensembles) across benchmark datasets. The results demonstrate that our strategy generates counterfactuals that are significantly more robust (nearly 100% validity after actual model changes) and also realistic (in terms of local outlier factor) over existing state-of-the-art methods.  ( 3 min )
    Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs. (arXiv:2207.02295v1 [cs.NI])
    Cloud datacenters are exponentially growing both in numbers and size. This increase results in a network activity surge that warrants better congestion avoidance. The resulting challenge is two-fold: (i) designing algorithms that can be custom-tuned to the complex traffic patterns of a given datacenter; but, at the same time (ii) run on low-level hardware with the required low latency of effective Congestion Control (CC). In this work, we present a Reinforcement Learning (RL) based CC solution that learns from certain traffic scenarios and successfully generalizes to others. We then distill the RL neural network policy into binary decision trees to achieve the desired $\mu$sec decision latency required for real-time inference with RDMA. We deploy the distilled policy on NVIDIA NICs in a real network and demonstrate state-of-the-art performance, balancing all tested metrics simultaneously: bandwidth, latency, fairness, and packet drops.  ( 2 min )
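The distillation step in miniature, a sketch with assumed stand-ins nn_policy (the trained RL policy) and sample_states (a generator of observed congestion-control states): fit a small tree to the network's own decisions, then deploy the tree where a neural forward pass would be too slow.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

states = sample_states(100_000)                 # states seen by the RL agent
actions = np.array([nn_policy(s) for s in states])

tree = DecisionTreeClassifier(max_depth=8).fit(states, actions)
print(tree.score(states, actions))              # fidelity of the distilled policy

A depth-8 tree costs at most 8 comparisons per decision, which is what makes a microsecond-scale latency budget on a NIC attainable.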
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v2 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Unsupervised Recurrent Federated Learning for Edge Popularity Prediction in Privacy-Preserving Mobile Edge Computing Networks. (arXiv:2207.00755v2 [cs.MM] UPDATED)
Nowadays wireless communication is rapidly reshaping entire industry sectors. In particular, mobile edge computing (MEC) as an enabling technology for the industrial Internet of things (IIoT) brings powerful computing/storage infrastructure closer to the mobile terminals and thereby significantly lowers the response latency. To reap the benefit of proactive caching at the network edge, precise knowledge of the popularity pattern among the end devices is essential. However, the complex and dynamic nature of the content popularity over space and time as well as the data-privacy requirements in many IIoT scenarios pose tough challenges to its acquisition. In this article, we propose an unsupervised and privacy-preserving popularity prediction framework for MEC-enabled IIoT. The concepts of local and global popularity are introduced and the time-varying popularity of each user is modelled as a model-free Markov chain. On this basis, a novel unsupervised recurrent federated learning (URFL) algorithm is proposed to predict the distributed popularity while achieving privacy preservation and unsupervised training. Simulations indicate that the proposed framework can enhance the prediction accuracy in terms of a reduced root-mean-squared error by up to $60.5\%-68.7\%$. Additionally, manual labeling and violation of users' data privacy are both avoided.
    Progressive Latent Replay for efficient Generative Rehearsal. (arXiv:2207.01562v2 [cs.CV] UPDATED)
    We introduce a new method for internal replay that modulates the frequency of rehearsal based on the depth of the network. While replay strategies mitigate the effects of catastrophic forgetting in neural networks, recent works on generative replay show that performing the rehearsal only on the deeper layers of the network improves the performance in continual learning. However, the generative approach introduces additional computational overhead, limiting its applications. Motivated by the observation that earlier layers of neural networks forget less abruptly, we propose to update network layers with varying frequency using intermediate-level features during replay. This reduces the computational burden by omitting computations for both deeper layers of the generator and earlier layers of the main model. We name our method Progressive Latent Replay and show that it outperforms Internal Replay while using significantly fewer resources.
    Flow Completion Network: Inferring the Fluid Dynamics from Incomplete Flow Information using Graph Neural Networks. (arXiv:2205.04739v2 [physics.flu-dyn] UPDATED)
This paper introduces a novel neural network - flow completion network (FCN) - to infer the fluid dynamics, including the flow field and the force acting on the body, from incomplete flow information, based on a Graph Convolution Attention Network. The FCN is composed of several graph convolution layers and spatial attention layers. It is designed to infer the velocity field and the vortex force contribution of the flow field when combined with the vortex force map (VFM) method. Compared with other neural networks adopted in fluid dynamics, the FCN is capable of dealing with both structured data and unstructured data. The performance of the proposed FCN is assessed on computational fluid dynamics (CFD) data for the flow field around a circular cylinder. The force coefficients predicted by our model are validated against those obtained directly from CFD. Moreover, it is shown that our model effectively utilizes the existing flow field information and the gradient information simultaneously, giving a better performance than the traditional convolution neural network (CNN)-based and deep neural network (DNN)-based models. Specifically, among all the cases of different Reynolds numbers and different proportions of the training dataset, the results show that the proposed FCN achieves a maximum norm mean square error of 5.86% in the test dataset, which is much lower than those of the traditional CNN-based and DNN-based models (42.32% and 15.63% respectively).
    The rise of the lottery heroes: why zero-shot pruning is hard. (arXiv:2202.12400v2 [cs.LG] UPDATED)
Recent advances in deep learning optimization showed that just a subset of parameters is really necessary to successfully train a model. Potentially, such a discovery has broad impact from theory to application; however, it is known that finding these trainable sub-networks is typically a costly process. This inhibits practical applications: can the learned sub-graph structures in deep learning models be found at training time? In this work we explore such a possibility, observing and motivating why common approaches typically fail in the extreme scenarios of interest, and proposing an approach which potentially enables training with reduced computational effort. Experiments on challenging architectures and datasets suggest that such a computational gain is algorithmically accessible, and in particular a trade-off emerges between the accuracy achieved and the training complexity deployed.
    Motley: Benchmarking Heterogeneity and Personalization in Federated Learning. (arXiv:2206.09262v2 [cs.LG] UPDATED)
    Personalized federated learning considers learning models unique to each client in a heterogeneous network. The resulting client-specific models have been purported to improve metrics such as accuracy, fairness, and robustness in federated networks. However, despite a plethora of work in this area, it remains unclear: (1) which personalization techniques are most effective in various settings, and (2) how important personalization truly is for realistic federated applications. To better answer these questions, we propose Motley, a benchmark for personalized federated learning. Motley consists of a suite of cross-device and cross-silo federated datasets from varied problem domains, as well as thorough evaluation metrics for better understanding the possible impacts of personalization. We establish baselines on the benchmark by comparing a number of representative personalized federated learning methods. These initial results highlight strengths and weaknesses of existing approaches, and raise several open questions for the community. Motley aims to provide a reproducible means with which to advance developments in personalized and heterogeneity-aware federated learning, as well as the related areas of transfer learning, meta-learning, and multi-task learning.
    Adversarially Trained Actor Critic for Offline Reinforcement Learning. (arXiv:2202.02446v2 [cs.LG] UPDATED)
    We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
    Unfolding AIS transmission behavior for vessel movement modeling on noisy data leveraging machine learning. (arXiv:2202.13867v2 [cs.LG] UPDATED)
The oceans are a source of an impressive mixture of complex data that could be used to uncover relationships yet to be discovered. Such data comes from the oceans and their surface, such as Automatic Identification System (AIS) messages used for tracking vessels' trajectories. AIS messages are transmitted over radio or satellite at ideally periodic time intervals but vary irregularly over time. As such, this paper aims to model the AIS message transmission behavior through neural networks for forecasting upcoming AIS messages' content from multiple vessels, particularly in a simultaneous approach, despite messages' temporal irregularities as outliers. We present a set of experiments comprising multiple algorithms for forecasting tasks with horizon sizes of varying lengths. Deep learning models (e.g., neural networks) revealed themselves to adequately preserve vessels' spatial awareness regardless of temporal irregularity. We show how convolutional layers, feed-forward networks, and recurrent neural networks can improve such tasks by working together. Experimenting with short, medium, and large-sized sequences of messages, our model achieved 36/37/38% Relative Percentage Difference - the lower, the better - whereas we observed 92/45/96% for Elman's RNN, 51/52/40% for the GRU, and 129/98/61% for the LSTM. These results support our model as a driver for improving the prediction of vessel routes when simultaneously analyzing multiple vessels of diverging types under temporally noisy data.
    SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy. (arXiv:2203.17001v2 [eess.AS] UPDATED)
    Deep learning based singing voice synthesis (SVS) systems have been demonstrated to flexibly generate singing with better qualities, compared to conventional statistical parametric based methods. However, neural systems are generally data-hungry and have difficulty to reach reasonable singing quality with limited public available training data. In this work, we explore different data augmentation methods to boost the training of SVS systems, including several strategies customized to SVS based on pitch augmentation and mix-up augmentation. To further stabilize the training, we introduce the cycle-consistent training strategy. Extensive experiments on two public singing databases demonstrate that our proposed augmentation methods and the stabilizing training strategy can significantly improve the performance on both objective and subjective evaluations.
    ADAST: Attentive Cross-domain EEG-based Sleep Staging Framework with Iterative Self-Training. (arXiv:2107.04470v4 [cs.LG] UPDATED)
    Sleep staging is of great importance in the diagnosis and treatment of sleep disorders. Recently, numerous data-driven deep learning models have been proposed for automatic sleep staging. They mainly train the model on a large public labeled sleep dataset and test it on a smaller one with subjects of interest. However, they usually assume that the train and test data are drawn from the same distribution, which may not hold in real-world scenarios. Unsupervised domain adaption (UDA) has been recently developed to handle this domain shift problem. However, previous UDA methods applied for sleep staging have two main limitations. First, they rely on a totally shared model for the domain alignment, which may lose the domain-specific information during feature extraction. Second, they only align the source and target distributions globally without considering the class information in the target domain, which hinders the classification performance of the model while testing. In this work, we propose a novel adversarial learning framework called ADAST to tackle the domain shift problem in the unlabeled target domain. First, we develop an unshared attention mechanism to preserve the domain-specific features in both domains. Second, we design an iterative self-training strategy to improve the classification performance on the target domain via target domain pseudo labels. We also propose dual distinct classifiers to increase the robustness and quality of the pseudo labels. The experimental results on six cross-domain scenarios validate the efficacy of our proposed framework and its advantage over state-of-the-art UDA methods. The source code is available at https://github.com/emadeldeen24/ADAST.
    Fast Density Estimation for Density-based Clustering Methods. (arXiv:2109.11383v3 [cs.LG] UPDATED)
Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning, since they can deal with non-hyperspherical clusters and are robust to outliers. However, the runtime of density-based algorithms is heavily dominated by finding fixed-radius near neighbors and calculating the density, which is time-consuming. Meanwhile, traditional acceleration methods using indexing techniques such as KD trees are not effective in processing high-dimensional data. In this paper, we propose a fast region query algorithm named fast principal component analysis pruning (called FPCAP) with the help of the fast principal component analysis technique in conjunction with geometric information provided by principal attributes of the data, which can process high-dimensional data and be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application in density-based clustering methods, the FPCAP method was combined with the Density Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. An improved DBSCAN (called IDBSCAN) is thus obtained, which preserves the advantages of DBSCAN while greatly reducing the computation of redundant distances. Experiments on seven benchmark datasets demonstrate that the proposed algorithm improves computational efficiency significantly.
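The pruning principle, sketched below as a generic illustration (not the paper's exact FPCAP procedure): distances in a PCA-projected subspace lower-bound the full Euclidean distance, because projection is orthogonal, so any candidate whose projected distance already exceeds eps cannot be an eps-neighbor and its full distance need never be computed.

import numpy as np

def eps_neighbors(X, i, eps, n_components=2):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                      # projection onto top PCs
    lower = np.linalg.norm(Z - Z[i], axis=1)          # lower bounds on distances
    cand = np.where(lower <= eps)[0]                  # prune everything else
    full = np.linalg.norm(X[cand] - X[i], axis=1)     # exact check on survivors
    return cand[full <= eps]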
    On the Effects of Artificial Data Modification. (arXiv:2110.13968v2 [cs.LG] UPDATED)
Data distortion is commonly applied in vision models during both training (e.g. methods like MixUp and CutMix) and evaluation (e.g. shape-texture bias and robustness). This data modification can introduce artificial information. It is often assumed that the resulting artefacts are detrimental to training, whilst being negligible when analysing models. We investigate these assumptions and conclude that in some cases they are unfounded and lead to incorrect results. Specifically, we show current shape bias identification methods and occlusion robustness measures are biased and propose a fairer alternative for the latter. Subsequently, through a series of experiments we seek to correct and strengthen the community's perception of how augmenting affects learning of vision models. Based on our empirical results we argue that the impact of the artefacts must be understood and exploited rather than eliminated.
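For reference, MixUp - one of the distortions under study - in a few lines: convex-combine two images and their labels with a Beta-distributed coefficient. The blended image carries exactly the kind of artificial information whose side effects the paper examines.

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng()):
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2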
    MoTiAC: Multi-Objective Actor-Critics for Real-Time Bidding. (arXiv:2002.07408v2 [cs.AI] UPDATED)
Online Real-Time Bidding (RTB) is a complex auction game in which advertisers struggle to bid for ad impressions when a user request occurs. Considering display cost, Return on Investment (ROI), and other influential Key Performance Indicators (KPIs), large ad platforms try to balance the trade-off among various goals in dynamic environments. To address the challenge, we propose a Multi-ObjecTive Actor-Critics algorithm based on reinforcement learning (RL), named MoTiAC, for the problem of bidding optimization with various goals. In MoTiAC, objective-specific agents update the global network asynchronously with different goals and perspectives, leading to a robust bidding policy. Unlike previous RL models, the proposed MoTiAC can simultaneously fulfill multi-objective tasks in complicated bidding environments. In addition, we mathematically prove that our model will converge to Pareto optimality. Finally, experiments on a large-scale real-world commercial dataset from Tencent verify the effectiveness of MoTiAC versus a set of recent approaches.
    Enhancing Adversarial Attacks on Single-Layer NVM Crossbar-Based Neural Networks with Power Consumption Information. (arXiv:2207.02764v1 [cs.LG])
    Adversarial attacks on state-of-the-art machine learning models pose a significant threat to the safety and security of mission-critical autonomous systems. This paper considers the additional vulnerability of machine learning models when attackers can measure the power consumption of their underlying hardware platform. In particular, we explore the utility of power consumption information for adversarial attacks on non-volatile memory crossbar-based single-layer neural networks. Our results from experiments with MNIST and CIFAR-10 datasets show that power consumption can reveal important information about the neural network's weight matrix, such as the 1-norm of its columns. That information can be used to infer the sensitivity of the network's loss with respect to different inputs. We also find that surrogate-based black box attacks that utilize crossbar power information can lead to improved attack efficiency.
    Landscape analysis for shallow neural networks: complete classification of critical points for affine target functions. (arXiv:2103.10922v3 [cs.LG] UPDATED)
    In this paper, we analyze the landscape of the true loss of neural networks with one hidden layer and ReLU, leaky ReLU, or quadratic activation. In all three cases, we provide a complete classification of the critical points in the case where the target function is affine and one-dimensional. In particular, we show that there exist no local maxima and clarify the structure of saddle points. Moreover, we prove that non-global local minima can only be caused by `dead' ReLU neurons. In particular, they do not appear in the case of leaky ReLU or quadratic activation. Our approach is of a combinatorial nature and builds on a careful analysis of the different types of hidden neurons that can occur.
    Epistemic Neural Networks. (arXiv:2107.08924v5 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the epistemic neural network (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures. (arXiv:2104.01672v3 [stat.ML] UPDATED)
    Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v4 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we demonstrate the need for the more robust NAS evaluation that NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.
    Machine Learning for Stuttering Identification: Review, Challenges and Future Directions. (arXiv:2107.04057v3 [cs.SD] UPDATED)
    Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetition of sounds. Stuttering identification is an interesting interdisciplinary research problem that involves pathology, psychology, acoustics, and signal processing, which makes it hard and complicated to detect. Recent developments in machine and deep learning have dramatically revolutionized the speech domain; however, minimal attention has been given to stuttering identification. This work fills the gap by trying to bring together researchers from interdisciplinary fields. In this paper, we comprehensively review acoustic features as well as statistical and deep learning based stuttering/disfluency classification methods. We also present several challenges and possible future directions.
    Graph Trees with Attention. (arXiv:2207.02760v1 [cs.LG])
    When dealing with tabular data, models based on regression and decision trees are a popular choice due to the high accuracy they provide on such tasks and their ease of application compared to other model classes. Yet, when it comes to graph-structured data, current tree learning algorithms do not provide tools to manage the structure of the data other than relying on feature engineering. In this work we address the above gap, and introduce Graph Trees with Attention (GTA), a new family of tree-based learning algorithms that are designed to operate on graphs. GTA leverages both the graph structure and the features at the vertices and employs an attention mechanism that allows decisions to concentrate on sub-structures of the graph. We analyze GTA models and show that they are strictly more expressive than plain decision trees. We also demonstrate the benefits of GTA empirically on multiple graph and node prediction benchmarks. In these experiments, GTA always outperformed other tree-based models and often outperformed other types of graph-learning algorithms such as Graph Neural Networks (GNNs) and Graph Kernels. Finally, we also provide an explainability mechanism for GTA, and demonstrate that it can provide intuitive explanations.
    Improved conformalized quantile regression. (arXiv:2207.02808v1 [stat.ML])
    Conformalized quantile regression is a procedure that inherits the advantages of conformal prediction and quantile regression. That is, we use quantile regression to estimate the true conditional quantile and then apply a conformal step on a calibration set to ensure marginal coverage. In this way, we get adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness, as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables, weighted by their permutation importance, with an optimized k-means and to apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both methods on open datasets.
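    For reference, here is a minimal sketch of the classic conformalized quantile regression baseline that the paper improves on; the paper's variant would additionally cluster calibration points by permutation-importance-weighted k-means and run one such conformal step per cluster. The data and models below are illustrative.

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 5, size=(1000, 1))
        y = np.sin(X[:, 0]) + rng.normal(scale=0.1 + 0.2 * X[:, 0])   # heteroscedastic noise

        X_tr, y_tr = X[:600], y[:600]
        X_cal, y_cal = X[600:800], y[600:800]
        X_te = X[800:]

        alpha = 0.1
        lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
        hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)

        # Conformity scores: how far calibration points fall outside the interval.
        scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
        n = len(y_cal)
        q = np.quantile(scores, np.ceil((1 - alpha) * (n + 1)) / n)   # finite-sample correction

        lower = lo.predict(X_te) - q   # widen (or shrink) both ends by the same margin q
        upper = hi.predict(X_te) + q

    The single global margin q is exactly the non-adaptive step the paper replaces with cluster-specific margins.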
    Avoiding Forgetting and Allowing Forward Transfer in Continual Learning via Sparse Networks. (arXiv:2110.05329v3 [cs.LG] UPDATED)
    Using task-specific components within a neural network in continual learning (CL) is a compelling strategy to address the stability-plasticity dilemma in fixed-capacity models without access to past data. Current methods focus only on selecting a sub-network for a new task that reduces forgetting of past tasks. However, this selection could limit the forward transfer of relevant past knowledge that helps in future learning. Our study reveals that satisfying both objectives jointly is more challenging when a unified classifier is used for all classes of the seen tasks, i.e., class-Incremental Learning (class-IL), as it is prone to ambiguities between classes across tasks. Moreover, the challenge increases when the semantic similarity of classes across tasks increases. To address this challenge, we propose a new CL method, named AFAF, that aims to Avoid Forgetting and Allow Forward transfer in class-IL using fixed-capacity models. AFAF allocates a sub-network that enables selective transfer of relevant knowledge to a new task while preserving past knowledge, reusing some of the previously allocated components to utilize the fixed capacity, and addressing class-ambiguities when similarities exist. The experiments show the effectiveness of AFAF in providing models with multiple CL desirable properties, while outperforming state-of-the-art methods on various challenging benchmarks with different semantic similarities.
    Novel Techniques to Assess Predictive Systems and Reduce Their Alarm Burden. (arXiv:2102.05691v3 [cs.LG] UPDATED)
    Machine prediction algorithms (e.g., binary classifiers) are often adopted on the basis of claimed performance using classic metrics such as sensitivity and predictive value. However, classifier performance depends heavily upon the context (workflow) in which the classifier operates. Classic metrics do not reflect the realized utility of a predictor unless certain implicit assumptions are met, and these assumptions cannot be met in many common clinical scenarios. This often results in suboptimal implementations and in disappointment when expected outcomes are not achieved. One common failure mode for classic metrics arises when multiple predictions can be made for the same event, particularly when redundant true positive predictions produce little additional value. This describes many clinical alerting systems. We explain why classic metrics cannot correctly represent predictor performance in such contexts, and introduce an improved performance assessment technique using utility functions to score predictions based on their utility in a specific workflow context. The resulting utility metrics (u-metrics) explicitly account for the effects of temporal relationships on prediction utility. Compared to traditional measures, u-metrics more accurately reflect the real-world costs and benefits of a predictor operating in a live clinical context. The improvement can be significant. We also describe a formal approach to snoozing, a mitigation strategy in which some predictions are suppressed to improve predictor performance by reducing false positives while retaining event capture. Snoozing is especially useful for predictors that generate interruptive alarms. U-metrics correctly measure and predict the performance benefits of snoozing, whereas traditional metrics do not.
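    The following toy sketch illustrates the flavor of utility scoring with snoozing; the utility values, snooze window, and event matching are hypothetical stand-ins, not the authors' definitions.

        from datetime import datetime, timedelta

        SNOOZE = timedelta(hours=6)   # hypothetical snooze window

        def utility(event, already_captured):
            if event is None:
                return -0.1           # false positive: small alarm-burden cost
            if already_captured:
                return 0.0            # redundant true positive adds no value
            return 1.0                # first capture of an event

        def u_score(predictions):
            """predictions: list of (timestamp, matched_event_id or None)."""
            total, captured, last_alarm = 0.0, set(), None
            for t, event in sorted(predictions, key=lambda p: p[0]):
                if last_alarm is not None and t - last_alarm < SNOOZE:
                    continue          # snoozed: suppressed, no cost and no gain
                last_alarm = t
                total += utility(event, event in captured)
                if event is not None:
                    captured.add(event)
            return total

        preds = [(datetime(2022, 1, 1, h), "sepsis-1" if h < 12 else None)
                 for h in (0, 2, 9, 20)]
        # +1.0 first capture, one alarm snoozed, +0.0 redundant capture, -0.1 false alarm
        print(u_score(preds))   # 0.9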
    DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. (arXiv:2108.05877v5 [cs.LG] UPDATED)
    While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline DexMV (Dexterous Manipulation from Videos) for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system to record large-scale demonstrations of a human hand conducting the same tasks. In our novel pipeline, we extract 3D hand and object poses from videos, and propose a novel demonstration translation method to convert human motion to robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations can indeed improve robot learning by a large margin and solve the complex tasks which reinforcement learning alone cannot solve. More details can be found in the project page: https://yzqin.github.io/dexmv
    Histopathology DatasetGAN: Synthesizing Large-Resolution Histopathology Datasets. (arXiv:2207.02712v1 [eess.IV])
    Self-supervised learning (SSL) methods are enabling an increasing number of deep learning models to be trained on image datasets in domains where labels are difficult to obtain. These methods, however, struggle to scale to the high resolutions typical of medical imaging, where they are critical for achieving good generalization on label-scarce datasets. In this work, we propose the Histopathology DatasetGAN (HDGAN) framework, an extension of the DatasetGAN semi-supervised framework for image generation and segmentation that scales well to large-resolution histopathology images. We make several adaptations from the original framework, including updating the generative backbone, selectively extracting latent features from the generator, and switching to memory-mapped arrays. These changes reduce the memory consumption of the framework, improving its applicability to medical imaging domains. We evaluate HDGAN on a thrombotic microangiopathy high-resolution tile dataset, demonstrating strong performance on the high-resolution image-annotation generation task. We hope that this work enables more applications of deep learning models to medical datasets, in addition to encouraging more exploration of self-supervised frameworks within the medical imaging domain.
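    The memory-mapping adaptation is straightforward to illustrate with NumPy (the shapes and file names below are illustrative, not the paper's): feature tensors live on disk and only the slices actually accessed are read into RAM.

        import numpy as np

        # Write large feature tensors once; the file is paged on demand.
        features = np.memmap("hdgan_features.dat", dtype=np.float32, mode="w+",
                             shape=(1000, 512, 64, 64))   # ~8 GB virtual
        features[0] = np.random.rand(512, 64, 64)          # writes go straight to disk
        features.flush()

        # Later, e.g. in a Dataset.__getitem__, reopen read-only and slice lazily:
        ro = np.memmap("hdgan_features.dat", dtype=np.float32, mode="r",
                       shape=(1000, 512, 64, 64))
        tile = np.asarray(ro[42])   # only this tile's bytes are read from disk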
    Learning with Neighbor Consistency for Noisy Labels. (arXiv:2202.02200v2 [cs.CV] UPDATED)
    Recent advances in deep learning have relied on large, labelled datasets to train high-capacity models. However, collecting large datasets in a time- and cost-efficient manner often results in label noise. We present a method for learning from noisy labels that leverages similarities between training examples in feature space, encouraging the prediction of each example to be similar to its nearest neighbours. Compared to training algorithms that use multiple models or distinct stages, our approach takes the form of a simple, additional regularization term. It can be interpreted as an inductive version of the classical, transductive label propagation algorithm. We thoroughly evaluate our method on datasets exhibiting both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, WebVision, Clothing1M, mini-ImageNet-Red) noise, and achieve competitive or state-of-the-art accuracies across all of them.
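    A minimal sketch of a neighbor-consistency regularizer in this spirit is given below; the similarity kernel, neighbour count, and weighting are assumptions rather than the paper's exact formulation.

        import torch
        import torch.nn.functional as F

        def neighbor_consistency_loss(features, logits, k=10):
            feats = F.normalize(features, dim=1)
            sim = feats @ feats.T                          # cosine similarities, (B, B)
            sim.fill_diagonal_(-float("inf"))              # exclude self-matches
            vals, idx = sim.topk(k, dim=1)                 # k nearest neighbours per example
            w = F.softmax(vals, dim=1)                     # similarity-based weights
            probs = F.softmax(logits, dim=1)
            neigh = (w.unsqueeze(-1) * probs[idx]).sum(1)  # weighted neighbour prediction
            neigh = neigh.detach()                         # treat the consensus as a fixed target
            return F.kl_div(F.log_softmax(logits, dim=1), neigh, reduction="batchmean")

        # Usage, with lam a regularization weight (assumed name):
        # loss = F.cross_entropy(logits, noisy_labels) + lam * neighbor_consistency_loss(h, logits)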
    BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization. (arXiv:2207.02763v1 [cs.LG])
    In this paper, a new gradient-based optimization approach that automatically adjusts the learning rate is proposed. This approach can be applied to design both non-adaptive and adaptive learning rates. We first introduce the non-adaptive learning rate optimization method, Binary Forward Exploration (BFE), and then develop the corresponding adaptive per-parameter learning rate method, Adaptive BFE (AdaBFE). This approach offers an alternative way to optimize the learning rate based on the stochastic gradient descent (SGD) algorithm, besides the current non-adaptive learning rate methods, e.g. SGD, momentum, and Nesterov, and the adaptive learning rate methods, e.g. AdaGrad, AdaDelta, and Adam. The purpose of developing this approach is not to beat the benchmarks of other methods but to provide a different perspective on optimizing the gradient descent method, although some comparative studies with previous methods are made in the following sections. This approach is expected to be heuristic and to inspire researchers to improve gradient-based optimization in combination with previous methods.
    Architectural Optimization and Feature Learning for High-Dimensional Time Series Datasets. (arXiv:2202.13486v2 [cs.LG] UPDATED)
    As our ability to sense increases, we are experiencing a transition from data-poor problems, in which the central issue is a lack of relevant data, to data-rich problems, in which the central issue is to identify a few relevant features in a sea of observations. Motivated by applications in gravitational-wave astrophysics, we study the problem of predicting the presence of transient noise artifacts in a gravitational wave detector from a rich collection of measurements from the detector and its environment. We argue that feature learning--in which relevant features are optimized from data--is critical to achieving high accuracy. We introduce models that reduce the error rate by over 60% compared to the previous state of the art, which used fixed, hand-crafted features. Feature learning is useful not only because it improves performance on prediction tasks; the results provide valuable information about patterns associated with phenomena of interest that would otherwise be undiscoverable. In our application, features found to be associated with transient noise provide diagnostic information about its origin and suggest mitigation strategies. Learning in high-dimensional settings is challenging. Through experiments with a variety of architectures, we identify two key factors in successful models: sparsity, for selecting relevant variables within the high-dimensional observations; and depth, which confers flexibility for handling complex interactions and robustness with respect to temporal variations. We illustrate their significance through systematic experiments on real detector data. Our results provide experimental corroboration of common assumptions in the machine-learning community and have direct applicability to improving our ability to sense gravitational waves, as well as to many other problem settings with similarly high-dimensional, noisy, or partly irrelevant data.
    Self-supervised Detransformation Autoencoder for Representation Learning in Open Set Recognition. (arXiv:2105.13557v2 [cs.LG] UPDATED)
    The objective of open set recognition (OSR) is to learn a classifier that can reject unknown samples while classifying the known classes accurately. In this paper, we propose a self-supervision method, Detransformation Autoencoder (DTAE), for the OSR problem. The proposed method learns representations that are invariant to the transformations of the input data. Experiments on several standard image datasets indicate that the pre-training process significantly improves the model performance in OSR tasks. Meanwhile, our proposed self-supervision method achieves significant gains in detecting the unknown class and classifying the known classes. Moreover, our analysis indicates that DTAE can yield representations that contain more target class information and less transformation information than RotNet.
    Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture. (arXiv:2112.08534v2 [cs.LG] UPDATED)
    We introduce the Momentum Transformer, an attention-based deep learning architecture which outperforms benchmark momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM) architectures, which are sequential in nature, the attention mechanism provides our architecture with a direct connection to all previous time-steps. Our architecture enables us to learn longer-term dependencies, improves performance when considering returns net of transaction costs and naturally adapts to new market regimes, such as during the SARS-CoV-2 crisis. The Momentum Transformer is inherently interpretable, providing us with greater insights into our deep learning momentum trading strategy, including how it blends different classical strategies and the past time-steps which are of the greatest significance to the model.
    Detecting and Diagnosing Terrestrial Gravitational-Wave Mimics Through Feature Learning. (arXiv:2203.05086v2 [astro-ph.IM] UPDATED)
    As engineered systems grow in complexity, there is an increasing need for automatic methods that can detect, diagnose, and even correct transient anomalies that inevitably arise and can be difficult or impossible to diagnose and fix manually. Among the most sensitive and complex systems of our civilization are the detectors that search for incredibly small variations in distance caused by gravitational waves -- phenomena originally predicted by Albert Einstein to emerge and propagate through the universe as the result of collisions between black holes and other massive objects in deep space. The extreme complexity and precision of such detectors causes them to be subject to transient noise issues that can significantly limit their sensitivity and effectiveness. In this work, we present a demonstration of a method that can detect and characterize emergent transient anomalies of such massively complex systems. We illustrate the performance, precision, and adaptability of the automated solution via one of the prevalent issues limiting gravitational-wave discoveries: noise artifacts of terrestrial origin that contaminate gravitational wave observatories' highly sensitive measurements and can obscure or even mimic the faint astrophysical signals for which they are listening. Specifically, we demonstrate how a highly interpretable convolutional classifier can automatically learn to detect transient anomalies from auxiliary detector data without needing to observe the anomalies themselves. We also illustrate several other useful features of the model, including how it performs automatic variable selection to reduce tens of thousands of auxiliary data channels to only a few relevant ones; how it identifies behavioral signatures predictive of anomalies in those channels; and how it can be used to investigate individual anomalies and the channels associated with them.
    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v3 [hep-lat] UPDATED)
    Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Artificial Intelligence-Assisted Optimization and Multiphase Analysis of Polygon PEM Fuel Cells. (arXiv:2205.06768v2 [cs.NE] UPDATED)
    This article presents new hexagonal and pentagonal PEM fuel cell models, which were optimized to achieve improved cell performance. The input parameters of the multi-objective optimization algorithm were the pressure and temperature at the inlet, and the consumption and output powers were the objective parameters. The output data of the numerical simulation were used to train deep neural networks and were then modeled with polynomial regression. The target functions were extracted using the Response Surface Method (RSM), and the targets were optimized using the multi-objective genetic algorithm NSGA-II. Compared to the base model, the optimized pentagonal and hexagonal models increase the output current density by 21.8% and 39.9%, respectively.
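    As a sketch of the optimization stage only, the snippet below runs NSGA-II with the pymoo library; the two analytic objectives are hypothetical stand-ins for the paper's trained surrogate models of output power and consumption.

        import numpy as np
        from pymoo.core.problem import Problem
        from pymoo.algorithms.moo.nsga2 import NSGA2
        from pymoo.optimize import minimize

        class FuelCellProblem(Problem):
            def __init__(self):
                # decision variables: inlet pressure [atm] and inlet temperature [K]
                super().__init__(n_var=2, n_obj=2,
                                 xl=np.array([1.0, 300.0]), xu=np.array([3.0, 360.0]))

            def _evaluate(self, x, out, *args, **kwargs):
                p, t = x[:, 0], x[:, 1]
                power = p * np.sqrt(t)               # placeholder surrogate: output power
                consumption = 0.5 * p**2 + 0.01 * t  # placeholder surrogate: consumption
                # NSGA-II minimizes, so negate the objective we want to maximize.
                out["F"] = np.column_stack([-power, consumption])

        res = minimize(FuelCellProblem(), NSGA2(pop_size=40), ("n_gen", 50),
                       seed=1, verbose=False)
        print(res.F[:5])   # a few points on the Pareto front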
    Speech Denoising in the Waveform Domain with Self-Attention. (arXiv:2202.07790v2 [cs.SD] UPDATED)
    In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed method outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics.
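    A loss of the kind described, combining a waveform term with multi-resolution spectrogram terms, can be sketched as follows; the FFT sizes and weighting are common choices, not necessarily the paper's exact configuration.

        import torch

        def multi_res_stft_loss(pred, target,
                                resolutions=((512, 128), (1024, 256), (2048, 512))):
            loss = 0.0
            for n_fft, hop in resolutions:
                window = torch.hann_window(n_fft, device=pred.device)
                P = torch.stft(pred, n_fft, hop_length=hop, window=window,
                               return_complex=True).abs()
                T = torch.stft(target, n_fft, hop_length=hop, window=window,
                               return_complex=True).abs()
                sc = torch.norm(T - P, p="fro") / torch.norm(T, p="fro")    # spectral convergence
                mag = torch.nn.functional.l1_loss(torch.log(P + 1e-7),
                                                  torch.log(T + 1e-7))      # log-magnitude L1
                loss = loss + sc + mag
            return loss / len(resolutions)

        pred, target = torch.randn(4, 16000), torch.randn(4, 16000)
        total = torch.nn.functional.l1_loss(pred, target) + multi_res_stft_loss(pred, target)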
    Deep Learning Approximation of Diffeomorphisms via Linear-Control Systems. (arXiv:2110.12393v2 [math.OC] UPDATED)
    In this paper we propose a Deep Learning architecture to approximate diffeomorphisms diffeotopic to the identity. We consider a control system of the form $\dot x = \sum_{i=1}^lF_i(x)u_i$, with linear dependence in the controls, and we use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points. Despite the simplicity of the control system, it has been recently shown that a Universal Approximation Property holds. The problem of minimizing the sum of the training error and of a regularizing term induces a gradient flow in the space of admissible controls. A possible training procedure for the discrete-time neural network consists in projecting the gradient flow onto a finite-dimensional subspace of the admissible controls. An alternative approach relies on an iterative method based on Pontryagin Maximum Principle for the numerical resolution of Optimal Control problems. Here the maximization of the Hamiltonian can be carried out with an extremely low computational effort, owing to the linear dependence of the system in the control variables.
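    A numerical sketch of the flow map makes the construction concrete; the vector fields and piecewise-constant controls below are toy choices (the paper obtains the controls by gradient flow or via the Pontryagin Maximum Principle).

        import numpy as np

        def F1(x):
            return np.stack([-x[:, 1], x[:, 0]], axis=1)   # rotation field
        def F2(x):
            return x                                       # dilation field

        def flow(points, controls, dt=0.01):
            """Explicit-Euler approximation of the diffeomorphism acting on an ensemble."""
            x = points.copy()
            for u1, u2 in controls:                        # one control pair per time step
                x = x + dt * (u1 * F1(x) + u2 * F2(x))
            return x

        ensemble = np.random.default_rng(0).normal(size=(100, 2))
        controls = [(1.0, 0.1)] * 100                      # constant controls over [0, 1]
        print(flow(ensemble, controls)[:3])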
    A Recurrent Differentiable Engine for Modeling Tensegrity Robots Trainable with Low-Frequency Data. (arXiv:2203.00041v2 [cs.RO] UPDATED)
    Tensegrity robots, composed of rigid rods and flexible cables, are difficult to accurately model and control given the presence of complex dynamics and a high number of DoFs. Differentiable physics engines have been recently proposed as a data-driven approach for model identification of such complex robotic systems. These engines are often executed at a high frequency to achieve accurate simulation. Ground truth trajectories for training differentiable engines, however, are not typically available at such high frequencies due to limitations of real-world sensors. The present work focuses on this frequency mismatch, which impacts the modeling accuracy. We propose a recurrent structure for a differentiable physics engine of tensegrity robots, which can be trained effectively even with low-frequency trajectories. To train this new recurrent engine in a robust way, this work introduces, relative to prior work: (i) a new implicit integration scheme, (ii) a progressive training pipeline, and (iii) a differentiable collision checker. A model of NASA's icosahedron SUPERballBot on MuJoCo is used as the ground truth system to collect training data. Simulated experiments show that once the recurrent differentiable engine has been trained given the low-frequency trajectories from MuJoCo, it is able to match the behavior of MuJoCo's system. The criterion for success is whether a locomotion strategy learned using the differentiable engine can be transferred back to the ground-truth system and result in a similar motion. Notably, the amount of ground truth data needed to train the differentiable engine, such that the policy is transferable to the ground truth system, is 1% of the data needed to train the policy directly on the ground-truth system.
    Benchmarking of DL Libraries and Models on Mobile Devices. (arXiv:2202.06512v2 [cs.LG] UPDATED)
    Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role, as do algorithms and hardware. Unfortunately, no prior work has dived deep into the ecosystem of modern DL libs and provided quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications for the different roles in the DL lib ecosystem.
    Reconstructing Nonlinear Dynamical Systems from Multi-Modal Time Series. (arXiv:2111.02922v3 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is increasing interest in harnessing machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are developed and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.
    Adversarial Mask: Real-World Universal Adversarial Attack on Face Recognition Models. (arXiv:2111.10759v2 [cs.CV] UPDATED)
    Deep learning-based facial recognition (FR) models have demonstrated state-of-the-art performance in the past few years, even when wearing protective medical face masks became commonplace during the COVID-19 pandemic. Given the outstanding performance of these models, the machine learning research community has shown increasing interest in challenging their robustness. Initially, researchers presented adversarial attacks in the digital domain, and later the attacks were transferred to the physical domain. However, in many cases, attacks in the physical domain are conspicuous, and thus may raise suspicion in real-world environments (e.g., airports). In this paper, we propose Adversarial Mask, a physical universal adversarial perturbation (UAP) against state-of-the-art FR models that is applied on face masks in the form of a carefully crafted pattern. In our experiments, we examined the transferability of our adversarial mask to a wide range of FR model architectures and datasets. In addition, we validated our adversarial mask's effectiveness in real-world experiments (CCTV use case) by printing the adversarial pattern on a fabric face mask. In these experiments, the FR system was only able to identify 3.34% of the participants wearing the mask (compared to a minimum of 83.34% with other evaluated masks). A demo of our experiments can be found at: https://youtu.be/_TXkDO5z11w.
    Two-Sample Testing in Reinforcement Learning. (arXiv:2201.08078v2 [cs.LG] UPDATED)
    Value-based reinforcement-learning algorithms have shown strong performance in games, robotics, and other real-world applications. The most popular sample-based method is $Q$-Learning. It performs updates by adjusting the current $Q$-estimate towards the observed reward and the maximum of the $Q$-estimates of the next state. This procedure introduces maximization bias, which approaches like Double $Q$-Learning attempt to mitigate. We frame the bias problem statistically and consider it an instance of estimating the maximum expected value (MEV) of a set of random variables. We propose the $T$-Estimator (TE), based on two-sample testing for the mean, that flexibly interpolates between over- and underestimation by adjusting the significance level of the underlying hypothesis tests. A generalization, termed the $K$-Estimator (KE), obeys the same bias and variance bounds as the TE while relying on a nearly arbitrary kernel function. We introduce modifications of $Q$-Learning and the Bootstrapped Deep $Q$-Network (BDQN) using the TE and the KE. Furthermore, we propose an adaptive variant of the TE-based BDQN that dynamically adjusts the significance level to minimize the absolute estimation bias. All proposed estimators and algorithms are thoroughly tested and validated on diverse tasks and environments, illustrating the bias control and performance potential of the TE and KE.
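    The two tabular updates at issue can be sketched directly; the max over next-state estimates in the first update is the source of the maximization bias, while Double Q-Learning decouples action selection from evaluation.

        import numpy as np

        def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
            target = r + gamma * Q[s_next].max()           # max introduces upward bias
            Q[s, a] += alpha * (target - Q[s, a])

        def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
            if np.random.rand() < 0.5:
                a_star = QA[s_next].argmax()               # select with one estimator...
                QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])  # ...evaluate with the other
            else:
                a_star = QB[s_next].argmax()
                QB[s, a] += alpha * (r + gamma * QA[s_next, a_star] - QB[s, a])

    The paper's T-Estimator and K-Estimator replace this hard select/evaluate split with hypothesis-test-based estimates of the maximum expected value.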
    Expectation Distance-based Distributional Clustering for Noise-Robustness. (arXiv:2110.08871v3 [cs.LG] UPDATED)
    This paper presents a clustering technique that reduces susceptibility to data noise by learning and clustering the data distribution and then assigning the data to the cluster of its distribution, thereby reducing the impact of noise on clustering results. The method introduces a new distance among distributions, namely the expectation distance (denoted ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$ for $2$-Wasserstein): the latter essentially depends only on the marginal distributions, while the former also employs information about the joint distributions. Using the ED, the paper extends the classical $K$-means and $K$-medoids clustering to operate over data distributions (rather than raw data) and introduces $K$-medoids using $W_2$. The paper also presents closed-form expressions of the ED distance measure for the case where the uncertainty is Gaussian. Implementation results of the proposed ED and $W_2$ distance measures for clustering real-world weather data are also presented, which involve efficiently extracting and using the underlying uncertainty information in the form of means and variances (which, for example, is adequate to characterize Gaussian distributions). The results show striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. This is because while $W_2$ employs only the marginal distributions, ignoring the correlations, the proposed ED also uses the joint distributions, factoring the correlations into the distance measures.
    SE(3) Equivariant Graph Neural Networks with Complete Local Frames. (arXiv:2110.14811v2 [cs.CE] UPDATED)
    Group equivariance (e.g. SE(3) equivariance) is a critical physical symmetry in science, from classical and quantum physics to computational biology. It enables robust and accurate prediction under arbitrary reference transformations. In light of this, great efforts have been put into encoding this symmetry into deep neural networks, which has been shown to improve the generalization performance and data efficiency for downstream tasks. Constructing an equivariant neural network generally brings high computational costs to ensure expressiveness. Therefore, how to better trade off expressiveness and computational efficiency plays a core role in the design of equivariant deep learning models. In this paper, we propose a framework to construct SE(3) equivariant graph neural networks that can approximate geometric quantities efficiently. Inspired by differential geometry and physics, we introduce equivariant local complete frames to graph neural networks, such that tensor information at given orders can be projected onto the frames. The local frame is constructed to form an orthonormal basis that avoids direction degeneration and ensures completeness. Since the frames are built only by cross product operations, our method is computationally efficient. We evaluate our method on two tasks: Newton mechanics modeling and equilibrium molecule conformation generation. Extensive experimental results demonstrate that our model achieves the best or competitive performance on two types of datasets.
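    A local frame built purely from cross products can be sketched as follows (how the paper seeds the frame from node positions may differ in detail): rotating the inputs rotates the frame, so scalars projected onto it are invariant.

        import numpy as np

        def local_frame(x_i, x_j):
            a = x_i - x_j
            a = a / np.linalg.norm(a)
            b = np.cross(x_i, x_j)
            b = b / np.linalg.norm(b)      # assumes x_i and x_j are not parallel
            c = np.cross(a, b)             # completes a right-handed orthonormal basis
            return np.stack([a, b, c])     # (3, 3): rows are the frame vectors

        x_i, x_j = np.array([1.0, 0.2, 0.3]), np.array([0.1, 1.0, -0.4])
        frame = local_frame(x_i, x_j)
        print(frame @ frame.T)             # ~ identity: the basis is orthonormal

        # A vector v projects to invariant scalars s = frame @ v, and is
        # recovered equivariantly as frame.T @ s.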
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v5 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series which exhibit large jumps. Our contributions are as follows. First, we propose a model called the L\'evy induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable L\'evy motion to model complex time series data and solves the problem through neural network approximation. Second, we theoretically prove the convergence of our algorithm with respect to the hyper-parameters of the neural network, and obtain an error bound without the curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find that accuracy increases through the use of non-Gaussian L\'evy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of L\'evy motion, and prediction lengths.
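    An Euler-type discretization of such a model is easy to sketch; the drift and diffusion functions below are simple stand-ins for the paper's neural networks, and the alpha-stable increments are drawn with SciPy.

        import numpy as np
        from scipy.stats import levy_stable

        def drift(x):
            return -0.5 * x                      # placeholder for a neural network

        def diffusion(x):
            return 0.2 + 0.1 * np.abs(x)         # placeholder for a neural network

        def simulate(x0=1.0, alpha=1.7, T=1.0, n=1000, seed=0):
            dt = T / n
            # alpha-stable increments scale like dt**(1/alpha), not sqrt(dt)
            dL = levy_stable.rvs(alpha, 0.0, scale=dt ** (1.0 / alpha), size=n,
                                 random_state=seed)
            x = np.empty(n + 1)
            x[0] = x0
            for k in range(n):
                x[k + 1] = x[k] + drift(x[k]) * dt + diffusion(x[k]) * dL[k]
            return x

        path = simulate()
        print(path[-5:])   # heavy-tailed increments produce occasional large jumps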
    A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges. (arXiv:2110.14051v3 [cs.CV] UPDATED)
    Machine learning models often encounter samples that diverge from the training distribution. Failure to recognize an out-of-distribution (OOD) sample, and consequently assigning it an in-distribution class label, significantly compromises the reliability of a model. The problem has gained significant attention due to its importance for safely deploying models in open-world settings. Detecting OOD samples is challenging due to the intractability of modeling all possible unknown distributions. To date, several research domains tackle the problem of detecting unfamiliar samples, including anomaly detection, novelty detection, one-class learning, open set recognition, and out-of-distribution detection. Despite having similar and shared concepts, out-of-distribution, open-set, and anomaly detection have been investigated independently. Accordingly, these research avenues have not cross-pollinated, creating research barriers. While some surveys intend to provide an overview of these approaches, they seem to focus only on a specific domain without examining the relationship between different domains. This survey aims to provide a cross-domain and comprehensive review of numerous eminent works in the respective areas while identifying their commonalities. Researchers can benefit from the overview of research advances in different fields and develop future methodology synergistically. Furthermore, to the best of our knowledge, while there are surveys on anomaly detection or one-class learning, there is no comprehensive or up-to-date survey on out-of-distribution detection, which our survey covers extensively. Finally, having a unified cross-domain perspective, we discuss and shed light on future lines of research, intending to bring these fields closer together.
    Quantum Logic Gate Synthesis as a Markov Decision Process. (arXiv:1912.12002v2 [quant-ph] UPDATED)
    Reinforcement learning has witnessed recent applications to a variety of tasks in quantum programming. The underlying assumption is that those tasks could be modeled as Markov Decision Processes (MDPs). Here, we investigate the feasibility of this assumption by exploring its consequences for two fundamental tasks in quantum programming: state preparation and gate compilation. By forming discrete MDPs, focusing exclusively on the single-qubit case (both with and without noise), we solve for the optimal policy exactly through policy iteration. We find optimal paths that correspond to the shortest possible sequence of gates to prepare a state, or compile a gate, up to some target accuracy. As an example, we find sequences of $H$ and $T$ gates with length as small as $11$ producing $\sim 99\%$ fidelity for states of the form $(HT)^{n} |0\rangle$ with values as large as $n=10^{10}$. In the presence of gate noise, we demonstrate how the optimal policy adapts to the effects of noisy gates in order to achieve a higher state fidelity. Our work shows that one can meaningfully impose a discrete, stochastic and Markovian nature to a continuous, deterministic and non-Markovian quantum evolution, and provides theoretical insight into why reinforcement learning may be successfully used to find optimally short gate sequences in quantum programming.
    Astroconformer: Inferring Surface Gravity of Stars from Stellar Light Curves with Transformer. (arXiv:2207.02787v1 [astro-ph.SR])
    We introduce Astroconformer, a Transformer-based model to analyze stellar light curves from the Kepler mission. We demonstrate that Astroconformer can robustly infer the stellar surface gravity as a supervised task. Importantly, as the Transformer captures long-range information in the time series, it outperforms the state-of-the-art data-driven method in the field, and the critical role of self-attention is demonstrated through ablation experiments. Furthermore, the attention map from Astroconformer exemplifies the long-range correlation information learned by the model, leading to a more interpretable deep learning approach for asteroseismology. Besides data from Kepler, we also show that the method can generalize to sparse-cadence light curves from the Rubin Observatory, paving the way for a new era of asteroseismology, harnessing information from long-cadence ground-based observations.
    Deep Learning-based automated classification of Chinese Speech Sound Disorders. (arXiv:2205.11748v4 [cs.SD] CROSS LISTED)
    This article describes a system for analyzing acoustic data to assist in the diagnosis and classification of children's speech sound disorders (SSDs) using a computer. The analysis concentrated on identifying and categorizing four distinct types of Chinese SSDs. The study collected and generated a speech corpus containing 2540 stopping, backing, final consonant deletion process (FCDP), and affrication samples from 90 children aged 3-6 years with normal or pathological articulatory features. Each recording was accompanied by a detailed diagnostic annotation by two speech-language pathologists (SLPs). Classification of the speech samples was accomplished using three well-established neural network models for image classification. The feature maps were created using three sets of Mel-frequency cepstral coefficient (MFCC) parameters extracted from speech sounds and aggregated into a three-dimensional data structure as model input. We employed six techniques for data augmentation to augment the available dataset while avoiding overfitting. The experiments examine the usability of four different categories of Chinese phrases and characters. Experiments with different data subsets demonstrate the system's ability to accurately detect the analyzed pronunciation disorders. The best multi-class classification using a single Chinese phrase achieves an accuracy of 74.4 percent.
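    The 3-D MFCC input can be sketched with librosa; since the exact parameter sets are not given in the abstract, static MFCCs plus first- and second-order deltas are used here as the three channels, and a bundled example clip stands in for a child's recording.

        import numpy as np
        import librosa

        y, sr = librosa.load(librosa.ex("trumpet"))          # stand-in audio clip

        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
        d1 = librosa.feature.delta(mfcc)                     # first-order deltas
        d2 = librosa.feature.delta(mfcc, order=2)            # second-order deltas

        features = np.stack([mfcc, d1, d2])                  # (3, 13, frames): image-like input
        print(features.shape)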
    Federated Neural Architecture Search. (arXiv:2002.06352v5 [cs.LG] UPDATED)
    To preserve user privacy while enabling mobile intelligence, techniques have been proposed to train deep neural networks on decentralized data. However, training over decentralized data makes the design of neural architectures, which was already difficult, even harder. Such difficulty is further amplified when designing and deploying different neural architectures for heterogeneous mobile platforms. In this work, we propose to incorporate automatic neural architecture search into decentralized training, as a new DNN training paradigm called Federated Neural Architecture Search, namely federated NAS. To deal with the primary challenge of limited on-client computational and communication resources, we present FedNAS, a highly optimized framework for efficient federated NAS. FedNAS fully exploits the key opportunity that model candidates need not be fully re-trained during the architecture search process, and incorporates three key optimizations: parallel candidate training on partial clients, early dropping of candidates with inferior performance, and dynamic round numbers. Tested on large-scale datasets and typical CNN architectures, FedNAS achieves model accuracy comparable to that of state-of-the-art NAS algorithms that train models with centralized data, and also reduces the client cost by up to two orders of magnitude compared to a straightforward design of federated NAS.
    A multi-task network approach for calculating discrimination-free insurance prices. (arXiv:2207.02799v1 [cs.LG])
    In applications of predictive modeling, such as insurance pricing, indirect or proxy discrimination is an issue of major concern. Namely, there exists the possibility that protected policyholder characteristics are implicitly inferred from non-protected ones by predictive models, and thus have an undesirable (or illegal) impact on prices. A technical solution to this problem relies on building a best-estimate model using all policyholder characteristics (including protected ones) and then averaging out the protected characteristics when calculating individual prices. However, such approaches require full knowledge of policyholders' protected characteristics, which may in itself be problematic. Here, we address this issue by using a multi-task neural network architecture for claim predictions, which can be trained using only partial information on protected characteristics and which produces prices that are free from proxy discrimination. We demonstrate the use of the proposed model and find that its predictive accuracy is comparable to that of a conventional feedforward neural network (on full information). However, this multi-task network has clearly superior performance in the case of partially missing policyholder information.
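    The averaging construction behind discrimination-free prices is compact enough to sketch; the best-estimate model and the marginal of the protected attribute below are illustrative.

        # Sketch: a best-estimate model mu(x, d) uses the protected attribute d;
        # the discrimination-free price integrates d out under its marginal.
        def discrimination_free_price(mu, x, d_values, d_probs):
            """h*(x) = sum_d mu(x, d) * P(D = d), using the marginal of D."""
            return sum(p * mu(x, d) for d, p in zip(d_values, d_probs))

        # Hypothetical best-estimate model with a binary protected attribute d.
        def mu(x, d):
            return 100.0 + 3.0 * x + 15.0 * d

        x = 2.0
        d_values, d_probs = [0, 1], [0.6, 0.4]   # empirical marginal of D
        print(discrimination_free_price(mu, x, d_values, d_probs))   # 112.0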
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v2 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and preferable bounds in better cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    Simple and Efficient Heterogeneous Graph Neural Network. (arXiv:2207.02547v1 [cs.LG])
    Heterogeneous graph neural networks (HGNNs) deliver the powerful capability to embed rich structural and semantic information of a heterogeneous graph into low-dimensional node representations. Existing HGNNs usually learn to embed information using a hierarchical attention mechanism and repeated neighbor aggregation, suffering from unnecessary complexity and redundant computation. This paper proposes the Simple and Efficient Heterogeneous Graph Neural Network (SeHGNN), which reduces this excess complexity by avoiding overused node-level attention within the same relation and by pre-computing the neighbor aggregation in the pre-processing stage. Unlike previous work, SeHGNN utilizes a light-weight parameter-free neighbor aggregator to learn structural information for each metapath, and a transformer-based semantic aggregator to combine semantic information across metapaths for the final embedding of each node. As a result, SeHGNN offers a simple network structure, high prediction accuracy, and fast training speed. Extensive experiments on five real-world heterogeneous graphs demonstrate the superiority of SeHGNN over the state of the art in both accuracy and training speed. Code is available at https://github.com/ICT-GIMLab/SeHGNN.
    Transformers discover an elementary calculation system exploiting local attention and grid-like problem representation. (arXiv:2207.02536v1 [cs.LG])
    Mathematical reasoning is one of the most impressive achievements of human intellect but remains a formidable challenge for artificial intelligence systems. In this work we explore whether modern deep learning architectures can learn to solve a symbolic addition task by discovering effective arithmetic procedures. Although the problem might seem trivial at first glance, generalizing arithmetic knowledge to operations involving a higher number of terms, possibly composed by longer sequences of digits, has proven extremely challenging for neural networks. Here we show that universal transformers equipped with local attention and adaptive halting mechanisms can learn to exploit an external, grid-like memory to carry out multi-digit addition. The proposed model achieves remarkable accuracy even when tested with problems requiring extrapolation outside the training distribution; most notably, it does so by discovering human-like calculation strategies such as place value alignment.
    A Hybrid Approach for Binary Classification of Imbalanced Data. (arXiv:2207.02738v1 [cs.LG])
    Binary classification with an imbalanced dataset is challenging. Models tend to consider all samples as belonging to the majority class. Although existing solutions such as sampling methods, cost-sensitive methods, and ensemble learning methods improve the poor accuracy of the minority class, these methods are limited by overfitting problems or cost parameters that are difficult to decide. We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning with deep neural network classifiers. We evaluate the performance on eight imbalanced public datasets in terms of recall, G-mean, and AUC. The results show that our model outperforms state-of-the-art methods.
    Text Enriched Sparse Hyperbolic Graph Convolutional Networks. (arXiv:2207.02368v1 [cs.IR])
    Heterogeneous networks, which connect informative nodes containing text with different edge types, are routinely used to store and process information in various real-world applications. Graph Neural Networks (GNNs) and their hyperbolic variants provide a promising approach to encode such networks in a low-dimensional latent space through neighborhood aggregation and hierarchical feature extraction, respectively. However, these approaches typically ignore metapath structures and the available semantic information. Furthermore, these approaches are sensitive to the noise present in the training data. To tackle these limitations, in this paper, we propose Text Enriched Sparse Hyperbolic Graph Convolution Network (TESH-GCN) to capture the graph's metapath structures using semantic signals and further improve prediction in large heterogeneous graphs. In TESH-GCN, we extract semantic node information, which successively acts as a connection signal to extract relevant nodes' local neighborhood and graph-level metapath features from the sparse adjacency tensor in a reformulated hyperbolic graph convolution layer. These extracted features in conjunction with semantic features from the language model (for robustness) are used for the final downstream task. Experiments on various heterogeneous graph datasets show that our model outperforms the current state-of-the-art approaches by a large margin on the task of link prediction. We also report a reduction in both the training time and model parameters compared to the existing hyperbolic approaches through a reformulated hyperbolic graph convolution. Furthermore, we illustrate the robustness of our model by experimenting with different levels of simulated noise in both the graph structure and text, and also, present a mechanism to explain TESH-GCN's prediction by analyzing the extracted metapaths.
    Cascaded Deep Hybrid Models for Multistep Household Energy Consumption Forecasting. (arXiv:2207.02589v1 [cs.LG])
    Sustainability requires increased energy efficiency with minimal waste. Future power systems should thus provide high levels of flexibility in controlling energy consumption. Precise projections of future energy demand/load at the aggregate and individual site levels are of great importance for decision makers and professionals in the energy industry. Forecasting energy loads has become more advantageous for energy providers and customers, allowing them to establish an efficient production strategy to satisfy demand. This study introduces two hybrid cascaded models for forecasting multistep household power consumption at different resolutions. The first model integrates the Stationary Wavelet Transform (SWT), as an efficient signal preprocessing technique, with Convolutional Neural Networks and Long Short-Term Memory (LSTM). The second hybrid model combines SWT with a self-attention based neural network architecture named the transformer. The major constraint of using time-frequency analysis methods such as SWT in multistep energy forecasting problems is that they require sequential signals, making signal reconstruction problematic in multistep forecasting applications. The cascaded models can efficiently address this problem by using the recursive outputs. Experimental results show that the proposed hybrid models achieve superior prediction performance compared to existing multistep power consumption prediction methods. The results will pave the way for more accurate and reliable forecasting of household power consumption.
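    The SWT preprocessing stage can be sketched with PyWavelets; the wavelet choice and decomposition level are assumptions, and a random series stands in for a household load.

        import numpy as np
        import pywt

        load = np.random.default_rng(0).normal(size=1024)   # stand-in load series;
                                                            # length must be divisible by 2**level
        coeffs = pywt.swt(load, wavelet="db4", level=3)     # list of (approx, detail) pairs

        for band, (cA, cD) in enumerate(coeffs, start=1):
            print(band, cA.shape, cD.shape)                 # same length as the input: no decimation

    Unlike the DWT, the SWT is redundant (undecimated), which is what makes the recursive, step-by-step reconstruction used by the cascaded models feasible in multistep forecasting.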
    Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design. (arXiv:2207.02575v1 [cs.LG])
    While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, Pedel, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that Pedel yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
    The Intrinsic Manifolds of Radiological Images and their Role in Deep Learning. (arXiv:2207.02797v1 [eess.IV])
    The manifold hypothesis is a core mechanism behind the success of deep learning, so understanding the intrinsic manifold structure of image data is central to studying how neural networks learn from the data. Intrinsic dataset manifolds and their relationship to learning difficulty have recently begun to be studied for the common domain of natural images, but little such research has been attempted for radiological images. We address this here. First, we compare the intrinsic manifold dimensionality of radiological and natural images. We also investigate the relationship between intrinsic dimensionality and generalization ability over a wide range of datasets. Our analysis shows that natural image datasets generally have a higher number of intrinsic dimensions than radiological images. However, the relationship between generalization ability and intrinsic dimensionality is much stronger for medical images, which could be explained by radiological images having intrinsic features that are more difficult to learn. These results give a more principled underpinning for the intuition that radiological images can be more challenging to apply deep learning to than the natural image datasets common in machine learning research. We believe that, rather than directly applying models developed for natural images to the radiological imaging domain, more care should be taken in developing architectures and algorithms that are tailored to the specific characteristics of this domain. The research shown in our paper, demonstrating these characteristics and the differences from natural images, is an important first step in this direction.
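    Intrinsic dimensionality can be estimated from nearest-neighbour distance ratios; whether the paper uses exactly this TwoNN-style estimator is an assumption, but the sketch illustrates the quantity under discussion.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def twonn_dimension(X):
            nn = NearestNeighbors(n_neighbors=3).fit(X)   # self plus two neighbours
            dist, _ = nn.kneighbors(X)
            mu = dist[:, 2] / dist[:, 1]                  # ratio of 2nd to 1st NN distance
            return len(mu) / np.sum(np.log(mu))           # MLE under the TwoNN model

        rng = np.random.default_rng(0)
        flat = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 50))   # 3-dim manifold in R^50
        print(twonn_dimension(flat))    # close to 3, despite 50 ambient dimensions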
    Pure Transformers are Powerful Graph Learners. (arXiv:2207.02505v1 [cs.LG])
    We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.
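    The tokenization idea can be sketched in a few lines; the dimensions, the random node identifiers, and the mean-pooling readout below are simplifications of the paper's construction (which uses orthonormal node identifiers).

        import torch
        import torch.nn as nn

        d = 64
        node_feat = torch.randn(10, d)                 # 10 nodes
        edge_index = torch.randint(0, 10, (15, 2))     # 15 edges (source, target)
        edge_feat = torch.randn(15, d)

        type_emb = nn.Embedding(2, d)                  # token type: 0 = node, 1 = edge
        node_id = torch.randn(10, d)                   # node identifiers (simplified here)

        node_tok = node_feat + type_emb(torch.zeros(10, dtype=torch.long)) + node_id
        edge_tok = (edge_feat + type_emb(torch.ones(15, dtype=torch.long))
                    + node_id[edge_index[:, 0]] + node_id[edge_index[:, 1]])

        tokens = torch.cat([node_tok, edge_tok]).unsqueeze(0)   # (1, 25, d): one token sequence
        encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
        graph_repr = encoder(tokens).mean(dim=1)                # pooled graph embedding
        print(graph_repr.shape)                                 # torch.Size([1, 64])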
    Instance-optimal PAC Algorithms for Contextual Bandits. (arXiv:2207.02357v1 [stat.ML])
    In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-$\textit{PAC}$ setting: given a policy class $\Pi$ the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
    Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation. (arXiv:2205.14141v2 [cs.CV] UPDATED)
    Masked image modeling (MIM) learns representations with remarkably good fine-tuning performances, overshadowing previous prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by a simple post-processing in the form of feature distillation (FD). The feature distillation converts the old representations to new representations that have a few desirable properties just like those representations produced by MIM. These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools. With these properties, the new representations show strong fine-tuning performance. Specifically, the contrastive self-supervised learning methods are made as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The CLIP models' fine-tuning performance is also significantly improved, with a CLIP ViT-L model reaching \textbf{89.0%} top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy on ADE20K semantic segmentation is improved by +1.5 mIoU to \textbf{61.4 mIoU}, creating a new record. More importantly, our work provides a way for future research to focus more effort on the generality and scalability of the learnt representations without being preoccupied with optimization friendliness, since it can be enhanced rather easily. The code will be available at https://github.com/SwinTransformer/Feature-Distillation.
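    The core distillation step can be sketched as follows: a frozen teacher encoder provides target features, a non-affine LayerNorm whitens them (one of the ingredients the paper associates with optimization friendliness), and a student encoder regresses them. The toy linear encoders, loss choice, and optimizer settings here are assumptions, not the paper's exact recipe.

        import torch
        import torch.nn as nn

        def feature_distillation_step(student, teacher, whiten, images, opt):
            # one FD step (sketch): student mimics whitened teacher features
            with torch.no_grad():
                target = whiten(teacher(images))   # frozen teacher features
            pred = student(images)                 # trainable student features
            loss = nn.functional.smooth_l1_loss(pred, target)
            opt.zero_grad()
            loss.backward()
            opt.step()
            return loss.item()

        # toy encoders standing in for real pre-trained backbones (assumption)
        teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256)).eval()
        student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
        whiten = nn.LayerNorm(256, elementwise_affine=False)
        opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
        loss = feature_distillation_step(student, teacher, whiten,
                                         torch.randn(8, 3, 32, 32), opt)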
    Characterizing and Mitigating the Difficulty in Training Physics-informed Artificial Neural Networks under Pointwise Constraints. (arXiv:2206.09321v2 [cs.LG] UPDATED)
    Neural networks can be used to learn the solution of partial differential equations (PDEs) on arbitrary domains without requiring a computational mesh. Common approaches integrate differential operators in training neural networks using a structured loss function. The most common training algorithm for neural networks is backpropagation, which relies on the gradient of the loss function with respect to the parameters of the network. In this work, we characterize the difficulty of training neural networks on physics by investigating the impact of differential operators in corrupting the back-propagated gradients. Particularly, we show that perturbations present in the output of a neural network model during early stages of training lead to higher levels of noise in a structured loss function that is composed of high-order differential operators. These perturbations consequently corrupt the back-propagated gradients and impede convergence. We mitigate this issue by introducing auxiliary flux parameters to obtain a system of first-order differential equations. We formulate a non-linear unconstrained optimization problem using the augmented Lagrangian method that properly constrains the boundary conditions and adaptively focuses on regions of higher gradients that are difficult to learn. We apply our approach to learn the solution of various benchmark PDE problems and demonstrate orders of magnitude improvement over existing approaches.
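    To make the flux idea concrete, here is a minimal sketch for the 1-D Poisson problem u''(x) = f(x): introducing an auxiliary flux q = u' yields a first-order system whose residual loss needs only first derivatives. The augmented-Lagrangian boundary handling from the paper is omitted, so this toy (source term, network size, sampling) is an assumption-laden simplification.

        import torch

        # recast u'' = f as the first-order system u' = q, q' = f
        net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                                  torch.nn.Linear(32, 2))  # outputs (u, q)
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        f = lambda x: torch.sin(x)  # example source term (assumption)

        for step in range(1000):
            x = (3.14159 * torch.rand(128, 1)).requires_grad_()
            u, q = net(x).split(1, dim=1)
            du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
            dq = torch.autograd.grad(q.sum(), x, create_graph=True)[0]
            # only first-order derivatives appear in the residual loss
            loss = ((du - q) ** 2).mean() + ((dq - f(x)) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()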
    Self-Normalized Density Map (SNDM) for Counting Microbiological Objects. (arXiv:2203.09474v2 [cs.CV] UPDATED)
    The statistical properties of the density map (DM) approach to counting microbiological objects on images are studied in detail. The DM is given by U$^2$-Net. Two statistical methods for deep neural networks are utilized: the bootstrap and the Monte Carlo (MC) dropout. The detailed analysis of the uncertainties for the DM predictions leads to a deeper understanding of the DM model's deficiencies. Based on our investigation, we propose a self-normalization module in the network. The improved network model, called \textit{Self-Normalized Density Map} (SNDM), can correct its output density map by itself to accurately predict the total number of objects in the image. The SNDM architecture outperforms the original model. Moreover, both statistical frameworks -- bootstrap and MC dropout -- have consistent statistical results for SNDM, which were not observed in the original model. The SNDM efficiency is comparable with detector-based models, such as the Faster and Cascade R-CNN detectors.
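    The MC-dropout side of the analysis follows a generic recipe that can be sketched briefly: keep dropout active at test time, integrate each sampled density map into a count, and use the spread across passes as the uncertainty estimate. The tiny convolutional model below is a stand-in for U$^2$-Net, not the paper's architecture.

        import torch

        def mc_dropout_count(model, image, n_samples=50):
            # generic MC-dropout count uncertainty (assumed recipe)
            model.train()  # keeps dropout layers stochastic at inference
            with torch.no_grad():
                counts = torch.stack([model(image).sum() for _ in range(n_samples)])
            return counts.mean().item(), counts.std().item()

        # toy density-map predictor standing in for U^2-Net (assumption)
        model = torch.nn.Sequential(torch.nn.Conv2d(1, 8, 3, padding=1),
                                    torch.nn.ReLU(), torch.nn.Dropout2d(0.2),
                                    torch.nn.Conv2d(8, 1, 3, padding=1))
        mean_count, std_count = mc_dropout_count(model, torch.rand(1, 1, 64, 64))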
    A Heterogeneous Graph Based Framework for Multimodal Neuroimaging Fusion Learning. (arXiv:2110.08465v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) provide powerful insights for brain neuroimaging technology from the view of graphical networks. However, most existing GNN-based models assume that the neuroimaging-produced brain connectome network is a homogeneous graph with single types of nodes and edges. In fact, emerging studies have reported and emphasized the significance of heterogeneity among human brain activities, especially between the two cerebral hemispheres. Thus, homogeneous-structured brain network-based graph methods are insufficient for modelling complicated cerebral activity states. To overcome this problem, in this paper, we present a heterogeneous graph neural network (HeBrainGNN) for multimodal brain neuroimaging fusion learning. We first model the brain network as a heterogeneous graph with multitype nodes (i.e., left and right hemispheric nodes) and multitype edges (i.e., intra- and interhemispheric edges). Then, we propose a self-supervised pretraining strategy based on a heterogeneous brain network to address the potential overfitting problem caused by the conflict between a large parameter size and a small medical data sample size. Our results show the superiority of the proposed model over other existing methods in brain-related disease prediction tasks. Ablation experiments show that our heterogeneous graph-based model attaches more importance to hemispheric connections that may be neglected by previous homogeneous graph models due to their low strength. Other experiments also indicate that our proposed model with a pretraining strategy alleviates the problem of limited labelled data and yields a significant improvement in accuracy.
    Deep Contrastive Patch-Based Subspace Learning for Camera Image Signal Processing. (arXiv:2104.00253v3 [eess.IV] UPDATED)
    Camera Image Signal Processing (ISP) pipelines, including deep learning trained versions, can achieve appealing results in different image signal processing tasks. However, most if not all of these methods tend to apply a single filter that is homogeneous over the entire image. This is also particularly true when an encoder-decoder type deep architecture is trained for the task. However, it is natural to view a camera image as heterogeneous, as the color intensity and the artificial noise are distributed vastly differently, even across the two-dimensional domain of a single image. Varied Moiré ringing, motion blur, color bleaching, or lens-based projection distortions can all potentially lead to a heterogeneous image artifact filtering problem. In this paper, we present a specific patch-based, local subspace deep neural network that improves camera ISP to be robust to heterogeneous artifacts (especially image denoising). We call our three-fold deep trained model the Patch Subspace Learning Autoencoder (PSL-AE). PSL-AE does not necessarily assume uniform image distortion levels nor repeated nor similar artifact types within the image. Rather, PSL-AE first diagnostically encodes patches extracted from noisy and clean image pairs, with different artifact types and distortion levels, by contrastive learning. Then, each image's patches are encoded into soft clusters in their appropriate latent subspace, using a prior mixture model. Lastly, the decoders of the PSL-AE are also trained in an unsupervised manner, customized for the image patches in each soft cluster. Our experimental results demonstrate the flexibility and performance that one can achieve through improved heterogeneous filtering, on both synthesized artifacts and realistic SIDD image pairs.
    Domain Adaptive Hand Keypoint and Pixel Localization in the Wild. (arXiv:2203.08344v4 [cs.CV] UPDATED)
    We aim to improve the performance of regressing hand keypoints and segmenting pixel-level hand masks under new imaging conditions (e.g., outdoors) when we only have labeled images taken under very different conditions (e.g., indoors). In the real world, it is important that the model trained for both tasks works under various imaging conditions. However, their variation covered by existing labeled hand datasets is limited. Thus, it is necessary to adapt the model trained on the labeled images (source) to unlabeled images (target) with unseen imaging conditions. While self-training domain adaptation methods (i.e., learning from the unlabeled target images in a self-supervised manner) have been developed for both tasks, their training may degrade performance when the predictions on the target images are noisy. To avoid this, it is crucial to assign a low importance (confidence) weight to the noisy predictions during self-training. In this paper, we propose to utilize the divergence of two predictions to estimate the confidence of the target image for both tasks. These predictions are given from two separate networks, and their divergence helps identify the noisy predictions. To integrate our proposed confidence estimation into self-training, we propose a teacher-student framework where the two networks (teachers) provide supervision to a network (student) for self-training, and the teachers are learned from the student by knowledge distillation. Our experiments show its superiority over state-of-the-art methods in adaptation settings with different lighting, grasping objects, backgrounds, and camera viewpoints. Our method improves the multi-task score on HO3D by 4% compared to the latest adversarial adaptation method. We also validate our method on Ego4D, egocentric videos with rapid changes in imaging conditions outdoors.
    SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics. (arXiv:2204.09424v2 [cs.LG] UPDATED)
    Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. While deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
    Predicting Kidney Transplant Survival using Multiple Feature Representations for HLAs. (arXiv:2103.03305v2 [cs.LG] UPDATED)
    Kidney transplantation can significantly enhance living standards for people suffering from end-stage renal disease. A significant factor that affects graft survival time (the time until the transplant fails and the patient requires another transplant) for kidney transplantation is the compatibility of the Human Leukocyte Antigens (HLAs) between the donor and recipient. In this paper, we propose 4 new biologically-relevant feature representations for incorporating HLA information into machine learning-based survival analysis algorithms. We evaluate our proposed HLA feature representations on a database of over 100,000 transplants and find that they improve prediction accuracy by about 1%, modest at the patient level but potentially significant at a societal level. Accurate prediction of survival times can improve transplant survival outcomes, enabling better allocation of donors to recipients and reducing the number of re-transplants due to graft failure with poorly matched donors.
    Variational Flow Graphical Model. (arXiv:2207.02722v1 [stat.ML])
    This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named the Variational Flow Graphical (VFG) Model. VFGs learn the representation of high-dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, which usually require a high-dimensional latent space involving many trivial variables. Aggregation nodes are introduced in the VFG models to integrate forward-backward hierarchical information via a message-passing scheme. Maximizing the evidence lower bound (ELBO) of the data likelihood aligns the forward and backward messages in each aggregation node, achieving a consistent node state. Algorithms have been developed to learn model parameters through gradient updating with respect to the ELBO objective. The consistency of aggregation nodes enables VFGs to be applicable to tractable inference on graphical structures. Besides representation learning and numerical inference, VFGs provide a new approach for distribution modeling on datasets with graphical latent structures. Additionally, a theoretical study shows that VFGs are universal approximators by leveraging the implicitly invertible flow-based structures. With flexible graphical structures and superior expressive power, VFGs could potentially be used to improve probabilistic inference. In the experiments, VFGs achieve improved ELBO and likelihood values on multiple datasets.
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v1 [cs.LG])
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct \emph{PAC prediction sets}, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.
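    For context, the standard (non-meta) split-conformal construction of a prediction set looks like the sketch below: calibrate a score threshold on held-out data so that the set contains the true label with probability at least 1-alpha. The paper's contribution is extending a PAC-style guarantee of this kind across tasks, which this plain sketch does not do.

        import numpy as np

        def conformal_prediction_set(cal_scores, cal_labels, test_scores, alpha=0.1):
            # plain split-conformal sets from softmax scores (baseline sketch)
            n = len(cal_labels)
            # nonconformity = 1 - softmax probability of the true class
            nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
            qhat = np.quantile(nonconf, np.ceil((n + 1) * (1 - alpha)) / n)
            return [np.where(1.0 - s <= qhat)[0] for s in test_scores]

        rng = np.random.default_rng(0)
        cal = rng.dirichlet(np.ones(10), size=500)   # stand-in softmax outputs
        labels = rng.integers(0, 10, size=500)
        sets = conformal_prediction_set(cal, labels, rng.dirichlet(np.ones(10), 5))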
    Enabling Fast Deep Learning on Tiny Energy-Harvesting IoT Devices. (arXiv:2111.14051v3 [cs.LG] UPDATED)
    Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for enabling sustainable smart applications. Nevertheless, implementing those computation and memory-intensive intelligent algorithms on EH devices is extremely difficult due to the challenges of limited resources and intermittent power supply that causes frequent failures. To address those challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose $RAD$, a resource-aware structured DNN training framework, which employs block circulant matrix and structured pruning to achieve high compression for leveraging the advantage of various vector operation accelerators. A DNN implementation method, $ACE$, is then proposed that employs low-energy accelerators to attain maximum performance with small energy consumption. Finally, we further design $FLEX$, the system support for intermittent computation in energy harvesting situations. Experimental results from three different DNN models demonstrate that $RAD$, $ACE$, and $FLEX$ can enable fast and correct inference on energy harvesting devices, with up to 4.26X runtime reduction and up to 7.7X energy reduction, with higher accuracy than the state-of-the-art.
    Fast Sparse Decision Tree Optimization via Reference Ensembles. (arXiv:2112.00798v7 [cs.LG] UPDATED)
    Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960s, breakthroughs have only been made on the problem within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude, while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.
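    The gist of the threshold-guessing idea can be sketched as follows: fit a boosted reference ensemble and keep only the cut-points it actually uses, so the downstream optimal-tree search considers far fewer binning candidates. The ensemble choice and hyperparameters below are placeholders, not the paper's configuration.

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier

        def guess_thresholds(X, y, n_estimators=40):
            # harvest split thresholds from a reference ensemble (sketch)
            gbm = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=2)
            gbm.fit(X, y)
            thresholds = [set() for _ in range(X.shape[1])]
            for stage in gbm.estimators_:
                tree = stage[0].tree_
                for feat, thr in zip(tree.feature, tree.threshold):
                    if feat >= 0:  # negative feature ids mark leaves
                        thresholds[feat].add(thr)
            return [sorted(t) for t in thresholds]

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 4))
        y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)
        print([len(t) for t in guess_thresholds(X, y)])  # few cut-points per feature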
    Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods. (arXiv:2207.02829v1 [math.OC])
    Online optimization is a well-established optimization paradigm that aims to make a sequence of correct decisions given knowledge of the correct answer to previous decision tasks. Bilevel programming involves a hierarchical optimization problem where the feasible region of the so-called outer problem is restricted by the graph of the solution set mapping of the inner problem. This paper brings these two ideas together and studies an online bilevel optimization setting in which a sequence of time-varying bilevel problems are revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we introduce new notions of bilevel regret, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and provide regret bounds in terms of the path-length of the inner and outer minimizer sequences.
    AutoSpeed: A Linked Autoencoder Approach for Pulse-Echo Speed-of-Sound Imaging for Medical Ultrasound. (arXiv:2207.02392v1 [eess.IV])
    Quantitative ultrasound, e.g., speed-of-sound (SoS) in tissues, provides information about tissue properties that have diagnostic value. Recent studies showed the possibility of extracting SoS information from pulse-echo ultrasound raw data (a.k.a. RF data) using deep neural networks that are fully trained on simulated data. These methods take sensor domain data, i.e., RF data, as input and train a network in an end-to-end fashion to learn the implicit mapping between the RF data domain and SoS domain. However, such networks are prone to overfitting to simulated data which results in poor performance and instability when tested on measured data. We propose a novel method for SoS mapping employing learned representations from two linked autoencoders. We test our approach on simulated and measured data acquired from human breast mimicking phantoms. We show that SoS mapping is possible using linked autoencoders. The proposed method has a Mean Absolute Percentage Error (MAPE) of 2.39% on the simulated data. On the measured data, the predictions of the proposed method are close to the expected values with MAPE of 1.1%. Compared to an end-to-end trained network, the proposed method shows higher stability and reproducibility.
    TractoFormer: A Novel Fiber-level Whole Brain Tractography Analysis Framework Using Spectral Embedding and Vision Transformers. (arXiv:2207.02327v1 [eess.IV])
    Diffusion MRI tractography is an advanced imaging technique for quantitative mapping of the brain's structural connectivity. Whole brain tractography (WBT) data contains hundreds of thousands of individual fiber streamlines (estimated brain connections), and this data is usually parcellated to create compact representations for data analysis applications such as disease classification. In this paper, we propose a novel parcellation-free WBT analysis framework, TractoFormer, that leverages tractography information at the level of individual fiber streamlines and provides a natural mechanism for interpretation of results using the attention mechanism of transformers. TractoFormer includes two main contributions. First, we propose a novel and simple 2D image representation of WBT, TractoEmbedding, to encode 3D fiber spatial relationships and any feature of interest that can be computed from individual fibers (such as FA or MD). Second, we design a network based on vision transformers (ViTs) that includes: 1) data augmentation to overcome model overfitting on small datasets, 2) identification of discriminative fibers for interpretation of results, and 3) ensemble learning to leverage fiber information from different brain regions. In a synthetic data experiment, TractoFormer successfully identifies discriminative fibers with simulated group differences. In a disease classification experiment comparing several methods, TractoFormer achieves the highest accuracy in classifying schizophrenia vs control. Discriminative fibers are identified in left hemispheric frontal and parietal superficial white matter regions, which have previously been shown to be affected in schizophrenia patients.
    Ordinal Regression via Binary Preference vs Simple Regression: Statistical and Experimental Perspectives. (arXiv:2207.02454v1 [cs.LG])
    Ordinal regression with anchored reference samples (ORARS) has been proposed for automatically predicting the subjective Mean Opinion Score (MOS) of input stimuli. ORARS addresses the MOS prediction problem by pairing a test sample with each of the pre-scored anchored reference samples. A trained binary classifier is then used to predict which sample, test or anchor, is statistically better. Posteriors of the binary preference decision are then used to predict the MOS of the test sample. In this paper, we present a rigorous framework, analysis, and experiments to demonstrate that ORARS is advantageous over simple regression. The contributions of this work are: 1) we show that traditional regression can be reformulated into multiple preference tests to yield better performance, which is confirmed experimentally with simulations; 2) we generalize ORARS to other regression problems and verify its effectiveness; and 3) we provide some prerequisite conditions that can ensure proper application of ORARS.
    Effective and Efficient Training for Sequential Recommendation using Recency Sampling. (arXiv:2207.02643v1 [cs.IR])
    Many modern sequential recommender systems use deep neural networks, which can effectively estimate the relevance of items but require a lot of time to train. Slow training increases expenses, hinders product development timescales and prevents the model from being regularly updated to adapt to changing user preferences. Training such sequential models involves appropriately sampling past user interactions to create a realistic training objective. The existing training objectives have limitations. For instance, next item prediction never uses the beginning of the sequence as a learning target, thereby potentially discarding valuable data. On the other hand, the item masking used by BERT4Rec is only weakly related to the goal of the sequential recommendation; therefore, it requires much more time to obtain an effective model. Hence, we propose a novel Recency-based Sampling of Sequences training objective that addresses both limitations. We apply our method to various recent and state-of-the-art model architectures - such as GRU4Rec, Caser, and SASRec. We show that the models enhanced with our method can achieve performance exceeding or very close to the state-of-the-art BERT4Rec, but with much less training time.
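    A rough sketch of what recency-based sampling of target positions could look like: any position in the interaction history may become a training target, but more recent positions are exponentially more likely. The exponential decay and its alpha knob are our assumptions, not the paper's exact scheme.

        import numpy as np

        def sample_recency_targets(seq_len, n_targets, alpha=0.8, rng=None):
            # weight positions so the newest interaction gets weight 1 and
            # older ones decay geometrically (assumed decay form)
            rng = rng or np.random.default_rng()
            weights = alpha ** np.arange(seq_len - 1, -1, -1)
            probs = weights / weights.sum()
            return rng.choice(seq_len, size=n_targets, replace=False, p=probs)

        # for a 20-interaction history, draw 5 target positions
        print(sample_recency_targets(20, 5, rng=np.random.default_rng(0)))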
    Tractable Dendritic RNNs for Reconstructing Nonlinear Dynamical Systems. (arXiv:2207.02542v1 [cs.LG])
    In many scientific disciplines, we are interested in inferring the nonlinear dynamical system underlying a set of observed time series, a challenging task in the face of chaotic behavior and noise. Previous deep learning approaches toward this goal often suffered from a lack of interpretability and tractability. In particular, the high-dimensional latent spaces often required for a faithful embedding, even when the underlying dynamics lives on a lower-dimensional manifold, can hamper theoretical analysis. Motivated by the emerging principles of dendritic computation, we augment a dynamically interpretable and mathematically tractable piecewise-linear (PL) recurrent neural network (RNN) by a linear spline basis expansion. We show that this approach retains all the theoretically appealing properties of the simple PLRNN, yet boosts its capacity for approximating arbitrary nonlinear dynamical systems in comparatively low dimensions. We employ two frameworks for training the system, one combining back-propagation-through-time (BPTT) with teacher forcing, and another based on fast and scalable variational inference. We show that the dendritically expanded PLRNN achieves better reconstructions with fewer parameters and dimensions on various dynamical systems benchmarks and compares favorably to other methods, while retaining a tractable and interpretable structure.
    Ensemble feature selection with clustering for analysis of high-dimensional, correlated clinical data in the search for Alzheimer's disease biomarkers. (arXiv:2207.02380v1 [cs.LG])
    Healthcare datasets often contain groups of highly correlated features, such as features from the same biological system. When feature selection is applied to these datasets to identify the most important features, the biases inherent in some multivariate feature selectors due to correlated features make it difficult for these methods to distinguish between important and irrelevant features, and the results of the feature selection process can be unstable. Feature selection ensembles, which aggregate the results of multiple individual base feature selectors, have been investigated as a means of stabilising feature selection results, but do not address the problem of correlated features. We present a novel framework to create feature selection ensembles from multivariate feature selectors while taking into account the biases produced by groups of correlated features, using agglomerative hierarchical clustering in a pre-processing step. These methods were applied to two real-world datasets from studies of Alzheimer's disease (AD), a progressive neurodegenerative disease that has no cure and is not yet fully understood. Our results show a marked improvement in the stability of features selected over the models without clustering, and the features selected by these models are in keeping with the findings in the AD literature.
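    The clustering pre-processing step might look like the following sketch, which groups features by correlation distance before the base selectors are run; the average-linkage choice and 0.3 cut height are arbitrary assumptions for illustration.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage

        def cluster_correlated_features(X, threshold=0.3):
            # agglomerative clustering on correlation distance (sketch)
            corr = np.corrcoef(X, rowvar=False)
            dist = 1.0 - np.abs(corr)                 # correlation distance
            iu = np.triu_indices_from(dist, k=1)      # condensed upper triangle
            Z = linkage(dist[iu], method="average")
            return fcluster(Z, t=threshold, criterion="distance")

        rng = np.random.default_rng(0)
        base = rng.normal(size=(200, 3))
        # duplicate each feature with small noise to create correlated groups
        X = np.hstack([base + 0.05 * rng.normal(size=(200, 3)) for _ in range(2)])
        print(cluster_correlated_features(X))  # duplicates share a cluster id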
    Strong Heuristics for Named Entity Linking. (arXiv:2207.02824v1 [cs.CL])
    Named entity linking (NEL) in news is a challenging endeavour due to the frequency of unseen and emerging entities, which necessitates the use of unsupervised or zero-shot methods. However, such methods tend to come with caveats, such as no integration of suitable knowledge bases (like Wikidata) for emerging entities, a lack of scalability, and poor interpretability. Here, we consider person disambiguation in Quotebank, a massive corpus of speaker-attributed quotations from the news, and investigate the suitability of intuitive, lightweight, and scalable heuristics for NEL in web-scale corpora. Our best performing heuristic disambiguates 94% and 63% of the mentions on Quotebank and the AIDA-CoNLL benchmark, respectively. Additionally, the proposed heuristics compare favourably to the state-of-the-art unsupervised and zero-shot methods, Eigenthemes and mGENRE, respectively, thereby serving as strong baselines for unsupervised and zero-shot entity linking.
    Rethinking the Importance of Sampling in Physics-informed Neural Networks. (arXiv:2207.02338v1 [cs.LG])
    Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving partial differential equations (PDEs) in a variety of domains. While previous research in PINNs has mainly focused on constructing and balancing loss functions during training to avoid poor minima, the effect of sampling collocation points on the performance of PINNs has largely been overlooked. In this work, we find that the performance of PINNs can vary significantly with different sampling strategies, and using a fixed set of collocation points can be quite detrimental to the convergence of PINNs to the correct solution. In particular, (1) we hypothesize that the training of PINNs relies on successful "propagation" of the solution from initial and/or boundary condition points to interior points, and PINNs with poor sampling strategies can get stuck at trivial solutions if there are \textit{propagation failures}. (2) We demonstrate that propagation failures are characterized by highly imbalanced PDE residual fields where very high residuals are observed over very narrow regions. (3) To mitigate propagation failure, we propose a novel \textit{evolutionary sampling} (Evo) method that can incrementally accumulate collocation points in regions of high PDE residuals. We further provide an extension of Evo to respect the principle of causality while solving time-dependent PDEs. We empirically demonstrate the efficacy and efficiency of our proposed methods in a variety of PDE problems.
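    A simplified sketch of the evolutionary-sampling loop: retain the collocation points with the highest PDE residual and refill the rest uniformly, so points accumulate where the residual field is large. The retain fraction and toy residual field below are assumptions, not the paper's exact algorithm.

        import torch

        def evolve_collocation(points, residual_fn, keep_frac=0.5, domain=(0.0, 1.0)):
            # keep high-residual points, resample the remainder uniformly
            with torch.no_grad():
                res = residual_fn(points).abs().squeeze()
            n_keep = int(keep_frac * len(points))
            keep = points[res.argsort(descending=True)[:n_keep]]
            fresh = torch.rand(len(points) - n_keep, points.shape[1])
            fresh = domain[0] + (domain[1] - domain[0]) * fresh
            return torch.cat([keep, fresh], dim=0)

        # toy residual field peaked near x = 0.5 (stand-in for a PDE residual)
        pts = torch.rand(256, 1)
        pts = evolve_collocation(pts, lambda x: torch.exp(-100 * (x - 0.5) ** 2))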
    Quantitative Assessment of DESIS Hyperspectral Data for Plant Biodiversity Estimation in Australia. (arXiv:2207.02482v1 [cs.LG])
    Diversity of terrestrial plants plays a key role in maintaining a stable, healthy, and productive ecosystem. Though remote sensing has been seen as a promising and cost-effective proxy for estimating plant diversity, there is a lack of quantitative studies on how confidently plant diversity can be inferred from spaceborne hyperspectral data. In this study, we assessed the ability of hyperspectral data captured by the DLR Earth Sensing Imaging Spectrometer (DESIS) for estimating plant species richness in the Southern Tablelands and Snowy Mountains regions in southeast Australia. Spectral features were first extracted from DESIS spectra with principal component analysis, canonical correlation analysis, and partial least squares analysis. Then regression was conducted between the extracted features and plant species richness with ordinary least squares regression, kernel ridge regression, and Gaussian process regression. Results were assessed with the coefficient of correlation ($r$) and Root-Mean-Square Error (RMSE), based on a two-fold cross validation scheme. With the best performing model, $r$ is 0.71 and RMSE is 5.99 for the Southern Tablelands region, while $r$ is 0.62 and RMSE is 6.20 for the Snowy Mountains region. The assessment results reported in this study provide support for future studies on understanding the relationship between spaceborne hyperspectral measurements and terrestrial plant biodiversity.
    Cooperative Distribution Alignment via JSD Upper Bound. (arXiv:2207.02286v1 [cs.LG])
    Unsupervised distribution alignment estimates a transformation that maps two or more source distributions to a shared aligned distribution given only samples from each distribution. This task has many applications including generative modeling, unsupervised domain adaptation, and socially aware learning. Most prior works use adversarial learning (i.e., min-max optimization), which can be challenging to optimize and evaluate. A few recent works explore non-adversarial flow-based (i.e., invertible) approaches, but they lack a unified perspective and are limited in efficiently aligning multiple distributions. Therefore, we propose to unify and generalize previous flow-based approaches under a single non-adversarial framework, which we prove is equivalent to minimizing an upper bound on the Jensen-Shannon Divergence (JSD). Importantly, our problem reduces to a min-min, i.e., cooperative, problem and can provide a natural evaluation metric for unsupervised distribution alignment. We present empirical results of our framework on both simulated and real-world datasets to demonstrate the benefits of our approach.
    Composite FORCE learning of chaotic echo state networks for time-series prediction. (arXiv:2207.02420v1 [cs.LG])
    An echo state network (ESN), a kind of recurrent neural network, consists of a fixed reservoir in which neurons are connected randomly and recurrently, and obtains the desired output only by training output connection weights. First-order reduced and controlled error (FORCE) learning is an online supervised training approach that can change the chaotic activity of ESNs into specified activity patterns. This paper proposes a composite FORCE learning method based on recursive least squares to train ESNs whose initial activity is spontaneously chaotic, where a composite learning technique featuring dynamic regressor extension and memory data exploitation is applied to enhance parameter convergence. The proposed method is applied to a benchmark problem of predicting chaotic time series generated by the Mackey-Glass system, and numerical results show that it significantly improves learning and prediction performance compared with existing methods.
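    Plain FORCE learning updates the readout weights with recursive least squares while the reservoir runs; the sketch below shows that baseline. The paper's composite variant, with dynamic regressor extension and memory data exploitation, is omitted, and the reservoir scaling and target signal here are assumptions.

        import numpy as np

        def force_rls_step(w, P, r, target):
            # one standard RLS update of the readout weights w for state r
            Pr = P @ r
            k = Pr / (1.0 + r @ Pr)        # RLS gain vector
            e = w @ r - target             # error before the update
            return w - e * k, P - np.outer(k, Pr)

        n = 200
        rng = np.random.default_rng(0)
        W = rng.normal(scale=1.5 / np.sqrt(n), size=(n, n))  # chaotic reservoir
        wf = rng.uniform(-1, 1, size=n)                      # fixed feedback weights
        w, P, r = np.zeros(n), np.eye(n), rng.normal(size=n)
        for t in range(2000):
            z = w @ r                                        # current readout
            r = np.tanh(W @ r + wf * z)                      # reservoir step with feedback
            w, P = force_rls_step(w, P, r, np.sin(0.02 * t)) # track a sine target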
    Private Matrix Approximation and Geometry of Unitary Orbits. (arXiv:2207.02794v1 [cs.DS])
    Consider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
    Predicting is not Understanding: Recognizing and Addressing Underspecification in Machine Learning. (arXiv:2207.02598v1 [cs.LG])
    Machine learning (ML) models are typically optimized for their accuracy on a given dataset. However, this predictive criterion rarely captures all desirable properties of a model, in particular how well it matches a domain expert's understanding of a task. Underspecification refers to the existence of multiple models that are indistinguishable in their in-domain accuracy, even though they differ in other desirable properties such as out-of-distribution (OOD) performance. Identifying these situations is critical for assessing the reliability of ML models. We formalize the concept of underspecification and propose a method to identify and partially address it. We train multiple models with an independence constraint that forces them to implement different functions. They discover predictive features that are otherwise ignored by standard empirical risk minimization (ERM), which we then distill into a global model with superior OOD performance. Importantly, we constrain the models to align with the data manifold to ensure that they discover meaningful features. We demonstrate the method on multiple datasets in computer vision (collages, WILDS-Camelyon17, GQA) and discuss general implications of underspecification. Most notably, in-domain performance cannot serve as a basis for OOD model selection without additional assumptions.
    Unified Embeddings of Structural and Functional Connectome via a Function-Constrained Structural Graph Variational Auto-Encoder. (arXiv:2207.02328v1 [q-bio.NC])
    Graph theoretical analyses have become standard tools in modeling functional and anatomical connectivity in the brain. With the advent of connectomics, the primary graphs or networks of interest are structural connectome (derived from DTI tractography) and functional connectome (derived from resting-state fMRI). However, most published connectome studies have focused on either structural or functional connectome, yet complementary information between them, when available in the same dataset, can be jointly leveraged to improve our understanding of the brain. To this end, we propose a function-constrained structural graph variational autoencoder (FCS-GVAE) capable of incorporating information from both functional and structural connectome in an unsupervised fashion. This leads to a joint low-dimensional embedding that establishes a unified spatial coordinate system for comparing across different subjects. We evaluate our approach using the publicly available OASIS-3 Alzheimer's disease (AD) dataset and show that a variational formulation is necessary to optimally encode functional brain dynamics. Further, the proposed joint embedding approach can more accurately distinguish different patient sub-populations than approaches that do not use complementary connectome information.
    Multi-Contrast MRI Segmentation Trained on Synthetic Images. (arXiv:2207.02469v1 [eess.IV])
    In our comprehensive experiments and evaluations, we show that it is possible to generate multiple contrasts (even all of them synthetically) and use the synthetically generated images to train an image segmentation engine. We show promising segmentation results tested on real multi-contrast MRI scans when delineating muscle, fat, bone and bone marrow, all trained on synthetic images. Based on synthetic image training, our segmentation results were as high as 93.91\%, 94.11\%, 91.63\%, and 95.33\% for muscle, fat, bone, and bone marrow delineation, respectively. Results were not significantly different from the ones obtained when real images were used for segmentation training: 94.68\%, 94.67\%, 95.91\%, and 96.82\%, respectively.
    When does SGD favor flat minima? A quantitative characterization via linear stability. (arXiv:2207.02628v1 [stat.ML])
    The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding implicit regularization of SGD and guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the flatness -- as measured by the Frobenius norm of the Hessian -- is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular geometry awareness of SGD noise: 1) the noise magnitude is proportional to loss value; 2) the noise directions concentrate in the sharp directions of local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.
    Compositional Generalization in Grounded Language Learning via Induced Model Sparsity. (arXiv:2207.02518v1 [cs.CL])
    We provide a study of how induced model sparsity can help achieve compositional generalization and better sample efficiency in grounded language learning problems. We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations. We show that standard neural architectures do not always yield compositional generalization. To address this, we design an agent that contains a goal identification module that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal. The output of the goal identification module is the input to a value iteration network planner. Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations. We examine the internal representations of our agent and find the correct correspondences between words in its dictionary and attributes in the environment.
    Ultra-Low-Bitrate Speech Coding with Pretrained Transformers. (arXiv:2207.02262v1 [cs.SD])
    Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable or better than that of conventional codecs operating at three to four times the rate.
    voxel2vec: A Natural Language Processing Approach to Learning Distributed Representations for Scientific Data. (arXiv:2207.02565v1 [cs.LG])
    Relationships in scientific data, such as the numerical and spatial distribution relations of features in univariate data, the scalar-value combinations' relations in multivariate data, and the association of volumes in time-varying and ensemble data, are intricate and complex. This paper presents voxel2vec, a novel unsupervised representation learning model, which is used to learn distributed representations of scalar values/scalar-value combinations in a low-dimensional vector space. Its basic assumption is that if two scalar values/scalar-value combinations have similar contexts, they usually have high similarity in terms of features. By representing scalar values/scalar-value combinations as symbols, voxel2vec learns the similarity between them in the context of spatial distribution and then allows us to explore the overall association between volumes by transfer prediction. We demonstrate the usefulness and effectiveness of voxel2vec by comparing it with the isosurface similarity map of univariate data and applying the learned distributed representations to feature classification for multivariate data and to association analysis for time-varying and ensemble data.
    Query-Efficient Adversarial Attack Based on Latin Hypercube Sampling. (arXiv:2207.02391v1 [cs.CV])
    In order to be applicable in real-world scenarios, Boundary Attacks (BAs) were proposed and ensure a one hundred percent attack success rate with only decision information. However, existing BA methods craft adversarial examples by leveraging simple random sampling (SRS) to estimate the gradient, consuming a large number of model queries. To overcome the drawback of SRS, this paper proposes a Latin Hypercube Sampling based Boundary Attack (LHS-BA) to save query budget. Compared with SRS, LHS has better uniformity under the same limited number of random samples. Therefore, the average over these random samples is closer to the true gradient than that estimated by SRS. Various experiments are conducted on benchmark datasets including MNIST, CIFAR, and ImageNet-1K. Experimental results demonstrate the superiority of the proposed LHS-BA over the state-of-the-art BA methods in terms of query efficiency. The source codes are publicly available at https://github.com/GZHU-DVL/LHS-BA.
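    The uniformity claim is easy to check numerically: under the same sample budget, a Latin hypercube design has lower discrepancy (a standard uniformity measure) than simple random sampling, which is why the averaged direction is a lower-variance gradient estimate. A short sketch using SciPy's quasi-Monte Carlo module, with dimensions and budget chosen arbitrarily:

        import numpy as np
        from scipy.stats import qmc

        d, n = 8, 64
        lhs = qmc.LatinHypercube(d=d, seed=0).random(n)   # LHS in [0, 1]^d
        srs = np.random.default_rng(0).random((n, d))     # SRS in [0, 1]^d

        # discrepancy: lower means more uniform coverage of the hypercube
        print(qmc.discrepancy(lhs), qmc.discrepancy(srs))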
    Distillation to Enhance the Portability of Risk Models Across Institutions with Large Patient Claims Database. (arXiv:2207.02445v1 [cs.LG])
    Artificial intelligence, and particularly machine learning (ML), is increasingly developed and deployed to support healthcare in a variety of settings. However, clinical decision support (CDS) technologies based on ML need to be portable if they are to be adopted on a broad scale. In this respect, models developed at one institution should be reusable at another. Yet there are numerous examples of portability failure, particularly due to naive application of ML models. Portability failure can lead to suboptimal care and medical errors, which ultimately could prevent the adoption of ML-based CDS in practice. One specific healthcare challenge that could benefit from enhanced portability is the prediction of 30-day readmission risk. Research to date has shown that deep learning models can be effective at modeling such risk. In this work, we investigate the practicality of model portability through a cross-site evaluation of readmission prediction models. To do so, we apply a recurrent neural network, augmented with self-attention and blended with expert features, to build readmission prediction models for two independent large-scale claims datasets. We further present a novel transfer learning technique that adapts the well-known method of born-again network (BAN) training. Our experiments show that direct application of ML models trained at one institution and tested at another institution performs worse than models trained and tested at the same institution. We further show that the transfer learning approach based on the BAN produces models that are better than those trained on just a single institution's data. Notably, this improvement is consistent across both sites and occurs after a single retraining, which illustrates the potential for a cheap and general model transfer mechanism of readmission risk prediction.
    Generalization to translation shifts: a study in architectures and augmentations. (arXiv:2207.02349v1 [cs.CV])
    We provide a detailed evaluation of various image classification architectures (convolutional, vision transformer, and fully connected MLP networks) and data augmentation techniques towards generalization to large spatial translation shifts. We make the following observations: (a) In the absence of data augmentation, all architectures, including convolutional networks suffer degradation in performance when evaluated on translated test distributions. Understandably, both the in-distribution accuracy as well as degradation to shifts is significantly worse for non-convolutional architectures. (b) Across all architectures, even a minimal augmentation of $4$ pixel random crop improves the robustness of performance to much larger magnitude shifts of up to $1/4$ of image size ($8$-$16$ pixels) in the test data -- suggesting a form of meta generalization from augmentation. For non-convolutional architectures, while the absolute accuracy is still low, we see dramatic improvements in robustness to large translation shifts. (c) With sufficiently advanced augmentation ($4$ pixel crop+RandAugmentation+Erasing+MixUp) pipeline all architectures can be trained to have competitive performance, both in terms of in-distribution accuracy as well as generalization to large translation shifts.
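    In torchvision terms, the two augmentation levels discussed above might look like the sketch below (CIFAR-style 32x32 images are assumed; MixUp lives in the training loop, so it is omitted):

        import torchvision.transforms as T

        # minimal augmentation from observation (b): 4-pixel random crop
        minimal = T.Compose([T.RandomCrop(32, padding=4), T.ToTensor()])

        # advanced pipeline from observation (c): crop + RandAugment + erasing
        advanced = T.Compose([
            T.RandomCrop(32, padding=4),
            T.RandAugment(),                  # torchvision >= 0.11
            T.ToTensor(),
            T.RandomErasing(p=0.5),           # operates on tensors, after ToTensor
        ])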
    Improving Trustworthiness of AI Disease Severity Rating in Medical Imaging with Ordinal Conformal Prediction Sets. (arXiv:2207.02238v1 [cs.LG])
    The regulatory approval and broad clinical deployment of medical AI have been hampered by the perception that deep learning models fail in unpredictable and possibly catastrophic ways. A lack of statistically rigorous uncertainty quantification is a significant factor undermining trust in AI results. Recent developments in distribution-free uncertainty quantification present practical solutions for these issues by providing reliability guarantees for black-box models on arbitrary data distributions as formally valid finite-sample prediction intervals. Our work applies these new uncertainty quantification methods -- specifically conformal prediction -- to a deep-learning model for grading the severity of spinal stenosis in lumbar spine MRI. We demonstrate a technique for forming ordinal prediction sets that are guaranteed to contain the correct stenosis severity within a user-defined probability (confidence interval). On a dataset of 409 MRI exams processed by the deep-learning model, the conformal method provides tight coverage with small prediction set sizes. Furthermore, we explore the potential clinical applicability of flagging cases with high uncertainty predictions (large prediction sets) by quantifying an increase in the prevalence of significant imaging abnormalities (e.g. motion artifacts, metallic artifacts, and tumors) that could degrade confidence in predictive performance when compared to a random sample of cases.
    Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia. (arXiv:2207.02253v1 [cs.CL])
    While neural networks demonstrate a remarkable ability to model linguistic content, capturing contextual information related to a speaker's conversational role is an open area of research. In this work, we analyze the effect of speaker role on language use through the game of Mafia, in which participants are assigned either an honest or a deceptive role. In addition to building a framework to collect a dataset of Mafia game records, we demonstrate that there are differences in the language produced by players with different roles. We confirm that classification models are able to rank deceptive players as more suspicious than honest ones based only on their use of language. Furthermore, we show that training models on two auxiliary tasks outperforms a standard BERT-based text classification approach. We also present methods for using our trained models to identify features that distinguish between player roles, which could be used to assist players during the Mafia game.
    Information Compression and Performance Evaluation of Tic-Tac-Toe's Evaluation Function Using Singular Value Decomposition. (arXiv:2207.02449v1 [cs.LG])
    We approximated the evaluation function for the game Tic-Tac-Toe by singular value decomposition (SVD) and investigated the effect of approximation accuracy on winning rate. We first prepared the perfect evaluation function of Tic-Tac-Toe and performed low-rank approximation by considering the evaluation function as a ninth-order tensor. We found that we can reduce the amount of information of the evaluation function by 70% without significantly degrading the performance. Approximation accuracy and winning rate were strongly correlated but not perfectly proportional. We also investigated how the decomposition method of the evaluation function affects the performance. We considered two decomposition methods: simple SVD regarding the evaluation function as a matrix and the Tucker decomposition by higher-order SVD (HOSVD). At the same compression ratio, the strategy with the approximated evaluation function obtained by HOSVD exhibited a significantly higher winning rate than that obtained by SVD. These results suggest that SVD can effectively compress board game strategies and an optimal compression method that depends on the game exists.
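    The matrix-SVD variant of this compression is easy to sketch: flatten the ninth-order board-state tensor into a matrix, truncate singular values, and compare storage against reconstruction error. The values below are random stand-ins for the true evaluation function, and the 243x81 reshape and rank choice are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        V = rng.normal(size=(3 ** 5, 3 ** 4))        # 9 ternary cells -> 243 x 81
        U, s, Vt = np.linalg.svd(V, full_matrices=False)
        rank = 24                                    # keep ~30% of 81 components
        V_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # storage for truncated factors vs. the full matrix, and relative error
        stored = rank * (V.shape[0] + V.shape[1] + 1)
        print(stored / V.size, np.linalg.norm(V - V_hat) / np.linalg.norm(V))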
    Many-body localized hidden Born machine. (arXiv:2207.02346v1 [quant-ph])
    Born Machines are quantum-inspired generative models that leverage the probabilistic nature of quantum states. Here, we present a new architecture called many-body localized (MBL) hidden Born machine that uses both MBL dynamics and hidden units as learning resources. We theoretically prove that MBL Born machines possess more expressive power than classical models, and the introduction of hidden units boosts its learning power. We numerically demonstrate that the MBL hidden Born machine is capable of learning a toy dataset consisting of patterns of MNIST handwritten digits, quantum data obtained from quantum many-body states, and non-local parity data. In order to understand the mechanism behind learning, we track physical quantities such as von Neumann entanglement entropy and Hamming distance during learning, and compare the learning outcomes in the MBL, thermal, and Anderson localized phases. We show that the superior learning power of the MBL phase relies importantly on both localization and interaction. Our architecture and algorithm provide novel strategies of utilizing quantum many-body systems as learning resources, and reveal a powerful connection between disorder, interaction, and learning in quantum systems.
    OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning. (arXiv:2207.02261v1 [cs.CV])
    Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning. Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data. One common assumption in most SSL methods is that the labeled and unlabeled data are from the same underlying data distribution. However, this is hardly the case in many real-world scenarios, which limits their applicability. In this work, instead, we attempt to solve the recently proposed challenging open-world SSL problem that does not make such an assumption. In the open-world SSL problem, the objective is to recognize samples of known classes, and simultaneously detect and cluster samples belonging to novel classes present in unlabeled data. This work introduces OpenLDN that utilizes a pairwise similarity loss to discover novel classes. Using a bi-level optimization rule this pairwise similarity loss exploits the information available in the labeled set to implicitly cluster novel class samples, while simultaneously recognizing samples from known classes. After discovering novel classes, OpenLDN transforms the open-world SSL problem into a standard SSL problem to achieve additional performance gains using existing SSL methods. Our extensive experiments demonstrate that OpenLDN outperforms the current state-of-the-art methods on multiple popular classification benchmarks while providing a better accuracy/training time trade-off.
    GAMa: Cross-view Video Geo-localization. (arXiv:2207.02431v1 [cs.CV])
    The existing work in cross-view geo-localization is based on images, where a ground panorama is matched to an aerial image. In this work, we focus on ground videos instead of images, which provide additional contextual cues important for this task. There are no existing datasets for this problem; therefore, we propose the GAMa dataset, a large-scale dataset with ground videos and corresponding aerial images. We also propose a novel approach to solve this problem. At the clip level, a short video clip is matched with a corresponding aerial image and is later used to obtain video-level geo-localization of a long video. Moreover, we propose a hierarchical approach to further improve the clip-level geo-localization. The dataset is challenging, with unaligned imagery and a limited field of view; our proposed method achieves Top-1 recall rates of 19.4% and 45.1% @1.0mile. Code and dataset are available at the following link: https://github.com/svyas23/GAMa.  ( 2 min )
    Guiding Machine Perception with Psychophysics. (arXiv:2207.02241v1 [cs.CV])
    Gustav Fechner's 1860 delineation of psychophysics, the measurement of sensation in relation to its stimulus, is widely considered to be the advent of modern psychological science. In psychophysics, a researcher parametrically varies some aspects of a stimulus, and measures the resulting changes in a human subject's experience of that stimulus; doing so gives insight into the determining relationship between a sensation and the physical input that evoked it. This approach is used heavily in perceptual domains, including signal detection, threshold measurement, and ideal observer analysis. Scientific fields like vision science have always leaned heavily on the methods and procedures of psychophysics, but there is now growing appreciation of them by machine learning researchers, sparked by widening overlap between biological and artificial perception [rojas2011automatic, scheirer2014perceptual, escalera2014chalearn, zhang2018agil, grieggs2021measuring]. Machine perception that is guided by behavioral measurements, as opposed to guidance restricted to arbitrarily assigned human labels, has significant potential to fuel further progress in artificial intelligence.  ( 2 min )
    EEPT: Early Discovery of Emerging Entities in Twitter with Semantic Similarity. (arXiv:2207.02434v1 [cs.CL])
    Some events that will happen in the future could be important for companies, governments, and even our personal lives. Predicting these events before they take place is helpful for efficient decision-making. We call such events emerging entities. They have not taken place yet, and there is no information about them in knowledge bases. However, some clues exist in different areas, especially on social media. Thus, retrieving these types of entities is possible. This paper proposes a method for the early discovery of emerging entities. We use semantic clustering of short messages. To evaluate the performance of our proposal, we devise and utilize a performance evaluation metric. The results show that our proposed method finds emerging entities that Twitter trends do not always capture.  ( 2 min )
    Transfer Learning for Rapid Extraction of Thickness from Optical Spectra of Semiconductor Thin Films. (arXiv:2207.02209v1 [cs.LG])
    High-throughput experimentation with autonomous workflows, increasingly used to screen and optimize optoelectronic thin films, requires matching throughput of downstream characterizations. Despite being essential, thickness characterization lags in throughput. Although optical spectroscopic methods, e.g., spectrophotometry, provide quick measurements, a critical bottleneck is the ensuing manual fitting of optical oscillation models to the measured reflection and transmission. This study presents a machine-learning (ML) framework called thicknessML, which rapidly extracts film thickness from spectroscopic reflection and transmission. thicknessML leverages transfer learning to generalize to materials of different underlying optical oscillator models (i.e., different material classes). We demonstrate that thicknessML can extract film thickness from six perovskite samples in a two-stage process: (1) pre-training on a generic simulated dataset of the Tauc-Lorentz oscillator, and (2) transfer learning to a simulated perovskite dataset built from several literature perovskite refractive indices. Results show a pre-training thickness mean absolute percentage error (MAPE) of 5-7% and an experimental thickness MAPE of 6-19%.  ( 2 min )
    Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning. (arXiv:2207.02249v1 [cs.MA])
    Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.  ( 2 min )
    Linear Jamming Bandits: Sample-Efficient Learning for Non-Coherent Digital Jamming. (arXiv:2207.02365v1 [cs.LG])
    It has been shown (Amuru et al. 2015) that online learning algorithms can be effectively used to select optimal physical layer parameters for jamming against digital modulation schemes without a priori knowledge of the victim's transmission strategy. However, this learning problem involves solving a multi-armed bandit problem with a mixed action space that can grow very large. As a result, convergence to the optimal jamming strategy can be slow, especially when the victim and jammer's symbols are not perfectly synchronized. In this work, we remedy the sample efficiency issues by introducing a linear bandit algorithm that accounts for inherent similarities between actions. Further, we propose context features which are well-suited for the statistical features of the non-coherent jamming problem and demonstrate significantly improved convergence behavior compared to the prior art. Additionally, we show how prior knowledge about the victim's transmissions can be seamlessly integrated into the learning framework. We finally discuss limitations in the asymptotic regime.  ( 2 min )
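    For readers unfamiliar with linear bandits, a generic LinUCB-style loop conveys how similarity between actions is exploited through shared features. This is a sketch of linear bandits in general, not the paper's exact algorithm; the feature dimension and exploration width are illustrative.
        import numpy as np

        d = 16                      # action feature dimension (illustrative)
        A = np.eye(d)               # regularized design matrix
        b = np.zeros(d)             # reward-weighted feature sum
        alpha = 1.0                 # exploration width

        def select(action_features):
            """Pick the action with the highest upper confidence bound."""
            theta = np.linalg.solve(A, b)
            A_inv = np.linalg.inv(A)
            scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in action_features]
            return int(np.argmax(scores))

        def update(x, reward):
            """Fold the observed (action features, reward) pair into the model."""
            global A, b
            A += np.outer(x, x)     # every pull informs all similar actions
            b += reward * x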
    Multi-Label Retinal Disease Classification using Transformers. (arXiv:2207.02335v1 [cs.CV])
    Early detection of retinal diseases is one of the most important means of preventing partial or permanent blindness in patients. In this research, a novel multi-label classification system is proposed for the detection of multiple retinal diseases, using fundus images collected from a variety of sources. First, a new multi-label retinal disease dataset, the MuReD dataset, is constructed, using a number of publicly available datasets for fundus disease classification. Next, a sequence of post-processing steps is applied to ensure the quality of the image data and the range of diseases present in the dataset. For the first time in fundus multi-label disease classification, a transformer-based model optimized through extensive experimentation is used for image analysis and decision making. Numerous experiments are performed to optimize the configuration of the proposed system. It is shown that the approach performs better than state-of-the-art works on the same task by 7.9% and 8.1% in terms of AUC score for disease detection and disease classification, respectively. The obtained results further support the potential applications of transformer-based architectures in the medical imaging field.  ( 3 min )
    BioTABQA: Instruction Learning for Biomedical Table Question Answering. (arXiv:2207.02419v1 [cs.CL])
    Table Question Answering (TQA) is an important but under-explored task. Most of the existing QA datasets are in unstructured text format and only a few of them use tables as the context. To the best of our knowledge, no TQA datasets exist in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTABQA, using 22 templates and the context from a biomedical textbook on differential diagnosis. BioTABQA can not only be used to teach a model how to answer questions from tables but also to evaluate how a model generalizes to unseen questions, an important scenario for biomedical applications. To achieve the generalization evaluation, we divide the templates into 17 for training and 5 for cross-task evaluation. Then, we develop two baselines using single- and multi-task learning on BioTABQA. Furthermore, we explore instruction learning, a recent technique showing impressive generalization performance. Experimental results show that our instruction-tuned model outperforms the single- and multi-task baselines by ~23% and ~6% on average across various evaluation settings and, more importantly, outperforms the baselines by ~5% on cross-task evaluation.  ( 2 min )
    Federated and Transfer Learning: A Survey on Adversaries and Defense Mechanisms. (arXiv:2207.02337v1 [cs.LG])
    The advent of federated learning has facilitated large-scale data exchange amongst machine learning models while maintaining privacy. Despite its brief history, federated learning is rapidly evolving to make wider use more practical. One of the most significant advancements in this domain is the incorporation of transfer learning into federated learning, which overcomes fundamental constraints of primary federated learning, particularly in terms of security. This chapter performs a comprehensive survey on the intersection of federated and transfer learning from a security point of view. The main goal of this study is to uncover potential vulnerabilities and defense mechanisms that might compromise the privacy and performance of systems that use federated and transfer learning.  ( 2 min )
    Towards Realistic Semi-Supervised Learning. (arXiv:2207.02269v1 [cs.CV])
    Deep learning is pushing the state of the art in many computer vision applications. However, it relies on large annotated data repositories, and capturing the unconstrained nature of real-world data remains unsolved. Semi-supervised learning (SSL) complements the annotated training data with a large corpus of unlabeled data to reduce annotation cost. The standard SSL approach assumes unlabeled data are from the same distribution as annotated data. Recently, ORCA [9] introduced a more realistic SSL problem, called open-world SSL, by assuming that the unannotated data might contain samples from unknown classes. This work proposes a novel approach to tackle SSL in the open-world setting, where we simultaneously learn to classify known and unknown classes. At the core of our method, we utilize sample uncertainty and incorporate prior knowledge about class distribution to generate reliable pseudo-labels for unlabeled data belonging to both known and unknown classes. Our extensive experimentation showcases the effectiveness of our approach on several benchmark datasets, where it substantially outperforms the existing state of the art on seven diverse datasets including CIFAR-100 (17.6%), ImageNet-100 (5.7%), and Tiny ImageNet (9.9%).  ( 2 min )
    Swin Deformable Attention U-Net Transformer (SDAUT) for Explainable Fast MRI. (arXiv:2207.02390v1 [cs.CV])
    Fast MRI aims to reconstruct a high-fidelity image from partially observed measurements. Exuberant development in fast MRI using deep learning has been witnessed recently. Meanwhile, novel deep learning paradigms, e.g., Transformer-based models, are fast-growing in natural language processing and have been promptly adopted for computer vision and medical image analysis due to their prominent performance. Nevertheless, due to the complexity of the Transformer, its application to fast MRI may not be straightforward. The main obstacle is that the computational cost of the self-attention layer, the core component of the Transformer, can be prohibitive for high-resolution MRI inputs. In this study, we propose a new Transformer architecture for fast MRI that couples the Shifted-Window (Swin) Transformer with a U-Net to reduce the network complexity. We incorporate deformable attention to provide explainability for our reconstruction model. We empirically demonstrate that our method achieves consistently superior performance on the fast MRI task. Besides, compared to state-of-the-art Transformer models, our method has fewer network parameters while revealing explainability. The code is publicly available at https://github.com/ayanglab/SDAUT.  ( 2 min )
    Transformers are Adaptable Task Planners. (arXiv:2207.02442v1 [cs.RO])
    Every home is different, and every person likes things done in their particular way. Therefore, home robots of the future need to both reason about the sequential nature of day-to-day tasks and generalize to users' preferences. To this end, we propose a Transformer Task Planner (TTP) that learns high-level actions from demonstrations by leveraging object attribute-based representations. TTP can be pre-trained on multiple preferences and shows generalization to unseen preferences using a single demonstration as a prompt in a simulated dishwasher loading task. Further, we demonstrate real-world dish rearrangement using TTP with a Franka Panda robotic arm, prompted using a single human demonstration.  ( 2 min )
    State-Augmented Learnable Algorithms for Resource Management in Wireless Networks. (arXiv:2207.02242v1 [cs.LG])
    We consider resource management problems in multi-user wireless networks, which can be cast as optimizing a network-wide utility function, subject to constraints on the long-term average performance of users across the network. We propose a state-augmented algorithm for solving the aforementioned radio resource management (RRM) problems, where, alongside the instantaneous network state, the RRM policy takes as input the set of dual variables corresponding to the constraints, which evolve depending on how much the constraints are violated during execution. We theoretically show that the proposed state-augmented algorithm leads to feasible and near-optimal RRM decisions. Moreover, focusing on the problem of wireless power control using graph neural network (GNN) parameterizations, we demonstrate the superiority of the proposed RRM algorithm over baseline methods across a suite of numerical experiments.  ( 2 min )
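    Concretely, for a long-term constraint of the form $\frac{1}{T}\sum_t g(x_t, p(x_t)) \geq c$, the dual variables the abstract refers to would evolve by a projected dual-ascent step such as $\lambda_{k+1} = \big[\lambda_k + \eta_\lambda \big(c - \frac{1}{T}\sum_{t=kT}^{(k+1)T-1} g(x_t, p(x_t; \lambda_k))\big)\big]_+$ (the notation here is illustrative, not taken from the paper): a dual variable grows while its constraint is violated and shrinks toward zero once the constraint is satisfied.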
  • Open

    Stochastic normalizing flows as non-equilibrium transformations. (arXiv:2201.08862v3 [hep-lat] UPDATED)
    Normalizing flows are a class of deep generative models that provide a promising route to sample lattice field theories more efficiently than conventional Monte Carlo simulations. In this work we show that the theoretical framework of stochastic normalizing flows, in which neural-network layers are combined with Monte Carlo updates, is the same that underlies out-of-equilibrium simulations based on Jarzynski's equality, which have been recently deployed to compute free-energy differences in lattice gauge theories. We lay out a strategy to optimize the efficiency of this extended class of generative models and present examples of applications.
    Distributional neural networks for electricity price forecasting. (arXiv:2207.02832v1 [q-fin.ST])
    We present a novel approach to probabilistic electricity price forecasting (EPF) which utilizes distributional artificial neural networks. The novel network structure for EPF is based on a regularized distributional multilayer perceptron (DMLP) which contains a probability layer. Using the TensorFlow Probability framework, the neural network's output is defined to be a distribution, either normal or the potentially skewed and heavy-tailed Johnson's SU (JSU). The method is compared against state-of-the-art benchmarks in a forecasting study covering day-ahead electricity prices in the German market. The results show evidence of the importance of higher moments when modeling electricity prices.
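    The probability-layer construction is easy to sketch with TensorFlow Probability's DistributionLambda layer. This is a minimal sketch, not the paper's architecture; the layer sizes are illustrative, and a JSU head would output four parameters (skewness, tailweight, loc, scale) instead of two.
        import tensorflow as tf
        import tensorflow_probability as tfp

        tfd = tfp.distributions

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(2),  # parameters: loc and (pre-softplus) scale
            tfp.layers.DistributionLambda(
                lambda t: tfd.Normal(loc=t[..., :1],
                                     scale=1e-3 + tf.math.softplus(t[..., 1:]))),
        ])

        # Train by maximizing the log-likelihood of the observed prices.
        negloglik = lambda y, dist: -dist.log_prob(y)
        model.compile(optimizer="adam", loss=negloglik)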
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v2 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Epistemic Neural Networks. (arXiv:2107.08924v5 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. Conventional neural networks lack this capability and, since most research has focused on marginal predictions, this shortcoming has been largely overlooked. We introduce the epistemic neural network (ENN) as an interface for models that represent uncertainty as required to generate useful joint predictions. While prior approaches to uncertainty modeling such as Bayesian neural networks can be expressed as ENNs, this new interface facilitates comparison of joint predictions and the design of novel architectures and algorithms. In particular, we introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. We demonstrate this efficacy across synthetic data, ImageNet, and some reinforcement learning tasks. As part of this effort we open-source experiment code.
    Variational Flow Graphical Model. (arXiv:2207.02722v1 [stat.ML])
    This paper introduces a novel approach to embed flow-based models with hierarchical structures. The proposed framework is named the Variational Flow Graphical (VFG) model. VFGs learn representations of high-dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. By leveraging the expressive power of neural networks, VFGs produce a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, which usually require a high-dimensional latent space involving many trivial variables. Aggregation nodes are introduced in the VFG models to integrate forward-backward hierarchical information via a message-passing scheme. Maximizing the evidence lower bound (ELBO) of the data likelihood aligns the forward and backward messages in each aggregation node, achieving a consistent node state. Algorithms have been developed to learn model parameters through gradient updates on the ELBO objective. The consistency of aggregation nodes enables VFGs to perform tractable inference on graphical structures. Besides representation learning and numerical inference, VFGs provide a new approach for distribution modeling on datasets with graphical latent structures. Additionally, a theoretical study shows that VFGs are universal approximators by leveraging the implicitly invertible flow-based structures. With flexible graphical structures and superior expressive power, VFGs could potentially be used to improve probabilistic inference. In the experiments, VFGs achieve improved ELBO and likelihood values on multiple datasets.
    Improved conformalized quantile regression. (arXiv:2207.02808v1 [stat.ML])
    Conformalized quantile regression is a procedure that inherits the advantages of conformal prediction and quantile regression. That is, we use quantile regression to estimate the true conditional quantile and then apply a conformal step on a calibration set to ensure marginal coverage. In this way, we get adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness, as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables, weighted by their permutation importance, with an optimized k-means and apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both on open datasets.
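    For reference, the classic conformalized quantile regression baseline that the paper refines looks roughly as follows. This is a sketch with synthetic heteroscedastic data; the model choice and split sizes are illustrative, and the paper's clustering refinement is not shown.
        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 5, size=(1200, 1))
        y = np.sin(X[:, 0]) + rng.normal(scale=0.3 * (1 + X[:, 0]))  # noise grows with x

        X_tr, X_cal, X_te = X[:800], X[800:1000], X[1000:]
        y_tr, y_cal = y[:800], y[800:1000]

        alpha = 0.1
        lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_tr, y_tr)
        hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_tr, y_tr)

        # Conformity scores on the calibration set, then a finite-sample quantile.
        scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
        q = np.quantile(scores, np.ceil((1 - alpha) * (len(y_cal) + 1)) / len(y_cal))

        lower, upper = lo.predict(X_te) - q, hi.predict(X_te) + q  # ~90% marginal coverage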
    Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures. (arXiv:2104.01672v3 [stat.ML] UPDATED)
    Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
    Don't Pay Attention to the Noise: Learning Self-supervised Representations of Light Curves with a Denoising Time Series Transformer. (arXiv:2207.02777v1 [astro-ph.IM])
    Astrophysical light curves are particularly challenging data objects due to the intensity and variety of noise contaminating them. Yet, despite the astronomical volumes of light curves available, the majority of algorithms used to process them are still operating on a per-sample basis. To remedy this, we propose a simple Transformer model -- called Denoising Time Series Transformer (DTST) -- and show that it excels at removing the noise and outliers in datasets of time series when trained with a masked objective, even when no clean targets are available. Moreover, the use of self-attention enables rich and illustrative queries into the learned representations. We present experiments on real stellar light curves from the Transiting Exoplanet Survey Satellite (TESS), showing advantages of our approach compared to traditional denoising techniques.
    Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design. (arXiv:2207.02575v1 [cs.LG])
    While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v1 [cs.LG])
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct \emph{PAC prediction sets}, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.
    When does SGD favor flat minima? A quantitative characterization via linear stability. (arXiv:2207.02628v1 [stat.ML])
    The observation that stochastic gradient descent (SGD) favors flat minima has played a fundamental role in understanding implicit regularization of SGD and guiding the tuning of hyperparameters. In this paper, we provide a quantitative explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the flatness -- as measured by the Frobenius norm of the Hessian -- is bounded independently of the model size and sample size. The key to obtaining these results is exploiting the particular geometry awareness of SGD noise: 1) the noise magnitude is proportional to loss value; 2) the noise directions concentrate in the sharp directions of local landscape. This property of SGD noise provably holds for linear networks and random feature models (RFMs) and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are justified by extensive numerical experiments.
    Adaptive deep learning for nonparametric time series regression. (arXiv:2207.02546v1 [math.ST])
    In this paper, we develop a general theory for adaptive nonparametric estimation of mean functions of nonstationary and nonlinear time series using deep neural networks (DNNs). We first consider two types of DNN estimators, non-penalized and sparse-penalized DNN estimators, and establish their generalization error bounds for general nonstationary time series. We then derive minimax lower bounds for estimating mean functions belonging to a wide class of nonlinear autoregressive (AR) models that include nonlinear generalized additive AR, single index, and threshold AR models. Building upon the results, we show that the sparse-penalized DNN estimator is adaptive and attains the minimax optimal rates up to a poly-logarithmic factor for many nonlinear AR models. Through numerical simulations, we demonstrate the usefulness of the DNN methods for estimating nonlinear AR models with intrinsic low-dimensional structures and discontinuous or rough mean functions, which is consistent with our theory.
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v5 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series with large jumps. Our contributions are threefold. First, we propose a model called the L\'evy-induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable L\'evy motion to model complex time series data and solves the problem through neural network approximation. Second, we theoretically prove the convergence of our algorithm with respect to the hyper-parameters of the neural network, and obtain an error bound without the curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find that accuracy increases through the use of non-Gaussian L\'evy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of L\'evy motion, and prediction lengths.
    Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture. (arXiv:2112.08534v2 [cs.LG] UPDATED)
    We introduce the Momentum Transformer, an attention-based deep learning architecture which outperforms benchmark momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM) architectures, which are sequential in nature, the attention mechanism provides our architecture with a direct connection to all previous time-steps. Our architecture enables us to learn longer-term dependencies, improves performance when considering returns net of transaction costs and naturally adapts to new market regimes, such as during the SARS-CoV-2 crisis. The Momentum Transformer is inherently interpretable, providing us with greater insights into our deep learning momentum trading strategy, including how it blends different classical strategies and the past time-steps which are of the greatest significance to the model.
    Private Matrix Approximation and Geometry of Unitary Orbits. (arXiv:2207.02794v1 [cs.DS])
    Consider the following optimization problem: Given $n \times n$ matrices $A$ and $\Lambda$, maximize $\langle A, U\Lambda U^*\rangle$ where $U$ varies over the unitary group $\mathrm{U}(n)$. This problem seeks to approximate $A$ by a matrix whose spectrum is the same as $\Lambda$ and, by setting $\Lambda$ to be appropriate diagonal matrices, one can recover matrix approximation problems such as PCA and rank-$k$ approximation. We study the problem of designing differentially private algorithms for this optimization problem in settings where the matrix $A$ is constructed using users' private data. We give efficient and private algorithms that come with upper and lower bounds on the approximation error. Our results unify and improve upon several prior works on private matrix approximation problems. They rely on extensions of packing/covering number bounds for Grassmannians to unitary orbits which should be of independent interest.
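    Without the privacy constraint, the inner problem has a classical closed form: align the eigenbasis of $A$ with the target spectrum, sorted in the same order. A short non-private sketch of that baseline follows (names and sizes are illustrative; the paper's private algorithms are not reproduced here).
        import numpy as np

        rng = np.random.default_rng(0)
        S = rng.standard_normal((5, 5)); A = (S + S.T) / 2  # symmetric A
        lam = np.array([3.0, 1.0, 0.0, 0.0, 0.0])           # target spectrum (descending)

        w, Q = np.linalg.eigh(A)          # eigh returns eigenvalues in ascending order
        Q = Q[:, ::-1]                    # reorder eigenvectors to descending, matching lam
        best = Q @ np.diag(lam) @ Q.T     # approximant with spectrum exactly lam
        print(np.trace(A @ best))         # equals sum_i w_desc[i] * lam[i]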
    Reconstructing Nonlinear Dynamical Systems from Multi-Modal Time Series. (arXiv:2111.02922v3 [cs.LG] UPDATED)
    Empirically observed time series in physics, biology, or medicine, are commonly generated by some underlying dynamical system (DS) which is the target of scientific interest. There is an increasing interest to harvest machine learning methods to reconstruct this latent DS in a data-driven, unsupervised way. In many areas of science it is common to sample time series observations from many data modalities simultaneously, e.g. electrophysiological and behavioral time series in a typical neuroscience experiment. However, current machine learning tools for reconstructing DSs usually focus on just one data modality. Here we propose a general framework for multi-modal data integration for the purpose of nonlinear DS reconstruction and the analysis of cross-modal relations. This framework is based on dynamically interpretable recurrent neural networks as general approximators of nonlinear DSs, coupled to sets of modality-specific decoder models from the class of generalized linear models. Both an expectation-maximization and a variational inference algorithm for model training are advanced and compared. We show on nonlinear DS benchmarks that our algorithms can efficiently compensate for too noisy or missing information in one data channel by exploiting other channels, and demonstrate on experimental neuroscience data how the algorithm learns to link different data domains to the underlying dynamics.
    Expectation Distance-based Distributional Clustering for Noise-Robustness. (arXiv:2110.08871v3 [cs.LG] UPDATED)
    This paper presents a clustering technique that reduces susceptibility to data noise by learning and clustering the data distributions and then assigning each data point to the cluster of its distribution, thereby reducing the impact of noise on clustering results. This method involves introducing a new distance between distributions, namely the expectation distance (denoted ED), that goes beyond the state-of-the-art distribution distance of optimal mass transport (denoted $W_2$ for $2$-Wasserstein): the latter essentially depends only on the marginal distributions, while the former also employs information about the joint distributions. Using the ED, the paper extends classical $K$-means and $K$-medoids clustering to data distributions (rather than raw data) and introduces $K$-medoids using $W_2$. The paper also presents closed-form expressions of the ED distance measure for the case when the uncertainty is Gaussian. Results of applying the proposed ED and the $W_2$ distance measures to cluster real-world weather data are also presented, which involves efficiently extracting and using underlying uncertainty information in the form of means and variances (which, for example, suffice to characterize Gaussian distributions). The results show striking performance improvement over classical clustering of raw data, with higher accuracy realized for ED. This is because while $W_2$ employs only the marginal distributions, ignoring the correlations, the proposed ED also uses the joint distributions, factoring the correlations into the distance measure.
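    For the Gaussian case mentioned above, the $W_2$ baseline has a well-known closed form, sketched below; the paper's ED formula is not reproduced here, so this covers only the standard $2$-Wasserstein distance between Gaussians.
        import numpy as np
        from scipy.linalg import sqrtm

        def w2_gaussian(m1, C1, m2, C2):
            """Squared 2-Wasserstein distance between N(m1, C1) and N(m2, C2)."""
            Cs = sqrtm(C2)
            cross = sqrtm(Cs @ C1 @ Cs)   # PSD, so the root is real up to rounding
            return float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * np.real(cross)))

        # usage
        print(w2_gaussian(np.zeros(2), np.eye(2), np.ones(2), 2 * np.eye(2)))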
    Integral Probability Metrics PAC-Bayes Bounds. (arXiv:2207.00614v2 [stat.ML] UPDATED)
    We present a PAC-Bayes-style generalization bound which enables the replacement of the KL-divergence with a variety of Integral Probability Metrics (IPM). We provide instances of this bound with the IPM being the total variation metric and the Wasserstein distance. A notable feature of the obtained bounds is that they naturally interpolate between classical uniform convergence bounds in the worst case (when the prior and posterior are far away from each other), and preferable bounds in better cases (when the posterior and prior are close). This illustrates the possibility of reinforcing classical generalization bounds with algorithm- and data-dependent components, thus making them more suitable to analyze algorithms that use a large hypothesis space.
    Instance-optimal PAC Algorithms for Contextual Bandits. (arXiv:2207.02357v1 [stat.ML])
    In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(\epsilon,\delta)$-$\textit{PAC}$ setting: given a policy class $\Pi$ the goal of the learner is to return a policy $\pi\in \Pi$ whose expected reward is within $\epsilon$ of the optimal policy with probability greater than $1-\delta$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $\rho_{\Pi}$, and provide matching upper and lower bounds in terms of $\rho_{\Pi}$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
    Conditional Distribution Function Estimation Using Neural Networks for Censored and Uncensored Data. (arXiv:2207.02384v1 [stat.ME])
    Most work in neural networks focuses on estimating the conditional mean of a continuous response variable given a set of covariates. In this article, we consider estimating the conditional distribution function using neural networks for both censored and uncensored data. The algorithm is built upon the data structure particularly constructed for the Cox regression with time-dependent covariates. Without imposing any model assumption, we consider a loss function that is based on the full likelihood where the conditional hazard function is the only unknown nonparametric parameter, for which unconstrained optimization methods can be applied. Through simulation studies, we show the proposed method possesses desirable performance, whereas the partial likelihood method and the traditional neural networks with $L_2$ loss yield biased estimates when model assumptions are violated. We further illustrate the proposed method with several real-world data sets. The implementation of the proposed methods is made available at https://github.com/bingqing0729/NNCDE.

  • Open

    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 83 min )
    What are artificial intelligences that can automatically edit music, images, texts, beats in some way?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 84 min )
    I got some midjourney invites left !
    I don’t got any friends to give the invites to so who needs one! submitted by /u/projhect-AI [link] [comments]  ( 84 min )
    Is there an app/site/software that uses AI image recognition to organize images by similarity? I'm looking to sort a bunch of dall-e images
    Tried to explain as much as possible in the title. I did a "run" of DALL-E and I have already used photoshop's macros to crop each of them in a different file bc I feel like there's an interesting experience in watching it go through similar but different iterations, but I would like it to be sorted by similarity to make the most impact. Can any of you recommend me a way to do that? The first result I found in google pinged the antivirus so I felt like getting recommendations was the way to go. Here's an example of that kind of images I'm talking about https://imgur.com/a/miG2WWZ submitted by /u/quiteawhile [link] [comments]  ( 84 min )
    Elon Musk: "I hope that the AI is nice to us ... I've lost a lot of sleep thinking about AI as an existential risk ... I think there should probably should be a regulatory agency that oversees advanced AI, because it's a public safety risk." (2-minute clip)
    submitted by /u/Farnectarine4825 [link] [comments]  ( 84 min )
    AI Dream 61 - EPIC Nebula Exploration by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 84 min )
    Meta's latest open source AI can translate 200 languages
    submitted by /u/much_successes [link] [comments]  ( 85 min )
    Want to animate your photos from midjourney in 3D, high resolution 4k? Check out my new tutorial!
    submitted by /u/nalr00n [link] [comments]  ( 84 min )
    No Language Left Behind: Translating 200 languages with a single model - by Meta AI
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 84 min )
    when AGI hits its stride, the cost of all goods and services will fall
    submitted by /u/bartturner [link] [comments]  ( 84 min )
    Socially engineered
    I created this Reddit account in 2021 for Crypto only. It's not been used for about 7 months. I'm creating this thread to highlight a pattern I'm noticing in email sent from [noreply@redditmail.com](mailto:noreply@redditmail.com). Every email sent in 2021 up to October was totally related to my Crypto interest and activity here on Reddit. From October 2nd 2021 to April 22nd 2022 there was a gap where Reddit did not send any highlights or promotional emails. It appears all of the emails sent this year know more about me than I have ever shared on Reddit. While I was modding a 3DS it sent 3DS suggestions. Since December I've been experimenting with GPT-3, and now I'm suggested content from this forum. Since childhood I've had a passionate interest in robotics, AI, and software/hardware in g…  ( 103 min )
    “Universal explainers”
    What do you think of David Deutsch's theory of “universal explainers”? https://www.lesswrong.com/posts/HDyePg6oySYQ9hY4i/david-deutsch-on-universal-explainers-and-ai submitted by /u/Equal-Lingonberry517 [link] [comments]  ( 83 min )
    Websites/Programs for testing artificial intelligence
    What are the sites/programs that you can test some kind of artificial intelligence for free and uncomplicated? submitted by /u/NaturalMagicCat [link] [comments]  ( 84 min )
  • Open

    A Tutorial on Using Neural Style PT to Transfer the Style of One Image to Another
    View the tutorial here: HERE This tutorial teaches you how to transfer the style of one image to another image using neural-style-pt. Below is an imgur gallery showing off the transformation process. https://imgur.com/gallery/iMlkkQi Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 84 min )
    Has anyone tried using an external NVIDIA GPU for machine learning on a MacBook Pro?
    submitted by /u/PopOk539 [link] [comments]  ( 84 min )
  • Open

    Break through language barriers with Amazon Transcribe, Amazon Translate, and Amazon Polly
    Imagine a surgeon taking video calls with patients across the globe without the need of a human translator. What if a fledgling startup could easily expand their product across borders and into new geographical markets by offering fluid, accurate, multilingual customer support and sales, all without the need of a live human translator? What happens […]  ( 10 min )
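    A rough sketch of the Translate-to-Polly half of such a pipeline with boto3 follows (the full flow also runs Amazon Transcribe as an asynchronous job first; the language codes, voice, and file name are illustrative):
        import boto3

        translate = boto3.client("translate")
        polly = boto3.client("polly")

        text = "The patient should rest for two days."
        out = translate.translate_text(Text=text,
                                       SourceLanguageCode="en",
                                       TargetLanguageCode="es")

        # Synthesize the translated text with a Spanish voice.
        speech = polly.synthesize_speech(Text=out["TranslatedText"],
                                         OutputFormat="mp3",
                                         VoiceId="Lupe")
        with open("reply_es.mp3", "wb") as f:
            f.write(speech["AudioStream"].read())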
  • Open

    Dijkstra extends Pythagoras
    Suppose a triangle has sides a, b, and c. Label the angles opposite these three sides α, β, and γ respectively. Edsger Dijkstra published (EWD975-0) a note proving the following extension of the Pythagorean theorem: sgn(α + β – γ) = sgn(a² + b² – c²). Here the sgn function is -1, 0, or 1 […] Dijkstra extends Pythagoras first appeared on John D. Cook.  ( 4 min )
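    The identity is easy to spot-check numerically with the law of cosines (a quick sketch; the sample count and tolerance are arbitrary):
        import numpy as np

        rng = np.random.default_rng(1)
        for _ in range(10_000):
            a, b, c = np.sort(rng.uniform(0.1, 1.0, 3))     # c is the longest side
            if a + b <= c:                                  # reject non-triangles
                continue
            if abs(a*a + b*b - c*c) < 1e-9:                 # skip near-right triangles
                continue
            alpha = np.arccos((b*b + c*c - a*a) / (2*b*c))  # law of cosines
            beta  = np.arccos((a*a + c*c - b*b) / (2*a*c))
            gamma = np.arccos((a*a + b*b - c*c) / (2*a*b))
            assert np.sign(alpha + beta - gamma) == np.sign(a*a + b*b - c*c)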
  • Open

    [D] How would you measure the correlation of the gradient across iterations?
    One simple thing one could do is take the dot product between the current and the n-1 gradient. But this will of course not be very meaningful as what really matters is a (sort-of) average correlation across several iterations, which will not be revealed from doing such a local comparison (using gradients from step n and n-1). Ideally it would be a calculation that would not require keeping around old gradients. Any ideas? submitted by /u/fasttosmile [link] [comments]  ( 85 min )
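    One way to approximate that without keeping a window of old gradients is to maintain an exponential moving average of past gradients and score each new gradient against it (a sketch; the decay rate is arbitrary):
        import torch

        ema = None
        beta = 0.9

        def grad_correlation(model):
            """Cosine similarity of the current gradient against an EMA of past ones."""
            global ema
            g = torch.cat([p.grad.flatten() for p in model.parameters()
                           if p.grad is not None])
            if ema is None:
                ema = g.clone()
                return 1.0
            cos = torch.nn.functional.cosine_similarity(g, ema, dim=0).item()
            ema.mul_(beta).add_(g, alpha=1 - beta)   # update the running average
            return cos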
    [D] Handling OOV in sequence generation
    What are some methods to handle OOV words when generating sequences? For example for some n-gram implementations, I've seen all tokens removed from the candidate list of words to be sampled from given the prior n-gram, and if there are no other candidates the generated text is ended. Curious to learn about some other methods to deal with OOV. submitted by /u/MLJungle [link] [comments]  ( 85 min )
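    One common alternative to ending generation is backoff: when the full context has no continuation, retry with progressively shorter contexts, as in this sketch (a stupid-backoff-style sampler; the counts structure is hypothetical):
        import random

        def sample_next(context, ngram_counts):
            """ngram_counts maps a context tuple to a {token: count} dict."""
            for k in range(len(context), -1, -1):
                cands = ngram_counts.get(tuple(context[-k:]) if k else ())
                if cands:
                    tokens, counts = zip(*cands.items())
                    return random.choices(tokens, weights=counts)[0]
            return None  # even the empty-context (unigram) table is missing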
    [R] CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
    Paper: https://arxiv.org/pdf/2207.01780.pdf Github: https://github.com/salesforce/CodeRL Abstract: Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark. submitted by /u/Singularian2501 [link] [comments]  ( 86 min )
    [D] Why aren't there much people working on causal machine learning?
    It seems Judea Pearl, Yoshua Bengio, Elias Bareinboim and a handful of other researchers are only people who are working on causal inference and machine learning. Is causal machine learning still a niche field? Also, do you know any researcher working on causal machine learning at Berkeley? submitted by /u/After_Philosopher572 [link] [comments]  ( 87 min )
    [D] Object Detection trained on simulated renderings unable to converge on real images - why?
    I wrote a program in Unity that generated millions of fake images using the HDRP rendering pipeline. For starters I only want to detect a bottle of "ITO EN" ice-tea. Here is an example (left is real, right is the fake rendering). I have a simple 3-layer resnet CNN with 3 blocks each, and use a Global Average Pooling layer at the end to visualize the detection. Using the simulation dataset only I get an accuracy of 97% or higher. Using the real dataset I only get ~70% accuracy. I wanna add that this is not a result of over-training, (a) because I use a validation set and stop training if it hasn't improved, and (b) the test set performs very well. This is infuriating, because the image dataset is extremely diverse and I use a ton of image transformations in order to provide a very high level of diversity. I also use various levels of lighting, bloom, camera exposures, motion blur, changing materials for all assets, as well as changing the properties for the target (the bottle), such as glossiness, reflection, emissive lighting, and so on. Here is an example for the rendered dataset that is used for training, and here is an example for the real dataset. Anyone got an idea why this isn't working out? submitted by /u/tmuxed [link] [comments]  ( 88 min )
    [D] How to correctly transform Cityscapes Masks to Bounding Boxes?
    As the title suggests, I would like to know the correct way to pre-process the cityscapes dataset for object detection. There are multiple ways how this can be done. There is a version in Detectron2, in MM Detection, there is this. Which one is the correct way, without getting errors in the labels? Anybody worked with this before? Would be glad if anybody might have an idea. submitted by /u/SeucheAchat9115 [link] [comments]  ( 86 min )
    [P] Tutorial: Serverless MLOps pipelines with Vertex AI and ZenML
    At ZenML, we created a guide to easily run MLOps pipelines on Google Cloud Platform with Vertex AI. I thought I'd share it here because I think it might be useful for people who are just starting MLOps on GCP. Blog post: https://blog.zenml.io/vertex-ai-blog/ Full video: https://youtu.be/qgvmvexGv_c Why is this better than going through the Vertex AI SDK? 1) ZenML steps and pipelines can be written with a simple decorator pattern that is easily approachable for a data scientist (see the sketch below). 2) ZenML takes care of storing and versioning pythonic objects between steps of a ZenML pipeline. 3) ZenML provides first-class integrations into other MLOps tools that you can leverage natively in your pipelines; for example, you can track experiments on MLflow easily. 4) ZenML pipelines can be run locally first, and then deployed instantly. You can run a ZenML pipeline not only on Vertex, but also on Airflow, Kubeflow, Kubernetes, or wherever else you'd like! Watch the full video: https://www.youtube.com/watch?v=qgvmvexGv_c&ab_channel=ZenML I bet the GCP Vertex AI folk here might like the above video. It isn't just about ZenML either but more of a broader look into the different components that go into running ML in production on GCP (Container Registry, Cloud Storage, Secret Manager, Vertex, Cloud SQL). Would love to hear more feedback on the video or blog! submitted by /u/htahir1 [link] [comments]  ( 86 min )
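    For readers who haven't seen the decorator pattern mentioned in the ZenML post above, a minimal sketch follows; exact import paths and signatures vary across ZenML versions, so treat this as illustrative rather than copy-paste ready.
        from zenml.steps import step
        from zenml.pipelines import pipeline

        @step
        def load_data() -> dict:
            return {"features": [[1.0], [2.0]], "labels": [0, 1]}

        @step
        def train_model(data: dict) -> float:
            return 0.9  # stand-in for real training; returns a dummy score

        @pipeline
        def training_pipeline(load_data, train_model):
            data = load_data()
            train_model(data=data)

        # Wire concrete steps into the pipeline and run it locally.
        training_pipeline(load_data=load_data(), train_model=train_model()).run()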
    [P] Some new Sherlock Holmes stories (GPT-3)
    I thought I share some of the Sherlock Holmes stories I created with various prompts using GPT-3. Might be fun for some fans, but overall the stories are all a bit superficial, although some of them made me laugh (the fourth one in particular). ​ John Watson was having a cup of tea in his flat when he heard a knock at the door. He got up to answer it, and found Sherlock Holmes standing there, looking rather grave. "Watson, I'm afraid I have some bad news," said Sherlock. "I've just been to Baker Street, where I found Mrs. Hudson in a state of hysterics. It seems that a Mr. Bartholomew Jones was found dead in his study, and Mrs. Hudson is convinced that it was murder." "That is rather shocking," said Watson. "Do you have any idea who might have done it?" "I have some suspicions," …  ( 100 min )
    [R] Detectron2 STMDA-RetinaNet
    Hello, i am happy to share with you one of my latest work for domain adaptation built on top of Detectron2 object detector model (RetinaNet). Link to the github repo STMDA-RetinaNet: https://github.com/fpv-iplab/STMDA-RetinaNet submitted by /u/CapitalShake3085 [link] [comments]  ( 85 min )
    [R] How Machine Learning is Used in Finance and Banking
    Machine learning solutions are already embedded in the finance and banking industry. In this article, we reviewed the most popular use cases of ML in banking and shared practical tips on how to implement it into your business. https://exadel.com/news/how-machine-learning-is-used-in-finance-and-banking submitted by /u/lklimusheuskaja [link] [comments]  ( 85 min )
    Jupyter Notebook Competition coming up! [News]
    The Jupyter Notebook Competition deadline is fast approaching! Don't miss out on your chance to contribute to a community-driven resource of notebooks on the Copernicus WEkEO platform, AND be in with a chance of winning cash prizes! Visit: https://www.eumetsat.int/features/new-jupyter-notebook-competition submitted by /u/EUMETSAT [link] [comments]  ( 85 min )
    [News] Ian Goodfellow joins DeepMind as a Research Scientist
    Per his tweet at https://twitter.com/goodfellow_ian/status/1544638709039091717, Goodfellow will be a research scientist under Oriol Vinyals' Deep Learning team. submitted by /u/The_Removed [link] [comments]  ( 90 min )
    [P] Comparing DevOps to MLOps to analyse which tools are doing well in the market
    Hi all, I've been an active practitioner in deep learning and wanted to build something in MLOps. So I dug deeper into how DevOps evolved and checked whether MLOps can take the same path. The findings are striking: absolutely every tool doing well in the MLOps market is a clear counterpart of a DevOps tool. Here is my blog on it. Looking for feedback; if you have any comments, let me know and I will add them. https://sachinchandra.substack.com/p/bringing-software-development-principles submitted by /u/scb_11 [link] [comments]  ( 89 min )
    MLGO: A Machine Learning Framework for Compiler Optimization
    Posted by Yundi Qian, Software Engineer, Google Research and Mircea Trofin, Software Engineer, Google Core The question of how to compile faster and smaller code arose together with the birth of modern computers. Better code optimization can significantly reduce the operational cost of large datacenter applications. The size of compiled code matters the most to mobile and embedded systems or software deployed on secure boot partitions, where the compiled binary must fit in tight code size budgets. With advances in the field, the headroom has been heavily squeezed with increasingly complicated heuristics, impeding maintenance and further improvements. Recent research has shown that machine learning (ML) can unlock more opportunities in compiler optimization by replacing complicated heuri…  ( 25 min )
    "Offline RL Policies Should be Trained to be Adaptive", Ghosh et al 2022
    submitted by /u/gwern [link] [comments]  ( 84 min )
    Reinforcement Learning without Reward Engineering
    submitted by /u/Euphetar [link] [comments]  ( 84 min )
    d4rl PyTorch Dataloader
    I need to load some offline RL data, which is accessible via a similar interface as `d4rl`. It uses an HDF5 file for storage under the hood. I want to write a Dataloader in PyTorch, which is something I haven't done before for custom data. I have started implementing a custom subclass of PyTorch's `Dataset`. In the docs it says that `__getitem__` shall return one example at the given index. I'm worried that naively getting one data point from the HDF5 file and returning that will be way too slow. Am I going to have to come up with a very smart `__getitem__` function that loads more than required from disk, saves that in a smart data structure, and next time checks that data structure first before issuing an I/O request? Edit: typo submitted by /u/lemlo100 [link] [comments]  ( 84 min )
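    Not an authoritative answer, but a common pattern is to open the HDF5 file lazily (once per worker) and index it directly in `__getitem__`; HDF5's chunk cache plus several DataLoader workers often makes this fast enough before any custom caching layer is needed. A minimal sketch, assuming the usual d4rl-style keys (`observations`, `actions`, `rewards`) and a hypothetical file name:

        import h5py
        import torch
        from torch.utils.data import DataLoader, Dataset

        class OfflineRLDataset(Dataset):
            def __init__(self, path):
                self.path = path
                self._file = None  # opened lazily so each worker gets its own handle
                with h5py.File(path, "r") as f:
                    self._len = f["observations"].shape[0]

            def __len__(self):
                return self._len

            def __getitem__(self, idx):
                if self._file is None:
                    self._file = h5py.File(self.path, "r")
                f = self._file
                return (
                    torch.as_tensor(f["observations"][idx]),
                    torch.as_tensor(f["actions"][idx]),
                    torch.as_tensor(f["rewards"][idx]),
                )

        # Several workers + HDF5's own chunk cache usually hide per-item I/O latency.
        loader = DataLoader(OfflineRLDataset("halfcheetah_medium.hdf5"),
                            batch_size=256, shuffle=True, num_workers=4)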
    Multi-Armed Bandit versions
    Hello everyone! I just started working with multi-armed bandits. I have two directions I could explore, and if anyone knows of any resources (books, research papers, etc.), that would be awesome! First, I would like to implement multiple agents which can share knowledge with each other. For example, two agents who sell ice cream want to offer the best flavor, but they sell at different locations, which can affect which flavor is actually best at each location. So the best option might not be the same, but if a lot of customers start to buy a specific flavor at one place, it might be worth exploring for the other agent. Second, when trying to determine the best price (a continuous value), the arms are now placed at different prices. Since we only have distinct prices, the actual best price will most likely not be among the options. How could one tackle this problem? I'm not expecting anyone to take the time to explain my problems, but if you know of any good resources, please share! Thanks in advance! :) submitted by /u/AnkanTV [link] [comments]  ( 85 min )
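    As a toy illustration of the first direction (ours, not from any specific paper), two epsilon-greedy agents can share their current best arm as an exploration hint:

        import random

        class EpsilonGreedyAgent:
            # One agent per location; values track the running mean reward per arm.
            def __init__(self, n_arms, epsilon=0.1):
                self.counts = [0] * n_arms
                self.values = [0.0] * n_arms
                self.epsilon = epsilon

            def select_arm(self, hint=None):
                # With probability epsilon explore; optionally bias exploration
                # toward an arm another agent reports as promising.
                if random.random() < self.epsilon:
                    return hint if hint is not None else random.randrange(len(self.values))
                return max(range(len(self.values)), key=lambda a: self.values[a])

            def update(self, arm, reward):
                self.counts[arm] += 1
                self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

            def best_arm(self):
                return max(range(len(self.values)), key=lambda a: self.values[a])

        # Two ice-cream sellers share only their current best flavor as a hint.
        a, b = EpsilonGreedyAgent(5), EpsilonGreedyAgent(5)
        arm = a.select_arm(hint=b.best_arm())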
    Art by Artificial Intelligence: AI Generated Paintings
    AI has brought a new life to art.  ( 7 min )
    Your Predictions Are Only As Good As Your Data
    Testing Data Vs Training Data In Machine Learning  ( 14 min )
    Startup lets doctors classify skin conditions with the snap of a picture
    Piction Health, founded by Susan Conover SM ’15, uses machine learning to help physicians identify and manage skin disease.  ( 8 min )
    An Empirical Study of Implicit Regularization in Deep Offline RL. (arXiv:2207.02099v1 [cs.LG])
    Deep neural networks are the most commonly used function approximators in offline Reinforcement Learning these days. Prior works have shown that neural nets trained with TD-learning and gradient descent can exhibit implicit regularization that can be characterized by under-parameterization of these networks. Specifically, the rank of the penultimate feature layer, also called the effective rank, has been observed to drastically collapse during training. In turn, this collapse has been argued to reduce the model's ability to further adapt in later stages of learning, leading to diminished final performance. Such an association between the effective rank and performance makes effective rank compelling for offline RL, primarily for offline policy evaluation. In this work, we conduct a careful empirical study on the relation between effective rank and performance on three offline RL datasets: bsuite, Atari, and DeepMind Lab. We observe that a direct association exists only in restricted settings and disappears in the more extensive hyperparameter sweeps. Also, we empirically identify three phases of learning that explain the impact of implicit regularization on the learning dynamics, and find that bootstrapping alone is insufficient to explain the collapse of the effective rank. Further, we show that several other factors could confound the relationship between effective rank and performance and conclude that studying this association under simplistic assumptions could be highly misleading.  ( 3 min )
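    As a concrete reference, here is one common definition of effective rank used in the implicit under-parameterization literature (a sketch of ours; the paper's exact measure may differ in details): the smallest number of singular values of the penultimate-layer feature matrix needed to capture a 1 - delta fraction of the total spectral mass.

        import torch

        def effective_rank(features, delta=0.01):
            # Singular values in descending order; find the smallest k whose
            # cumulative mass reaches a 1 - delta fraction of the total.
            s = torch.linalg.svdvals(features)
            ratios = torch.cumsum(s, dim=0) / s.sum()
            return int((ratios < 1.0 - delta).sum().item()) + 1

        feats = torch.randn(512, 256)   # a batch of penultimate-layer features
        print(effective_rank(feats))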
    Regret analysis of the Piyavskii-Shubert algorithm for global Lipschitz optimization. (arXiv:2002.02390v4 [cs.LG] UPDATED)
    We consider the problem of maximizing a non-concave Lipschitz multivariate function over a compact domain by sequentially querying its (possibly perturbed) values. We study a natural algorithm designed originally by Piyavskii and Shubert in 1972, for which we prove new bounds on the number of evaluations of the function needed to reach or certify a given optimization accuracy. Our analysis uses a bandit-optimization viewpoint and solves an open problem from Hansen et al. (1991) by bounding the number of evaluations to certify a given accuracy with a near-optimal sum of packing numbers.  ( 2 min )
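    For context, a minimal one-dimensional sketch of the Piyavskii-Shubert algorithm (ours, for illustration only): each iteration queries the maximizer of the piecewise-linear upper envelope implied by the Lipschitz constant L.

        def piyavskii_shubert(f, a, b, L, n_iter=50):
            # Maintain evaluated points; on each interval the envelope
            # min(y0 + L(x - x0), y1 + L(x1 - x)) peaks at one interior point.
            pts = sorted([(a, f(a)), (b, f(b))])
            for _ in range(n_iter):
                best = None
                for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                    x = 0.5 * (x0 + x1) + (y1 - y0) / (2 * L)
                    ub = 0.5 * (y0 + y1) + 0.5 * L * (x1 - x0)
                    if best is None or ub > best[1]:
                        best = (x, ub)
                pts = sorted(pts + [(best[0], f(best[0]))])
            return max(pts, key=lambda p: p[1])   # best point found

        import math
        # |d/dx (sin(3x) - 0.1x)| <= 3.1, so L = 3.1 is a valid Lipschitz constant.
        print(piyavskii_shubert(lambda x: math.sin(3 * x) - 0.1 * x, 0.0, 4.0, L=3.1))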
    De-Biasing Generative Models using Counterfactual Methods. (arXiv:2207.01575v2 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) and other generative methods have garnered growing interest not just for their generative properties but also for the ability to disentangle a low-dimensional latent variable space. However, few existing generative models take causality into account. We propose a new decoder-based framework named the Causal Counterfactual Generative Model (CCGM), which includes a partially trainable causal layer in which a part of a causal model can be learned without significantly impacting reconstruction fidelity. By learning the causal relationships between image semantic labels or tabular variables, we can analyze biases, intervene on the generative model, and simulate new scenarios. Furthermore, by modifying the causal structure, we can generate samples outside the domain of the original training data and use such counterfactual models to de-bias datasets. Thus, datasets with known biases can still be used to train the causal generative model and learn the causal relationships, but we can produce de-biased datasets on the generative side. Our proposed method combines a causal latent space VAE model with specific modification to emphasize causal fidelity, enabling finer control over the causal layer and the ability to learn a robust intervention framework. We explore how better disentanglement of causal learning and encoding/decoding generates higher causal intervention quality. We also compare our model against similar research to demonstrate the need for explicit generative de-biasing beyond interventions. Our initial experiments show that our model can generate images and tabular data with high fidelity to the causal framework and accommodate explicit de-biasing to ignore undesired relationships in the causal data compared to the baseline.  ( 3 min )
    Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices. (arXiv:2203.14177v3 [cs.LG] UPDATED)
    The area under the ROC curve (AUROC) has been vigorously applied for imbalanced classification and moreover combined with deep learning techniques. However, there is no existing work that provides sound information for peers to choose appropriate deep AUROC maximization techniques. In this work, we fill this gap from three aspects. (i) We benchmark a variety of loss functions with different algorithmic choices for the deep AUROC optimization problem. We study the loss functions in two categories: pairwise loss and composite loss, which include a total of 10 loss functions. Interestingly, we find that composite loss, as an innovative loss function class, shows more competitive performance than pairwise loss from both training convergence and testing generalization perspectives. Nevertheless, data with more corrupted labels favors a pairwise symmetric loss. (ii) Moreover, we benchmark and highlight the essential algorithmic choices such as positive sampling rate, regularization, normalization/activation, and optimizers. Key findings include: a higher positive sampling rate is likely to be beneficial for deep AUROC maximization; different datasets favor different weights of regularization; appropriate normalization techniques, such as sigmoid and $\ell_2$ score normalization, could improve model performance. (iii) For the optimization aspect, we benchmark SGD-type, Momentum-type, and Adam-type optimizers for both pairwise and composite loss. Our findings show that although Adam-type methods are more competitive from the training perspective, they do not outperform others from the testing perspective.  ( 3 min )
    Accelerating Hamiltonian Monte Carlo via Chebyshev Integration Time. (arXiv:2207.02189v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a popular method in sampling. While there are quite a few works studying various aspects of this method, an interesting question is how to choose its integration time to achieve acceleration. In this work, we consider accelerating the process of sampling from a distribution $\pi(x) \propto \exp(-f(x))$ via HMC with time-varying integration time. When the potential $f$ is $L$-smooth and $m$-strongly convex, i.e., for sampling from a log-smooth and strongly log-concave target distribution $\pi$, it is known that under a constant integration time, the number of iterations that ideal HMC takes to get an $\epsilon$ Wasserstein-2 distance to the target $\pi$ is $O( \kappa \log \frac{1}{\epsilon} )$, where $\kappa := \frac{L}{m}$ is the condition number. We propose a scheme of time-varying integration time based on the roots of Chebyshev polynomials. We show that in the case of quadratic potential $f$, i.e., when the target $\pi$ is a Gaussian distribution, ideal HMC with this choice of integration time only takes $O( \sqrt{\kappa} \log \frac{1}{\epsilon} )$ number of iterations to reach Wasserstein-2 distance less than $\epsilon$; this improvement on the dependence on condition number is akin to acceleration in optimization. The design and analysis of HMC with the proposed integration time is built on the tools of Chebyshev polynomials. Experiments show the advantage of adopting our scheme of time-varying integration time even for sampling from distributions with smooth strongly convex potentials that are not quadratic.  ( 3 min )
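    A hedged guess at the construction, inferred from the abstract alone (the paper's exact scheme may differ): take the roots of the degree-K Chebyshev polynomial rescaled to the eigenvalue interval [m, L] and cycle through their inverse square roots as integration times.

        import numpy as np

        def chebyshev_integration_times(m, L, K):
            # Roots of T_K on [-1, 1], mapped affinely to [m, L]; integration
            # times are then their inverse square roots (our assumption).
            k = np.arange(1, K + 1)
            nodes = np.cos((2 * k - 1) * np.pi / (2 * K))
            rescaled = 0.5 * (L + m) + 0.5 * (L - m) * nodes
            return 1.0 / np.sqrt(rescaled)

        print(chebyshev_integration_times(m=1.0, L=100.0, K=8))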
    Data-driven synchronization-avoiding algorithms in the explicit distributed structural analysis of soft tissue. (arXiv:2207.02194v1 [cs.DC])
    We propose a data-driven framework to increase the computational efficiency of the explicit finite element method in the structural analysis of soft tissue. An encoder-decoder long short-term memory deep neural network is trained based on the data produced by an explicit, distributed finite element solver. We leverage this network to predict synchronized displacements at shared nodes, minimizing the amount of communication between processors. We perform extensive numerical experiments to quantify the accuracy and stability of the proposed synchronization-avoiding algorithm.  ( 2 min )
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v3 [cs.LG] UPDATED)
    We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. To complement the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star} \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.  ( 3 min )
    $\pi$VAE: a stochastic process prior for Bayesian deep learning with MCMC. (arXiv:2002.06873v5 [cs.LG] UPDATED)
    Stochastic processes provide a mathematically elegant way to model complex data. In theory, they provide flexible priors over function classes that can encode a wide range of interesting assumptions. In practice, however, efficient inference by optimisation or marginalisation is difficult, a problem further exacerbated with big data and high dimensional input spaces. We propose a novel variational autoencoder (VAE) called the prior encoding variational autoencoder ($\pi$VAE). The $\pi$VAE is finitely exchangeable and Kolmogorov consistent, and thus is a continuous stochastic process. We use $\pi$VAE to learn low dimensional embeddings of function classes. We show that our framework can accurately learn expressive function classes such as Gaussian processes, but also properties of functions to enable statistical inference (such as the integral of a log Gaussian process). For popular tasks, such as spatial interpolation, $\pi$VAE achieves state-of-the-art performance both in terms of accuracy and computational efficiency. Perhaps most usefully, we demonstrate that the low dimensional independently distributed latent space representation learnt provides an elegant and scalable means of performing Bayesian inference for stochastic processes within probabilistic programming languages such as Stan.  ( 3 min )
    The StarCraft Multi-Agent Challenges+ : Learning of Multi-Stage Tasks and Environmental Factors without Precise Reward Functions. (arXiv:2207.02007v1 [cs.LG])
    In this paper, we propose a novel benchmark called the StarCraft Multi-Agent Challenges+, where agents learn to perform multi-stage tasks and to use environmental factors without precise reward functions. The previous challenge (SMAC), recognized as a standard benchmark of Multi-Agent Reinforcement Learning, is mainly concerned with ensuring that all agents cooperatively eliminate approaching adversaries only through fine manipulation with obvious reward functions. This challenge, on the other hand, is interested in the exploration capability of MARL algorithms to efficiently learn implicit multi-stage tasks and environmental factors as well as micro-control. This study covers both offensive and defensive scenarios. In the offensive scenarios, agents must learn to first find opponents and then eliminate them. The defensive scenarios require agents to use topographic features. For example, agents need to position themselves behind protective structures to make it harder for enemies to attack. We investigate MARL algorithms under SMAC+ and observe that recent approaches work well in similar settings to the previous challenges, but misbehave in offensive scenarios. Additionally, we observe that an enhanced exploration approach has a positive effect on performance but is not able to completely solve all scenarios. This study proposes new directions for future research.  ( 3 min )
    Probability density estimation for sets of large graphs with respect to spectral information using stochastic block models. (arXiv:2207.02168v1 [cs.LG])
    For graph-valued data sampled iid from a distribution $\mu$, the sample moments are computed with respect to a choice of metric. In this work, we equip the set of graphs with the pseudo-metric defined by the $\ell_2$ norm between the eigenvalues of the respective adjacency matrices. We use this pseudo-metric and the respective sample moments of a graph-valued data set to infer the parameters of a distribution $\hat{\mu}$ and interpret this distribution as an approximation of $\mu$. We verify experimentally that complex distributions $\mu$ can be approximated well by taking this approach.  ( 2 min )
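    The pseudo-metric itself is easy to compute; a short sketch (zero-padding the shorter spectrum for graphs of different sizes is our assumption, not necessarily the paper's):

        import numpy as np

        def spectral_distance(A1, A2):
            # l2 norm between the sorted adjacency eigenvalues; pad the shorter
            # spectrum with zeros and re-sort so sizes are comparable.
            e1 = np.linalg.eigvalsh(A1)
            e2 = np.linalg.eigvalsh(A2)
            n = max(len(e1), len(e2))
            e1 = np.sort(np.concatenate([e1, np.zeros(n - len(e1))]))
            e2 = np.sort(np.concatenate([e2, np.zeros(n - len(e2))]))
            return np.linalg.norm(e1 - e2)

        A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph
        B = np.array([[0, 1], [1, 0]], dtype=float)                   # single edge
        print(spectral_distance(A, B))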
    Federated Split GANs. (arXiv:2207.01750v1 [cs.LG])
    Mobile devices and the immense amount and variety of data they generate are key enablers of machine learning (ML)-based applications. Traditional ML techniques have shifted toward new paradigms such as federated (FL) and split learning (SL) to improve the protection of users' data privacy. However, these paradigms often rely on server(s) located in the edge or cloud to train computationally-heavy parts of an ML model to avoid draining the limited resources on client devices, resulting in exposing device data to such third parties. This work proposes an alternative approach to train computationally-heavy ML models on users' devices themselves, where the corresponding device data resides. Specifically, we focus on GANs (generative adversarial networks) and leverage their inherent privacy-preserving attribute. We train the discriminative part of a GAN with raw data on users' devices, whereas the generative model is trained remotely (e.g., on a server), for which there is no need to access true sensor data. Moreover, our approach ensures that the computational load of training the discriminative model is shared among users' devices, proportional to their computation capabilities, by means of SL. We implement our proposed collaborative training scheme of a computationally-heavy GAN model on real resource-constrained devices. The results show that our system preserves data privacy, keeps a short training time, and yields the same accuracy of model training as on unconstrained devices (e.g., cloud). Our code can be found on https://github.com/YukariSonz/FSL-GAN  ( 3 min )
    Task-agnostic Defense against Adversarial Patch Attacks. (arXiv:2207.01795v1 [cs.CV])
    Adversarial patch attacks mislead neural networks by injecting adversarial pixels within a designated local region. Patch attacks can be highly effective in a variety of tasks and physically realizable via attachment (e.g., a sticker) to real-world objects. Despite the diversity in attack patterns, adversarial patches tend to be highly textured and different in appearance from natural images. We exploit this property and present PatchZero, a task-agnostic defense against white-box adversarial patches. Specifically, our defense detects the adversarial pixels and "zeros out" the patch region by repainting with mean pixel values. We formulate the patch detection problem as a semantic segmentation task so that our model can generalize to patches of any size and shape. We further design a two-stage adversarial training scheme to defend against stronger adaptive attacks. We thoroughly evaluate PatchZero on the image classification (ImageNet, RESISC45), object detection (PASCAL VOC), and video classification (UCF101) datasets. Our method achieves SOTA robust accuracy without any degradation in benign performance.  ( 2 min )
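    The "zero out" step is simple once a patch mask is predicted; a minimal sketch (ours, assuming the mask comes from a separate patch-segmentation model and using the image's per-channel mean as the repainting value):

        import torch

        def zero_out_patch(image, patch_mask):
            # image: (C, H, W); patch_mask: (H, W) with 1 = adversarial pixel.
            mean = image.mean(dim=(1, 2), keepdim=True)   # per-channel mean, (C, 1, 1)
            mask = patch_mask.unsqueeze(0).float()        # (1, H, W)
            return image * (1 - mask) + mean * mask       # repaint masked region

        img = torch.rand(3, 224, 224)
        mask = torch.zeros(224, 224)
        mask[50:100, 50:100] = 1                          # a predicted patch location
        clean = zero_out_patch(img, mask)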
    Individual Topology Structure of Eye Movement Trajectories. (arXiv:2205.10667v4 [cs.CV] UPDATED)
    Traditionally, extracting patterns from eye movement data relies on statistics of different macro-events such as fixations and saccades. This requires an additional preprocessing step to separate the eye movement subtypes, often with a number of parameters on which the classification results depend. Besides that, definitions of such macro events are formulated in different ways by different researchers. We propose an application of a new class of features to the quantitative analysis of the structure of personal eye movement trajectories. This new class of features, based on algebraic topology, allows extracting patterns from different modalities of gaze, such as time series of coordinates and amplitudes, heatmaps, and point clouds, in a unified way at all scales from micro to macro. We experimentally demonstrate the competitiveness of the new class of features with the traditional ones, and their significant synergy when used together, on the person authentication task on the recently published eye movement trajectories dataset.  ( 2 min )
    Predicting Out-of-Domain Generalization with Local Manifold Smoothness. (arXiv:2207.02093v1 [cs.LG])
    Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 3,000 models evaluated on over 100 train/test domain pairs.  ( 3 min )
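    The estimator described in the abstract is straightforward; a minimal sketch (ours, with the augmentation left as a user-supplied function):

        import torch

        def manifold_smoothness(model, x, augment, n_samples=64):
            # Augment a test point n_samples times, classify each copy, and
            # return the fraction falling into the majority class.
            model.eval()
            with torch.no_grad():
                batch = torch.stack([augment(x) for _ in range(n_samples)])
                preds = model(batch).argmax(dim=1)
            return preds.bincount().max().item() / n_samples

        # Usage: any label-preserving augmentation works, e.g. additive noise.
        # s = manifold_smoothness(net, image, lambda t: t + 0.05 * torch.randn_like(t))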
    Multi-Agent Broad Reinforcement Learning for Intelligent Traffic Light Control. (arXiv:2203.04310v2 [cs.LG] UPDATED)
    Intelligent Traffic Light Control System (ITLCS) is a typical Multi-Agent System (MAS), which comprises multiple roads and traffic lights. Constructing a model of MAS for ITLCS is the basis for alleviating traffic congestion. Existing approaches to MAS are largely based on Multi-Agent Deep Reinforcement Learning (MADRL). Although the Deep Neural Network (DNN) of MADRL is effective, the training time is long, and the parameters are difficult to trace. Recently, Broad Learning Systems (BLS) provided an alternative way of learning in deep neural networks through a flat network. Moreover, Broad Reinforcement Learning (BRL) extends BLS to the Single Agent Deep Reinforcement Learning (SADRL) problem with promising results. However, BRL does not focus on the intricate structures and interaction of agents. Motivated by the features of MADRL and the issues of BRL, we propose a Multi-Agent Broad Reinforcement Learning (MABRL) framework to explore the function of BLS in MAS. First, unlike most existing MADRL approaches, which use a series of deep neural network structures, we model each agent with broad networks. Then, we introduce a dynamic self-cycling interaction mechanism to confirm the "3W" information: When to interact, Which agents need to be considered, What information to transmit. Finally, we conduct experiments based on the intelligent traffic light control scenario. We compare the MABRL approach with six different approaches, and experimental results on three datasets verify the effectiveness of MABRL.  ( 3 min )
    Multi-Scored Sleep Databases: How to Exploit the Multiple-Labels in Automated Sleep Scoring. (arXiv:2207.01910v1 [cs.LG])
    Study Objectives: Inter-scorer variability in scoring polysomnograms is a well-known problem. Most of the existing automated sleep scoring systems are trained using labels annotated by a single scorer, whose subjective evaluation is transferred to the model. When annotations from two or more scorers are available, the scoring models are usually trained on the scorer consensus. The averaged scorer's subjectivity is transferred into the model, losing information about the internal variability among different scorers. In this study, we aim to insert the multiple knowledge of the different physicians into the training procedure. The goal is to optimize model training, exploiting the full information that can be extracted from the consensus of a group of scorers. Methods: We train two lightweight deep learning based models on three different multi-scored databases. We exploit the label smoothing technique together with a soft-consensus (LSSC) distribution to insert the multiple knowledge into the training procedure of the model. We introduce the averaged cosine similarity metric (ACS) to quantify the similarity between the hypnodensity-graph generated by the models with LSSC and the hypnodensity-graph generated by the scorer consensus. Results: The performance of the models improves on all the databases when we train the models with our LSSC. We found an increase in ACS (up to 6.4%) between the hypnodensity-graph generated by the models trained with LSSC and the hypnodensity-graph generated by the consensus. Conclusions: Our approach enables a model to better adapt to the consensus of the group of scorers. Future work will focus on further investigations of different scoring architectures.  ( 3 min )
    Graph Clustering with Graph Neural Networks. (arXiv:2006.16904v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs - does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no - current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.  ( 3 min )
    StyleFlow For Content-Fixed Image to Image Translation. (arXiv:2207.01909v1 [cs.CV])
    Image-to-image (I2I) translation is a challenging topic in computer vision. We divide this problem into three tasks: strongly constrained translation, normally constrained translation, and weakly constrained translation. The constraint here indicates the extent to which the content or semantic information in the original image is preserved. Although previous approaches have achieved good performance in weakly constrained tasks, they failed to fully preserve the content in both strongly and normally constrained tasks, such as photo-realism synthesis, style transfer, and colorization. To achieve content-preserving transfer in strongly constrained and normally constrained tasks, we propose StyleFlow, a new I2I translation model that consists of normalizing flows and a novel Style-Aware Normalization (SAN) module. With the invertible network structure, StyleFlow first projects input images into deep feature space in the forward pass, while the backward pass utilizes the SAN module to perform content-fixed feature transformation and then projects back to image space. Our model supports both image-guided translation and multi-modal synthesis. We evaluate our model in several I2I translation benchmarks, and the results show that the proposed model has advantages over previous methods in both strongly constrained and normally constrained tasks.  ( 2 min )
    Sedentary Behavior Estimation with Hip-worn Accelerometer Data: Segmentation, Classification and Thresholding. (arXiv:2207.01809v1 [cs.LG])
    Cohort studies are increasingly using accelerometers for physical activity and sedentary behavior estimation. These devices tend to be less error-prone than self-report, can capture activity throughout the day, and are economical. However, previous methods for estimating sedentary behavior based on hip-worn data are often invalid or suboptimal under free-living situations and subject-to-subject variation. In this paper, we propose a local Markov switching model that takes this situation into account, and introduce a general procedure for posture classification and sedentary behavior analysis that fits the model naturally. Our method features changepoint detection methods in time series and also a two-stage classification step that labels data into 3 classes (sitting, standing, stepping). Through a rigorous training-testing paradigm, we show that our approach achieves > 80% accuracy. In addition, our method is robust and easy to interpret.  ( 2 min )
    An Approximation Method for Fitted Random Forests. (arXiv:2207.02184v1 [stat.ML])
    Random Forests (RF) is a popular machine learning method for classification and regression problems. It involves a bagging application to decision tree models. One of the primary advantages of the Random Forests model is the reduction in the variance of the forecast. In large-scale applications of the model with millions of data points and hundreds of features, the size of the fitted objects can get very large and reach the limits on the available space in production setups, depending on the number and depth of the trees. This could be especially challenging when trained models need to be downloaded on-demand to small devices with limited memory. There is a need to approximate the trained RF models to significantly reduce the model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using the multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently, a generalized additive model (GAM) extension) to the output of each tree helps reduce the size while preserving the prediction quality.  ( 2 min )
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v2 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    CEN : Cooperatively Evolving Networks. (arXiv:2207.02192v1 [cs.LG])
    A finitely repeated game is a dynamic game in which a simultaneous game is played finitely many times. GANs contain two competing modules: the generator module is trained to generate new examples, and the discriminator module is trained to discriminate real examples from generated examples. The training procedure of a GAN is a finitely repeated game in which each module tries to optimize its error at every instance of the simultaneous game in a non-cooperative manner. We observe that we can achieve more accurate training if, at each instance of the simultaneous game, the stronger module cooperates with the weaker module and only the weaker module optimizes its error.
    TT-PINN: A Tensor-Compressed Neural PDE Solver for Edge Computing. (arXiv:2207.01751v1 [cs.LG])
    Physics-informed neural networks (PINNs) have been increasingly employed due to their capability of modeling complex physics systems. To achieve better expressiveness, increasingly larger network sizes are required in many problems. This has caused challenges when we need to train PINNs on edge devices with limited memory, computing and energy resources. To enable training PINNs on edge devices, this paper proposes an end-to-end compressed PINN based on Tensor-Train decomposition. In solving a Helmholtz equation, our proposed model significantly outperforms the original PINNs while using far fewer parameters, and achieves satisfactory prediction with up to 15$\times$ overall parameter reduction.
    NeuralPassthrough: Learned Real-Time View Synthesis for VR. (arXiv:2207.02186v1 [cs.CV])
    Virtual reality (VR) headsets provide an immersive, stereoscopic visual experience, but at the cost of blocking users from directly observing their physical environment. Passthrough techniques are intended to address this limitation by leveraging outward-facing cameras to reconstruct the images that would otherwise be seen by the user without the headset. This is inherently a real-time view synthesis challenge, since passthrough cameras cannot be physically co-located with the eyes. Existing passthrough techniques suffer from distracting reconstruction artifacts, largely due to the lack of accurate depth information (especially for near-field and disoccluded objects), and also exhibit limited image quality (e.g., being low resolution and monochromatic). In this paper, we propose the first learned passthrough method and assess its performance using a custom VR headset that contains a stereo pair of RGB cameras. Through both simulations and experiments, we demonstrate that our learned passthrough method delivers superior image quality compared to state-of-the-art methods, while meeting strict VR requirements for real-time, perspective-correct stereoscopic view synthesis over a wide field of view for desktop-connected headsets.
    Cross-Speaker Emotion Transfer for Low-Resource Text-to-Speech Using Non-Parallel Voice Conversion with Pitch-Shift Data Augmentation. (arXiv:2204.10020v2 [eess.AS] UPDATED)
    Data augmentation via voice conversion (VC) has been successfully applied to low-resource expressive text-to-speech (TTS) when only neutral data for the target speaker are available. Although the quality of VC is crucial for this approach, it is challenging to learn a stable VC model because the amount of data is limited in low-resource scenarios, and highly expressive speech has large acoustic variety. To address this issue, we propose a novel data augmentation method that combines pitch-shifting and VC techniques. Because pitch-shift data augmentation enables the coverage of a variety of pitch dynamics, it greatly stabilizes training for both VC and TTS models, even when only 1,000 utterances of the target speaker's neutral data are available. Subjective test results showed that a FastSpeech 2-based emotional TTS system with the proposed method improved naturalness and emotional similarity compared with conventional methods.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v3 [cs.LG] UPDATED)
    Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
    Automatic inspection of cultural monuments using deep and tensor-based learning on hyperspectral imagery. (arXiv:2207.02163v1 [cs.CV])
    In Cultural Heritage, hyperspectral images are commonly used since they provide extended information regarding the optical properties of materials. Thus, the processing of such high-dimensional data becomes challenging from the perspective of machine learning techniques to be applied. In this paper, we propose a Rank-$R$ tensor-based learning model to identify and classify material defects on Cultural Heritage monuments. In contrast to conventional deep learning approaches, the proposed high order tensor-based learning demonstrates greater accuracy and robustness against overfitting. Experimental results on real-world data from UNESCO protected areas indicate the superiority of the proposed scheme compared to conventional deep learning models.
    DBN-Mix: Training Dual Branch Network Using Bilateral Mixup Augmentation for Long-Tailed Visual Recognition. (arXiv:2207.02173v1 [cs.CV])
    There is a growing interest in the challenging visual perception task of learning from long-tailed class distributions. The extreme class imbalance in the training dataset biases the model to prefer to recognize majority-class data over minority-class data. Recently, the dual branch network (DBN) framework has been proposed, where two branch networks, the conventional branch and the re-balancing branch, are employed to improve the accuracy of long-tailed visual recognition. The re-balancing branch uses a reverse sampler to generate class-balanced training samples to mitigate bias due to class imbalance. Although this strategy has been quite successful in handling bias, using a reversed sampler for training can degrade the representation learning performance. To alleviate this issue, the conventional method used a carefully designed cumulative learning strategy, in which the influence of the re-balancing branch gradually increases throughout the entire training phase. In this study, we aim to develop a simple yet effective method to improve the performance of DBN without the cumulative learning strategy, which is difficult to optimize. We devise a simple data augmentation method termed bilateral mixup augmentation, which combines one sample from the uniform sampler with another sample from the reversed sampler to produce a training sample. Furthermore, we present class-conditional temperature scaling that mitigates bias toward the majority class for the proposed DBN architecture. Our experiments performed on widely used long-tailed visual recognition datasets show that bilateral mixup augmentation is quite effective in improving the representation learning performance of DBNs, and that the proposed method achieves state-of-the-art performance for some categories.
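    A sketch of the bilateral mixup idea as we read it from the abstract (the paper's exact formulation, e.g. how the two labels enter the loss, may differ):

        import torch

        def bilateral_mixup(x_uniform, y_uniform, x_reversed, y_reversed, alpha=1.0):
            # Mix one sample drawn by the uniform sampler with one drawn by the
            # reversed (class-balanced) sampler, with a Beta-distributed weight
            # as in standard mixup.
            lam = torch.distributions.Beta(alpha, alpha).sample()
            x = lam * x_uniform + (1 - lam) * x_reversed
            return x, y_uniform, y_reversed, lam   # train with lam-weighted losses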
    Convolutional Filtering and Neural Networks with Non Commutative Algebras. (arXiv:2108.09923v2 [cs.LG] UPDATED)
    In this paper we provide stability results for algebraic neural networks (AlgNNs) based on non commutative algebras. AlgNNs are stacked layered structures with each layer associated to an algebraic signal model (ASM) determined by an algebra, a vector space, and a homomorphism. Signals are modeled as elements of the vector space, filters are elements in the algebra, while the homomorphism provides a realization of the filters as concrete operators. We study the stability of the algebraic filters in non commutative algebras to perturbations on the homomorphisms, and we provide conditions under which stability is guaranteed. We show that the commutativity between shift operators and between shifts and perturbations does not affect the property of an architecture of being stable. This provides an answer to the question of whether shift invariance was a necessary attribute of convolutional architectures to guarantee stability. Additionally, we show that although the frequency responses of filters in non commutative algebras exhibit substantial differences with respect to filters in commutative algebras, their derivatives for stable filters have a similar behavior.
    Path Integral Stochastic Optimal Control for Sampling Transition Paths. (arXiv:2207.02149v1 [q-bio.BM])
    We consider the problem of Sampling Transition Paths. Given two metastable conformational states of a molecular system, e.g., a folded and unfolded protein, we aim to sample the most likely transition path between the two states. Sampling such a transition path is computationally expensive due to the existence of high free energy barriers between the two states. To circumvent this, previous work has focused on simplifying the trajectories to occur along specific molecular descriptors called Collective Variables (CVs). However, finding CVs is not trivial and requires chemical intuition. For larger molecules, where intuition is not sufficient, using these CV-based methods biases the transition along possibly irrelevant dimensions. Instead, this work proposes a method for sampling transition paths that considers the entire geometry of the molecules. To achieve this, we first relate the problem to recent work on the Schrödinger bridge problem and stochastic optimal control. Using this relation, we construct a method that takes into account important characteristics of molecular systems such as second-order dynamics and invariance to rotations and translations. We demonstrate our method on the commonly studied Alanine Dipeptide, but also consider larger proteins such as Polyproline and Chignolin.
    A survey of multimodal deep generative models. (arXiv:2207.02127v1 [cs.LG])
    Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; however, achieving this requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models in which distributions are parameterized by deep neural networks, have attracted much attention, especially variational autoencoders, which are suitable for accomplishing the above challenges because they can consider heterogeneity and infer good representations of data. Therefore, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.
    Discovering Quantum Phase Transitions with Fermionic Neural Networks. (arXiv:2202.05183v3 [physics.comp-ph] UPDATED)
    Deep neural networks have been extremely successful as highly accurate wave function ansätze for variational Monte Carlo calculations of molecular ground states. We present an extension of one such ansatz, FermiNet, to calculations of the ground states of periodic Hamiltonians, and study the homogeneous electron gas. FermiNet calculations of the ground-state energies of small electron gas systems are in excellent agreement with previous initiator full configuration interaction quantum Monte Carlo and diffusion Monte Carlo calculations. We investigate the spin-polarized homogeneous electron gas and demonstrate that the same neural network architecture is capable of accurately representing both the delocalized Fermi liquid state and the localized Wigner crystal state. The network is given no a priori knowledge that a phase transition exists, but converges on the translationally invariant ground state at high density and spontaneously breaks the symmetry to produce the crystalline ground state at low density.
    A Safe Semi-supervised Graph Convolution Network. (arXiv:2207.01960v1 [cs.LG])
    In the semi-supervised learning field, Graph Convolution Network (GCN), as a variant model of GNN, has achieved promising results for non-Euclidean data by introducing convolution into GNN. However, GCN and its variant models fail to safely use the information of risky unlabeled data, which will degrade the performance of semi-supervised learning. Therefore, we propose a Safe GCN framework (Safe-GCN) to improve the learning performance. In the Safe-GCN, we design an iterative process to label the unlabeled data. In each iteration, a GCN and its supervised version (S-GCN) are learned to find the unlabeled data with high confidence. The high-confidence unlabeled data and their pseudo-labels are then added to the label set. Finally, both the added unlabeled data and the labeled ones are used to train an S-GCN, which achieves safe exploration of the risky unlabeled data and enables safe use of large amounts of unlabeled data. The performance of Safe-GCN is evaluated on three well-known citation network datasets and the obtained results demonstrate the effectiveness of the proposed framework over several graph-based semi-supervised learning methods.
    Rethinking Attention-Model Explainability through Faithfulness Violation Test. (arXiv:2201.12114v3 [cs.LG] UPDATED)
    Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, in this paper, we find one critical limitation in attention explanations: weakness in identifying the polarity of feature impact. This could be misleading -- features with higher attention weights may not faithfully contribute to model predictions; instead, they can impose suppression effects. With this finding, we reflect on the explainability of current attention-based techniques, such as Attention$\odot$Gradient and LRP-based attention explanations. We first propose an actionable diagnostic methodology (henceforth the faithfulness violation test) to measure the consistency between explanation weights and the impact polarity. Through extensive experiments, we then show that most tested explanation methods are unexpectedly hindered by the faithfulness violation issue, especially raw attention. Empirical analyses of the factors affecting violation issues further provide useful observations for adopting explanation methods in attention models.
    A Cross-City Federated Transfer Learning Framework: A Case Study on Urban Region Profiling. (arXiv:2206.00007v2 [cs.LG] UPDATED)
    The data insufficiency problem (i.e., data missing and label scarcity issues) caused by inadequate services and infrastructures or unbalanced development levels of cities has seriously affected urban computing tasks in real scenarios. Prior transfer learning methods inspire an elegant solution to the data insufficiency, but are only concerned with one kind of insufficiency issue and fail to give consideration to both sides. In addition, most previous cross-city transfer methods overlook inter-city data privacy, which is a public concern in practical applications. To address the above challenging problems, we propose a novel Cross-city Federated Transfer Learning framework (CcFTL) to cope with the data insufficiency and privacy problems. Concretely, CcFTL transfers the relational knowledge from multiple rich-data source cities to the target city. Besides, the model parameters specific to the target task are first trained on the source data and then fine-tuned to the target city by parameter transfer. With our adaptation of federated training and homomorphic encryption settings, CcFTL can effectively deal with the data privacy problem among cities. We take urban region profiling as an application of smart cities and evaluate the proposed method with a real-world study. The experiments demonstrate the notable superiority of our framework over several competitive state-of-the-art models.
    PRoA: A Probabilistic Robustness Assessment against Functional Perturbations. (arXiv:2207.02036v1 [cs.LG])
    In safety-critical deep learning applications, robustness measurement is a vital pre-deployment phase. However, existing robustness verification methods are not sufficiently practical for deploying machine learning systems in the real world. On the one hand, these methods attempt to claim that no perturbations can "fool" deep neural networks (DNNs), which may be too stringent in practice. On the other hand, existing works rigorously consider $L_p$ bounded additive perturbations on the pixel space, although perturbations such as colour shifting and geometric transformations occur more practically and frequently in the real world. Thus, from the practical standpoint, we present a novel and general probabilistic robustness assessment method (PRoA) based on adaptive concentration, which can measure the robustness of deep learning models against functional perturbations. PRoA can provide statistical guarantees on the probabilistic robustness of a model, i.e., the probability of failure encountered by the trained model after deployment. Our experiments demonstrate the effectiveness and flexibility of PRoA in terms of evaluating the probabilistic robustness against a broad range of functional perturbations, and PRoA scales well to various large-scale deep neural networks compared to existing state-of-the-art baselines. For the purpose of reproducibility, we release our tool on GitHub: https://github.com/TrustAI/PRoA
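    For intuition only, here is a plain Monte Carlo stand-in (ours) for the kind of guarantee PRoA provides; the paper's adaptive-concentration method is more sample-efficient, but a fixed-sample Hoeffding bound already gives a statistical upper bound on the failure probability:

        import math
        import torch

        def failure_probability_bound(model, x, label, perturb, n=1000, delta=0.05):
            # Estimate the probability that a random functional perturbation
            # flips the prediction, plus a two-sided Hoeffding confidence width.
            with torch.no_grad():
                batch = torch.stack([perturb(x) for _ in range(n)])
                fails = (model(batch).argmax(dim=1) != label).float().mean().item()
            width = math.sqrt(math.log(2 / delta) / (2 * n))
            return fails, fails + width   # point estimate, (1 - delta) upper bound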
    FedTune: Automatic Tuning of Federated Learning Hyper-Parameters from System Perspective. (arXiv:2110.03061v5 [cs.LG] UPDATED)
    Federated learning (FL) hyper-parameters significantly affect the training overheads in terms of computation time, transmission time, computation load, and transmission load. However, the current practice of manually selecting FL hyper-parameters puts a high burden on FL practitioners since various applications prefer different training preferences. In this paper, we propose FedTune, an automatic FL hyper-parameter tuning algorithm tailored to applications' diverse system requirements of FL training. FedTune is lightweight and flexible, achieving 8.48%-26.75% improvement for different datasets compared to fixed FL hyper-parameters.
    FedChain: Chained Algorithms for Near-Optimal Communication Cost in Federated Learning. (arXiv:2108.06869v4 [cs.LG] UPDATED)
    Federated learning (FL) aims to minimize the communication complexity of training a model over heterogeneous data distributed across many clients. A common approach is local methods, where clients take multiple optimization steps over local data before communicating with the server (e.g., FedAvg). Local methods can exploit similarity between clients' data. However, in existing analyses, this comes at the cost of slow convergence in terms of the dependence on the number of communication rounds R. On the other hand, global methods, where clients simply return a gradient vector in each round (e.g., SGD), converge faster in terms of R but fail to exploit the similarity between clients even when clients are homogeneous. We propose FedChain, an algorithmic framework that combines the strengths of local methods and global methods to achieve fast convergence in terms of R while leveraging the similarity between clients. Using FedChain, we instantiate algorithms that improve upon previously known rates in the general convex and PL settings, and are near-optimal (via an algorithm-independent lower bound that we show) for problems that satisfy strong convexity. Empirical results support this theoretical gain over existing methods.
    Evaluation of Semantic Answer Similarity Metrics. (arXiv:2206.12664v2 [cs.CL] UPDATED)
    There are several issues with the existing general machine translation and natural language generation evaluation metrics, and question-answering (QA) systems are no different in that context. To build robust QA systems, we need the ability to have equivalently robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics, as opposed to pure string overlap, is important to compare models fairly and to indicate more realistic acceptance criteria in real-life applications. We build upon the first paper, to our knowledge, that uses transformer-based model metrics to assess semantic answer similarity, and achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
    The optimal reservoir computer for nonlinear dynamics. (arXiv:2202.05159v2 [cs.LG] UPDATED)
    Analysis and prediction of real-world complex systems of nonlinear dynamics relies largely on surrogate models. Reservoir computers (RC) have proven useful in replicating the climate of chaotic dynamics. The quality of surrogate models based on RCs is crucially dependent on a judiciously determined optimal implementation that involves selecting an optimal reservoir topology and hyperparameters. By systematically applying Bayesian hyperparameter optimization and using ensembles of reservoirs of various topologies, we show that the topology of linked reservoirs has no significance in forecasting the dynamics of the chaotic Lorenz system. By simulations we show that simple reservoirs of unconnected nodes outperform reservoirs of linked nodes as surrogate models for the Lorenz system in different regimes. We give a derivation of why reservoirs of unconnected nodes have the maximum entropy and hence are optimal. We conclude that the performance of an RC is based on mere functional transformation, not on its dynamical properties, as has been generally presumed. Hence, RCs could be improved by including information on dynamics more strongly in the model.
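    A reservoir of unconnected nodes is just an echo-state network with a diagonal recurrent matrix; a minimal sketch (ours, with illustrative hyperparameters):

        import numpy as np

        def unconnected_reservoir_states(inputs, n_nodes=200, leak=0.5, seed=0):
            # Diagonal "recurrence": each node evolves independently, driven
            # only by the shared scalar input; fit a linear readout on states.
            rng = np.random.default_rng(seed)
            w_in = rng.uniform(-1, 1, n_nodes)
            w_self = rng.uniform(-1, 1, n_nodes)
            states = np.zeros((len(inputs), n_nodes))
            h = np.zeros(n_nodes)
            for t, u in enumerate(inputs):
                h = (1 - leak) * h + leak * np.tanh(w_self * h + w_in * u)
                states[t] = h
            return states

        states = unconnected_reservoir_states(np.sin(np.linspace(0, 20, 500)))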
    Data-Dependent Randomized Smoothing. (arXiv:2012.04351v4 [cs.LG] UPDATED)
    Randomized smoothing is a recent technique that achieves state-of-the-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyperparameters, independent of the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at each input so as to maximize the certification radius of the resulting smooth classifier. Since the data-dependent classifier does not directly enjoy sound certification with existing approaches, we propose a memory-enhanced data-dependent smooth classifier that is certifiable by construction. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data-dependent framework can be seamlessly incorporated into three randomized smoothing approaches, leading to consistently improved certified accuracy. When this framework is used in the training routine of these approaches, followed by data-dependent certification, we achieve 9% and 6% improvements over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet, respectively.
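    The core certification computation is easy to sketch. Below is a minimal Monte Carlo version of the standard Gaussian smoothing radius, with a hypothetical per-input grid search over sigma standing in for the paper's per-input optimization; the memory-enhanced construction that makes the data-dependent choice sound is omitted here.

        import numpy as np
        from scipy.stats import norm

        def certified_radius(f, x, sigma, n=1000, num_classes=10):
            # Monte Carlo estimate of the top-class probability under Gaussian noise,
            # then the standard smoothing radius R = sigma * Phi^{-1}(pA).
            # f is a classifier mapping a flat input array to an integer label.
            noisy = x[None, :] + sigma * np.random.randn(n, x.size)
            counts = np.bincount([f(z) for z in noisy], minlength=num_classes)
            pA = min(counts.max() / n, 1 - 1e-6)
            return sigma * norm.ppf(pA) if pA > 0.5 else 0.0

        def best_sigma(f, x, grid=(0.12, 0.25, 0.5, 1.0)):
            # Data-dependent choice: per input, pick the sigma that maximizes the radius.
            return max(grid, key=lambda s: certified_radius(f, x, s))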
    No-Regret Learning in Partially-Informed Auctions. (arXiv:2202.10606v2 [cs.LG] UPDATED)
    Auctions with partially-revealed information about items are broadly employed in real-world applications, but the underlying mechanisms have limited theoretical support. In this work, we study a machine learning formulation of these types of mechanisms, presenting algorithms that are no-regret from the buyer's perspective. Specifically, a buyer who wishes to maximize his utility interacts repeatedly with a platform over a series of $T$ rounds. In each round, a new item is drawn from an unknown distribution and the platform publishes a price together with incomplete, "masked" information about the item. The buyer then decides whether to purchase the item. We formalize this problem as an online learning task where the goal is to have low regret with respect to a myopic oracle that has perfect knowledge of the distribution over items and the seller's masking function. When the distribution over items is known to the buyer and the mask is a SimHash function mapping $\mathbb{R}^d$ to $\{0,1\}^{\ell}$, our algorithm has regret $\tilde O((Td\ell)^{1/2})$. In a fully agnostic setting when the mask is an arbitrary function mapping to a set of size $n$ and the prices are stochastic, our algorithm has regret $\tilde O((Tn)^{1/2})$.
    On the Nash equilibrium of moment-matching GANs for stationary Gaussian processes. (arXiv:2203.07136v2 [stat.ML] UPDATED)
    Generative Adversarial Networks (GANs) learn an implicit generative model from data samples through a two-player game. In this paper, we study the existence of Nash equilibrium of the game which is consistent as the number of data samples grows to infinity. In a realizable setting where the goal is to estimate the ground-truth generator of a stationary Gaussian process, we show that the existence of consistent Nash equilibrium depends crucially on the choice of the discriminator family. The discriminator defined from second-order statistical moments can result in non-existence of Nash equilibrium, existence of consistent non-Nash equilibrium, or existence and uniqueness of consistent Nash equilibrium, depending on whether symmetry properties of the generator family are respected. We further study the local stability and global convergence of gradient descent-ascent methods towards consistent equilibrium.
    A Generative Framework for Personalized Learning and Estimation: Theory, Algorithms, and Privacy. (arXiv:2207.01771v1 [cs.LG])
    A distinguishing characteristic of federated learning is that the (local) client data can be statistically heterogeneous. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained through collaboration. Various personalization methods have been proposed in the literature, with seemingly very different forms, ranging from the use of a single global model for local regularization and model interpolation, to the use of multiple global models for personalized clustering. In this work, we begin with a generative framework that could potentially unify several different algorithms as well as suggest new ones. We apply our generative framework to personalized estimation and connect it to the classical empirical Bayes methodology. We develop private personalized estimation under this framework. We then use our generative framework for learning, which unifies several known personalized FL algorithms and also suggests new ones; we propose and study a new algorithm, AdaPeD, based on knowledge distillation, which numerically outperforms several known algorithms. We also develop privacy for personalized learning methods with guarantees for user-level privacy and composition. We numerically evaluate the performance as well as the privacy for both the estimation and learning problems, demonstrating the advantages of our proposed methods.
    Lane-GNN: Integrating GNN for Predicting Drivers' Lane Change Intention. (arXiv:2207.00824v2 [cs.LG] UPDATED)
    Nowadays, intelligent highway traffic networks play an important role in modern transportation infrastructure. A variable speed limit (VSL) system can be deployed in a highway traffic network to provide useful, dynamic speed limit information so that drivers can travel with enhanced safety. Such a system is usually designed with a steady advisory speed in mind, so that traffic moves smoothly when drivers follow the speed, rather than speeding up whenever there is a gap and slowing down at congestion. However, little attention has been given to vehicles' behaviour after drivers leave the road network governed by a VSL system, which may involve unexpected acceleration, deceleration and frequent lane changes, resulting in chaos for subsequent highway road users. In this paper, we focus on detecting traffic flow anomalies due to drivers' lane change intention on highway traffic networks after a VSL system. More specifically, we apply graph modelling to traffic flow data generated by a popular mobility simulator, SUMO, at the road segment level. We then evaluate the performance of lane change detection using the proposed Lane-GNN scheme, an attention temporal graph convolutional neural network, and compare it with a temporal convolutional neural network (TCNN) baseline. Our experimental results show that the proposed Lane-GNN can detect drivers' lane change intention within 90 seconds with an accuracy of 99.42% under certain assumptions. Finally, some interpretation methods are applied to the trained models to further illustrate our findings.
    Minimax Estimation of Linear Functions of Eigenvectors in the Face of Small Eigen-Gaps. (arXiv:2104.03298v2 [math.ST] UPDATED)
    Eigenvector perturbation analysis plays a vital role in various data science applications. A large body of prior works, however, focused on establishing $\ell_{2}$ eigenvector perturbation bounds, which are often highly inadequate in addressing tasks that rely on fine-grained behavior of an eigenvector. This paper makes progress on this by studying the perturbation of linear functions of an unknown eigenvector. Focusing on two fundamental problems -- matrix denoising and principal component analysis -- in the presence of Gaussian noise, we develop a suite of statistical theory that characterizes the perturbation of arbitrary linear functions of an unknown eigenvector. In order to mitigate a non-negligible bias issue inherent to the natural ``plug-in'' estimator, we develop de-biased estimators that (1) achieve minimax lower bounds for a family of scenarios (modulo some logarithmic factor), and (2) can be computed in a data-driven manner without sample splitting. Noteworthily, the proposed estimators are nearly minimax optimal even when the associated eigen-gap is {\em substantially smaller} than what is required in prior statistical theory.
    Neural Network Gaussian Processes by Increasing Depth. (arXiv:2108.12862v3 [cs.LG] UPDATED)
    Recent years have witnessed an increasing interest in the correspondence between infinitely wide networks and Gaussian processes. Despite the effectiveness and elegance of the current neural network Gaussian process theory, to the best of our knowledge, all the neural network Gaussian processes are essentially induced by increasing width. However, in the era of deep learning, what concerns us more regarding a neural network is its depth as well as how depth impacts the behaviors of a network. Inspired by a width-depth symmetry consideration, we use a shortcut network to show that increasing the depth of a neural network can also give rise to a Gaussian process, which is a valuable addition to the existing theory and contributes to revealing the true picture of deep learning. Beyond the proposed Gaussian process by depth, we theoretically characterize its uniform tightness property and the smallest eigenvalue of the Gaussian process kernel. These characterizations can not only enhance our understanding of the proposed depth-induced Gaussian process but also pave the way for future applications. Lastly, we examine the performance of the proposed Gaussian process by regression experiments on two benchmark data sets.
    Progressive Subsampling for Oversampled Data -- Application to Quantitative MRI. (arXiv:2203.09268v3 [eess.IV] UPDATED)
    We present PROSUB: PROgressive SUBsampling, a deep learning based, automated methodology that subsamples an oversampled data set (e.g., multi-channel 3D images) with minimal loss of information. We build upon a recent dual-network approach that won the MICCAI MUlti-DIffusion (MUDI) quantitative MRI measurement sampling-reconstruction challenge but suffers from training instability because it subsamples with a hard decision boundary. PROSUB uses the paradigm of recursive feature elimination (RFE) and progressively subsamples measurements during deep learning training, improving optimization stability. PROSUB also integrates a neural architecture search (NAS) paradigm, allowing the network architecture hyperparameters to respond to the subsampling process. We show that PROSUB outperforms the winner of the MUDI MICCAI challenge, producing large (>18%) MSE improvements on the MUDI challenge sub-tasks and qualitative improvements on downstream processes useful for clinical applications. We also show the benefits of incorporating NAS and analyze the effect of PROSUB's components. As our method generalizes to other problems beyond MRI measurement selection-reconstruction, our code is available at https://github.com/sbb-gh/PROSUB
    Learning Optimal Transport Between two Empirical Distributions with Normalizing Flows. (arXiv:2207.01246v2 [cs.LG] UPDATED)
    Optimal transport (OT) provides effective tools for comparing and mapping probability measures. We propose to leverage the flexibility of neural networks to learn an approximate optimal transport map. More precisely, we present a new and original method for transporting a finite set of samples associated with a first unknown underlying distribution towards another finite set of samples drawn from another unknown distribution. We show that a particular instance of invertible neural networks, namely normalizing flows, can be used to approximate the solution of this OT problem between a pair of empirical distributions. To this aim, we propose to relax the Monge formulation of OT by replacing the equality constraint on the push-forward measure with the minimization of the corresponding Wasserstein distance. The push-forward operator to be retrieved is then restricted to be a normalizing flow, which is trained by optimizing the resulting cost function. This approach allows the transport map to be discretized as a composition of functions. Each of these functions is associated with one sub-flow of the network, whose output provides intermediate steps of the transport between the original and target measures. This discretization also yields a set of intermediate barycenters between the two measures of interest. Experiments conducted on toy examples, as well as on a challenging unsupervised translation task, demonstrate the benefits of the proposed method. Finally, some experiments show that the proposed approach leads to a good approximation of the true OT map.
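    A minimal PyTorch sketch of the relaxed objective: an invertible coupling flow is trained to minimize a transport cost plus a discrepancy between the pushed-forward samples and the target samples. Here the energy distance is used as a simple stand-in for the Wasserstein term, and all sizes and weights are illustrative rather than the paper's settings.

        import torch
        import torch.nn as nn

        class Coupling(nn.Module):
            # RealNVP-style affine coupling layer: invertible by construction.
            def __init__(self, dim, hidden=64, flip=False):
                super().__init__()
                self.flip = flip
                self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, dim))
            def forward(self, x):
                a, b = x.chunk(2, dim=-1)
                if self.flip:
                    a, b = b, a
                s, t = self.net(a).chunk(2, dim=-1)
                b = b * torch.exp(torch.tanh(s)) + t     # affine transform of one half
                return torch.cat((b, a) if self.flip else (a, b), dim=-1)

        def energy_distance(x, y):
            # Simple sample-based discrepancy standing in for the Wasserstein term.
            d = lambda u, v: torch.cdist(u, v).mean()
            return 2 * d(x, y) - d(x, x) - d(y, y)

        x = torch.randn(512, 2)                          # source samples
        y = torch.randn(512, 2) * 0.5 + 3.0              # target samples
        flow = nn.Sequential(Coupling(2), Coupling(2, flip=True), Coupling(2))
        opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
        for _ in range(2000):
            opt.zero_grad()
            z = flow(x)
            # transport cost + relaxed push-forward constraint
            loss = ((z - x) ** 2).sum(-1).mean() + 10.0 * energy_distance(z, y)
            loss.backward()
            opt.step()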
    A Deep Learning Approach for the solution of Probability Density Evolution of Stochastic Systems. (arXiv:2207.01907v1 [cs.LG])
    Derivation of the probability density evolution provides invaluable insight into the behavior of many stochastic systems and their performance. However, for most real-time applications, numerical determination of the probability density evolution is a formidable task. The latter is due to the required temporal and spatial discretization schemes that render most computational solutions prohibitive and impractical. In this respect, the development of an efficient computational surrogate model is of paramount importance. Recent studies on physics-constrained networks show that a suitable surrogate can be achieved by encoding the physical insight into a deep neural network. To this aim, the present work introduces DeepPDEM, which utilizes the concept of physics-informed networks to solve the evolution of the probability density via a deep learning method. DeepPDEM learns the General Density Evolution Equation (GDEE) of stochastic structures. This approach paves the way for a mesh-free learning method that can solve the density evolution problem without prior simulation data. Moreover, it can also serve as an efficient surrogate for the solution at any other spatiotemporal points within optimization schemes or real-time applications. To demonstrate the potential applicability of the proposed framework, two network architectures with different activation functions as well as two optimizers are investigated. Numerical implementation on three different problems verifies the accuracy and efficacy of the proposed method.
    DiffML: End-to-end Differentiable ML Pipelines. (arXiv:2207.01269v2 [cs.DB] UPDATED)
    In this paper, we present our vision of differentiable ML pipelines, called DiffML, to automate the construction of ML pipelines in an end-to-end fashion. The idea is that DiffML allows one to jointly train not just the ML model itself but also the entire pipeline, including data preprocessing steps such as data cleaning and feature selection. Our core idea is to formulate all pipeline steps in a differentiable way such that the entire pipeline can be trained using backpropagation. However, this is a non-trivial problem and opens up many new research questions. To show the feasibility of this direction, we demonstrate initial ideas and a general principle of how typical preprocessing steps such as data cleaning, feature selection and dataset selection can be formulated as differentiable programs and jointly learned with the ML model. Moreover, we discuss a research roadmap and core challenges that have to be systematically tackled to enable fully differentiable ML pipelines.
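    As a toy illustration of the principle, feature selection can be relaxed into a differentiable gate trained jointly with the model. The sketch below uses hypothetical sizes and a sigmoid relaxation; the actual DiffML formulations may differ.

        import torch
        import torch.nn as nn

        class SoftFeatureSelect(nn.Module):
            # Feature selection relaxed into a learnable per-feature gate so the
            # whole pipeline can be trained end-to-end with backpropagation.
            def __init__(self, n_features):
                super().__init__()
                self.logits = nn.Parameter(torch.zeros(n_features))
            def forward(self, x):
                return x * torch.sigmoid(self.logits)

        n, d = 256, 20
        X = torch.randn(n, d)
        y = (X[:, 0] - 2 * X[:, 3] > 0).float()      # only features 0 and 3 matter

        pipeline = nn.Sequential(SoftFeatureSelect(d), nn.Linear(d, 1))
        opt = torch.optim.Adam(pipeline.parameters(), lr=0.05)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(300):
            opt.zero_grad()
            sparsity = torch.sigmoid(pipeline[0].logits).sum()   # encourages few open gates
            loss = bce(pipeline(X).squeeze(-1), y) + 1e-2 * sparsity
            loss.backward()
            opt.step()
        # After training, the gates for the informative features stay near 1.
        print(torch.sigmoid(pipeline[0].logits).detach().numpy().round(2))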
    An adaptive music generation architecture for games based on the deep learning Transformer model. (arXiv:2207.01698v1 [cs.SD])
    This paper presents an architecture for generating music for video games based on the Transformer deep learning model. The system generates music in several layers, following the standard layering strategy currently used by composers designing video game music. The music is adaptive to the psychological context of the player, according to the arousal-valence model. Our motivation is to customize music to the tastes of the player, who can select a preferred style of music through a set of training examples. We discuss current limitations and prospects for the future, such as collaborative and interactive control of the musical components.
    A Probabilistic State Space Model for Joint Inference from Differential Equations and Data. (arXiv:2103.10153v3 [stat.ML] UPDATED)
    Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v2 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    Features Based Adaptive Augmentation for Graph Contrastive Learning. (arXiv:2207.01792v1 [cs.LG])
    Self-supervised learning aims to eliminate the need for expensive annotation in graph representation learning, where graph contrastive learning (GCL) is trained with self-supervision signals containing data-data pairs. These data-data pairs are generated by augmentation employing stochastic functions on the original graph. We argue that some features can be more critical than others depending on the downstream task, and that applying a stochastic function uniformly will corrupt the influential features, leading to diminished accuracy. To fix this issue, we introduce a Feature Based Adaptive Augmentation (FebAA) approach, which identifies and preserves potentially influential features and corrupts the remaining ones. We implement FebAA as a plug-and-play layer and use it with the state-of-the-art Deep Graph Contrastive Learning (GRACE) and Bootstrapped Graph Latents (BGRL) methods. We successfully improve the accuracy of GRACE and BGRL on eight graph representation learning benchmark datasets.
    A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems. (arXiv:2010.15768v2 [math.OC] UPDATED)
    The nonconvex-concave min-max problem arises in many machine learning applications, including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solving this problem is the gradient descent-ascent (GDA) algorithm, which unfortunately can exhibit oscillation in the nonconvex case. In this paper, we introduce a "smoothing" scheme which can be combined with GDA to stabilize the oscillation and ensure convergence to a stationary solution. We prove that the stabilized GDA algorithm can achieve an $O(1/\epsilon^2)$ iteration complexity for minimizing the pointwise maximum of a finite collection of nonconvex functions. Moreover, the smoothed GDA algorithm achieves an $O(1/\epsilon^4)$ iteration complexity for general nonconvex-concave problems. Extensions of this stabilized GDA algorithm to multi-block cases are presented. To the best of our knowledge, this is the first algorithm to achieve $O(1/\epsilon^2)$ for a class of nonconvex-concave problems. We illustrate the practical efficiency of the stabilized GDA algorithm on robust training.
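    The smoothing idea can be sketched in a few lines: descend on the primal variable of an auxiliary objective that adds a proximal term towards a slowly moving point z, ascend on the dual variable, and drag z behind the iterates. The toy nonconvex-concave objective and the step sizes below are illustrative only.

        import torch

        def f(x, y):                       # toy nonconvex-concave objective
            return (y * torch.sin(x)).sum() - 0.5 * (y ** 2).sum()

        x = torch.randn(5, requires_grad=True)
        y = torch.zeros(5, requires_grad=True)
        z = x.detach().clone()             # auxiliary smoothing sequence
        c, a, beta, p = 0.05, 0.05, 0.1, 2.0

        for _ in range(2000):
            # smoothed objective K(x, y; z) = f(x, y) + (p/2) ||x - z||^2
            K = f(x, y) + 0.5 * p * ((x - z) ** 2).sum()
            gx, gy = torch.autograd.grad(K, (x, y))
            with torch.no_grad():
                x -= c * gx                # gradient descent on x
                y += a * gy                # gradient ascent on y
                z += beta * (x - z)        # z trails x, which provides the smoothing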
    Explainability in Deep Reinforcement Learning, a Review into Current Methods and Applications. (arXiv:2207.01911v1 [cs.LG])
    The use of Deep Reinforcement Learning (DRL) schemes has increased dramatically since their first introduction in 2015. Though uses in many different applications are being found, DRL schemes still suffer from a lack of interpretability. This has bred a lack of understanding and trust in the use of DRL solutions among researchers and the general public. To solve this problem, the field of explainable artificial intelligence (XAI) has emerged, offering a variety of methods that aim to open the DRL black boxes, ranging from interpretable symbolic decision trees to numerical methods like Shapley values. This review looks at which methods are being used, and in which applications, in order to identify which models are best suited to each application and whether any methods are being underutilised.
    UniCR: Universally Approximated Certified Robustness via Randomized Smoothing. (arXiv:2207.02152v1 [cs.LG])
    We study certified robustness of machine learning classifiers against adversarial perturbations. In particular, we propose the first universally approximated certified robustness (UniCR) framework, which can approximate the robustness certification of any input on any classifier against any $\ell_p$ perturbation, with noise generated by any continuous probability distribution. Compared with state-of-the-art certified defenses, UniCR provides many significant benefits: (1) the first universal robustness certification framework covering all four of the above "any"s; (2) automatic robustness certification that avoids case-by-case analysis; (3) tightness validation of certified robustness; and (4) optimality validation of the noise distributions used by randomized smoothing. We conduct extensive experiments to validate the above benefits of UniCR and its advantages over state-of-the-art certified defenses against $\ell_p$ perturbations.
    Compactness Score: A Fast Filter Method for Unsupervised Feature Selection. (arXiv:2201.13194v2 [cs.LG] UPDATED)
    With the flourishing of the information age, massive amounts of data are generated every day. Due to the large-scale and high-dimensional characteristics of these data, it is often difficult to achieve good decision-making in practical applications, so efficient big data analytics methods are urgently needed. In feature engineering, feature selection is an important research topic, aiming to select "excellent" features from a candidate set. Feature selection serves several purposes, such as dimensionality reduction and improvements in model effectiveness and performance. In many classification tasks, researchers have found that data from the same class tend to lie close to each other; thus, local compactness is of great importance in evaluating a feature. In this manuscript, we propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features. To demonstrate its efficiency and accuracy, extensive experiments are performed on several datasets, and the effectiveness and superiority of our method are further revealed through clustering tasks. Performance is indicated by several well-known evaluation metrics, while efficiency is reflected in the corresponding running time. As revealed by the simulation results, our proposed algorithm is more accurate and efficient than existing algorithms.
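    The paper's exact scoring function is not reproduced here, but a hypothetical distance-based filter score in the same spirit is easy to sketch: rank each feature by how tightly the k nearest neighbours of every sample agree on that feature's value.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def compactness_scores(X, k=5):
            # Hypothetical filter score in the spirit of local compactness:
            # for each feature, measure how much each sample's k nearest
            # neighbours (found in the full space) spread around that
            # feature's value; a smaller spread means a locally more
            # compact, and hence more desirable, feature.
            nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
            _, idx = nbrs.kneighbors(X)             # idx[:, 0] is the point itself
            neigh = X[idx[:, 1:]]                   # shape (n, k, d)
            spread = np.abs(neigh - X[:, None, :]).sum(axis=1)  # per sample, per feature
            return spread.mean(axis=0)              # lower score = more compact feature

        # Usage sketch: keep the m features with the smallest scores.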
    Bayesian NVH metamodels to assess interior cabin noise using measurement databases. (arXiv:2207.02120v1 [stat.AP])
    In recent years, great emphasis has been put on engineering the acoustic signature of vehicles, which represents the overall comfort level for passengers. Due to the highly uncertain behavior of production cars, probabilistic metamodels or surrogates can be useful for estimating NVH dispersion and assessing different NVH risks. These metamodels follow physical behaviors and can serve as design space exploration tools during the early-stage design process to support NVH optimization. The measurement databases cover different noise paths such as aerodynamic noise (wind-tunnel tests), tire-pavement interaction noise (rolling noise), and noise due to electric motors (whining noise). This work proposes a global NVH metamodeling technique for broadband noises, such as aerodynamic and rolling noises, exploiting a Bayesian framework that takes into account prior (domain-expert) knowledge about complex physical mechanisms. Generalized additive models (GAMs) with polynomial and Gaussian basis functions are used to model the dependency of sound pressure level (SPL) on predictor variables. Moreover, a parametric bootstrap algorithm based on the data-generating mechanism, using point estimates, is used to estimate the dispersion in the unknown parameters. Probabilistic modelling is carried out using the open-source library PyMC3, which provides the No-U-Turn Sampler (NUTS), and the developed models are validated using cross-validation.
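    A minimal PyMC3 sketch of such a GAM follows, assuming a single illustrative predictor and made-up data; the basis centres, widths and priors are placeholders, not the paper's choices.

        import numpy as np
        import pymc3 as pm

        # Hypothetical predictor (e.g., vehicle speed) and measured SPL values.
        speed = np.linspace(60, 160, 40)
        spl = 55 + 0.15 * speed + np.random.normal(0, 1.0, speed.size)

        centers = np.linspace(60, 160, 6)            # Gaussian basis centres (assumption)
        basis = np.exp(-0.5 * ((speed[:, None] - centers) / 15.0) ** 2)

        with pm.Model() as gam:
            beta0 = pm.Normal("beta0", 0, 10)                 # intercept
            beta1 = pm.Normal("beta1", 0, 1)                  # polynomial (linear) term
            w = pm.Normal("w", 0, 1, shape=centers.size)      # Gaussian-basis weights
            sigma = pm.HalfNormal("sigma", 5)
            mu = beta0 + beta1 * speed + pm.math.dot(basis, w)
            pm.Normal("obs", mu=mu, sigma=sigma, observed=spl)
            trace = pm.sample(1000, tune=1000, target_accept=0.9)  # NUTS by default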
    Near out-of-distribution detection for low-resolution radar micro-Doppler signatures. (arXiv:2205.07869v2 [eess.SP] UPDATED)
    Near out-of-distribution detection (OODD) aims at discriminating semantically similar data points without the supervision required for classification. This paper puts forward an OODD use case for radar targets detection extensible to other kinds of sensors and detection scenarios. We emphasize the relevance of OODD and its specific supervision requirements for the detection of a multimodal, diverse targets class among other similar radar targets and clutter in real-life critical systems. We propose a comparison of deep and non-deep OODD methods on simulated low-resolution pulse radar micro-Doppler signatures, considering both a spectral and a covariance matrix input representation. The covariance representation aims at estimating whether dedicated second-order processing is appropriate to discriminate signatures. The potential contributions of labeled anomalies in training, self-supervised learning, contrastive learning insights and innovative training losses are discussed, and the impact of training set contamination caused by mislabelling is investigated.
    Conflicting Interactions Among Protection Mechanisms for Machine Learning Models. (arXiv:2207.01991v1 [cs.LG])
    Nowadays, systems based on machine learning (ML) are widely used in different domains. Given their popularity, ML models have become targets for various attacks. As a result, research at the intersection of security, privacy, and ML has flourished. The research community has been exploring attack vectors and potential mitigations separately; however, practitioners will likely need to deploy defences against several threats simultaneously, and a solution that is optimal for one concern may interact negatively with solutions intended to address other concerns. In this work, we explore the potential for conflicting interactions between different solutions that enhance the security and privacy of ML-based systems. We focus on model and data ownership, exploring how ownership verification techniques interact with other ML security/privacy techniques such as differentially private training and robustness against model evasion. We provide a framework and conduct a systematic analysis of pairwise interactions. We show that many pairs are incompatible. Where possible, we provide relaxations to the hyperparameters or the techniques themselves that allow for simultaneous deployment. Lastly, we discuss the implications and provide guidelines for future work.
    Towards trustworthy Energy Disaggregation: A review of challenges, methods and perspectives for Non-Intrusive Load Monitoring. (arXiv:2207.02009v1 [cs.LG])
    Non-intrusive load monitoring (NILM) is the task of disaggregating total power consumption into its individual sub-components. Over the years, signal processing and machine learning algorithms have been combined to achieve this, and many publications and extensive research works have sought to bring state-of-the-art methods to the desired performance. The scientific community's initial interest in formulating and mathematically describing the NILM problem using machine learning tools has now shifted towards a more practical NILM. We are now in a mature NILM period, in which NILM is being applied in real-life scenarios; thus, algorithmic complexity, transferability, reliability, practicality and general trustworthiness are the main issues of interest. This review narrows the gap between the early, immature NILM era and the mature one. In particular, the paper provides a comprehensive literature review of NILM methods for residential appliances only. It analyzes, summarizes and presents the outcomes of a large number of recently published scholarly articles, discusses the highlights of these methods, and introduces the research dilemmas that researchers should consider when applying NILM methods. Finally, we show the need for transferring traditional disaggregation models into a practical and trustworthy framework.
    PLATINUM: Semi-Supervised Model Agnostic Meta-Learning using Submodular Mutual Information. (arXiv:2201.12928v2 [cs.LG] UPDATED)
    Few-shot classification (FSC) requires training models using a few (typically one to five) data points per class. Meta-learning has proven able to learn a parametrized model for FSC by training on various other classification tasks. In this work, we propose PLATINUM (semi-suPervised modeL Agnostic meTa-learnIng usiNg sUbmodular Mutual information), a novel semi-supervised model-agnostic meta-learning framework that uses submodular mutual information (SMI) functions to boost the performance of FSC. PLATINUM leverages unlabeled data in the inner and outer loops using SMI functions during meta-training and obtains richer meta-learned parameterizations for meta-test. We study the performance of PLATINUM in two scenarios: 1) where the unlabeled data points belong to the same set of classes as the labeled set of a certain episode, and 2) where there exist out-of-distribution classes that do not belong to the labeled set. We evaluate our method on various settings of the miniImageNet, tieredImageNet and Fewshot-CIFAR100 datasets. Our experiments show that PLATINUM outperforms MAML and semi-supervised approaches like pseudo-labeling for semi-supervised FSC, especially for small ratios of labeled examples per class.
    On the Efficiency of Subclass Knowledge Distillation in Classification Tasks. (arXiv:2109.05587v3 [cs.LG] UPDATED)
    This work introduces a novel knowledge distillation framework for classification tasks where information on existing subclasses is available and taken into consideration. In classification tasks with a small number of classes or binary detection (two classes), the amount of information transferred from the teacher to the student network is restricted, limiting the utility of knowledge distillation. Performance can be improved by leveraging information about possible subclasses within the available classes. To that end, we propose the so-called Subclass Knowledge Distillation (SKD) framework, which transfers the subclasses' prediction knowledge from a large teacher model to a smaller student. Through SKD, additional meaningful information that is not in the teacher's class logits but exists in the subclasses (e.g., similarities within classes) is conveyed to the student, boosting its performance. Mathematically, we measure how many extra bits of information the teacher can provide to the student via the SKD framework. The framework is evaluated in a clinical application, namely colorectal polyp binary classification, in which clinician-provided annotations are used to define subclasses based on the annotation labels' variability in a curriculum style of learning. A lightweight, low-complexity student trained with the proposed framework achieves an F1-score of 85.05%, a gain of 2.14% and 1.49% over a student trained without and with conventional knowledge distillation, respectively. These results show that the extra subclass knowledge (i.e., 0.4656 label bits per training sample in our experiment) can provide more information about the teacher's generalization, and therefore SKD can benefit from using more information to increase the student's performance.
    Multimodal Frame-Scoring Transformer for Video Summarization. (arXiv:2207.01814v1 [cs.LG])
    As the amount of video content has mushroomed in recent years, automatic video summarization has become useful when we want to just peek at the content of a video. However, there are two underlying limitations in the generic video summarization task. First, most previous approaches read in only visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator and multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework that exploits visual, text and audio features and scores a video with respect to its frames. The MFST framework first extracts each modality's features (visual-text-audio) using pretrained encoders. It then trains a multimodal frame-scoring transformer that uses the video-text-audio representations as inputs and predicts frame-level scores. Our extensive experiments against previous models, and ablation studies, on the TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method.
    Deriving Surface Resistivity from Polarimetric SAR Data Using Dual-Input UNet. (arXiv:2207.01811v1 [physics.geo-ph])
    Traditional survey methods for finding surface resistivity are time-consuming and labor-intensive, and very few studies have focused on finding resistivity/conductivity using remote sensing data and deep learning techniques. In this line of work, we assessed the correlation between surface resistivity and Synthetic Aperture Radar (SAR) data by applying various deep learning methods, and tested our hypothesis in the Coso Geothermal Area, USA. For detecting resistivity, L-band full-polarimetric SAR data acquired by UAVSAR were used, and MT (magnetotellurics) inverted resistivity data of the area served as the ground truth. We conducted experiments comparing various deep learning architectures and suggest the use of a Dual Input UNet (DI-UNet) architecture. DI-UNet uses a deep learning architecture to predict resistivity from full-polarimetric SAR data, promising a quick complement to the traditional survey method. Our proposed approach achieved improved outcomes for mapping MT resistivity from SAR data.
    Defending against the Label-flipping Attack in Federated Learning. (arXiv:2207.01982v1 [cs.CR])
    Federated learning (FL) provides autonomy and privacy by design to participating peers, who cooperatively build a machine learning (ML) model while keeping their private data on their devices. However, that same autonomy opens the door for malicious peers to poison the model by conducting either untargeted or targeted poisoning attacks. The label-flipping (LF) attack is a targeted poisoning attack in which attackers poison their training data by flipping the labels of some examples from one class (i.e., the source class) to another (i.e., the target class). Unfortunately, this attack is easy to perform and hard to detect, and it negatively impacts the performance of the global model. Existing defenses against LF are limited by assumptions on the distribution of the peers' data and/or do not perform well with high-dimensional models. In this paper, we deeply investigate the LF attack behavior and find that the contradicting objectives of attackers and honest peers on the source class examples are reflected in the parameter gradients corresponding to the neurons of the source and target classes in the output layer, making those gradients good discriminative features for attack detection. Accordingly, we propose a novel defense that first dynamically extracts those gradients from the peers' local updates, then clusters the extracted gradients, analyzes the resulting clusters, and filters out potential bad updates before model aggregation. Extensive empirical analysis on three data sets shows the proposed defense's effectiveness against the LF attack regardless of the data distribution or model dimensionality. The proposed defense also outperforms several state-of-the-art defenses by offering lower test error, higher overall accuracy, higher source class accuracy, lower attack success rate, and higher stability of the source class accuracy.
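    A hypothetical sketch of the gradient-clustering step follows; the variable names and the majority-cluster heuristic are illustrative, and the paper's analysis of the resulting clusters is more involved.

        import numpy as np
        from sklearn.cluster import KMeans

        def filter_updates(updates, src, tgt):
            # `updates` is a list of per-peer output-layer gradient matrices,
            # one row per class neuron. Keep only the rows for the source and
            # target class neurons and cluster the peers into two groups.
            feats = np.stack([np.concatenate([u[src], u[tgt]]) for u in updates])
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
            # Heuristic: treat the larger cluster as honest (assumes attackers
            # are a minority of the peers).
            honest = max(set(labels), key=list(labels).count)
            return [u for u, l in zip(updates, labels) if l == honest]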
    Disentangling private classes through regularization. (arXiv:2207.02000v1 [cs.LG])
    Deep learning models are nowadays broadly deployed to solve an incredibly large variety of tasks, yet little attention has been devoted to the connected legal aspects. In 2016, the European Union approved the General Data Protection Regulation (GDPR), which entered into force in 2018. Its main rationale was to protect the privacy and data of EU citizens against the workings of the so-called "Data Economy". As data is the fuel of modern Artificial Intelligence, it is argued that the GDPR can be partly applied to a series of algorithmic decision-making tasks before a more structured AI regulation enters into force. In the meantime, AI should not allow undesired information leakage deviating from the purpose for which it was created. In this work we propose DisP, an approach that disentangles the information related to classes we wish to keep private from the data processed by deep learning models. In particular, DisP is a regularization strategy that de-correlates the features belonging to the same private class at training time, hiding the information about private-class membership. Our experiments on state-of-the-art deep learning models show the effectiveness of DisP, minimizing the risk of extraction for the classes we wish to keep private.
    Ask-AC: An Initiative Advisor-in-the-Loop Actor-Critic Framework. (arXiv:2207.01955v1 [cs.LG])
    Despite the promising results achieved, state-of-the-art interactive reinforcement learning schemes rely on passively receiving supervision signals from advisor experts, in the form of either continuous monitoring or pre-defined rules, which inevitably results in a cumbersome and expensive learning process. In this paper, we introduce a novel initiative advisor-in-the-loop actor-critic framework, termed Ask-AC, that replaces the unilateral advisor-guidance mechanism with a bidirectional learner-initiative one, thereby enabling a customized and efficacious message exchange between learner and advisor. At the heart of Ask-AC are two complementary components, an action requester and an adaptive state selector, that can be readily incorporated into various discrete actor-critic architectures. The former allows the agent to proactively seek advisor intervention in the presence of uncertain states, while the latter identifies the unstable states potentially missed by the former, especially when the environment changes, and then learns to promote the ask action on such states. Experimental results on both stationary and non-stationary environments, and across different actor-critic backbones, demonstrate that the proposed framework significantly improves the learning efficiency of the agent and achieves performance on par with that obtained by continuous advisor monitoring.
    Network Support for High-performance Distributed Machine Learning. (arXiv:2102.03394v2 [cs.NI] UPDATED)
    The traditional approach to distributed machine learning is to adapt learning algorithms to the network, e.g., reducing updates to curb overhead. Networks based on intelligent edge, instead, make it possible to follow the opposite approach, i.e., to define the logical network topology around the learning task to perform, so as to meet the desired learning performance. In this paper, we propose a system model that captures such aspects in the context of supervised machine learning, accounting for both learning nodes (that perform computations) and information nodes (that provide data). We then formulate the problem of selecting (i) which learning and information nodes should cooperate to complete the learning task, and (ii) the number of iterations to perform, in order to minimize the learning cost while meeting the target prediction error and execution time. After proving important properties of the above problem, we devise an algorithm, named DoubleClimb, that can find a $(1+1/|I|)$-competitive solution (with $I$ being the set of information nodes), with cubic worst-case complexity. Our performance evaluation, leveraging a real-world network topology and considering both classification and regression tasks, also shows that DoubleClimb closely matches the optimum, outperforming state-of-the-art alternatives.
    VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees. (arXiv:2112.00334v3 [cs.LG] UPDATED)
    Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forest and adaptive boosting, reduces as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. We evaluated the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study. The evaluation revealed that most users managed to successfully use our system to explore decision rules visually, performing the proposed tasks and answering the given questions in a satisfying way.
    Image Amodal Completion: A Survey. (arXiv:2207.02062v1 [cs.CV])
    Existing computer vision systems can compete with humans in understanding the visible parts of objects, but still fall far short of humans when it comes to depicting the invisible parts of partially occluded objects. Image amodal completion aims to equip computers with human-like amodal completion functions to understand an intact object despite it being partially occluded. The main purpose of this survey is to provide an intuitive understanding of the research hotspots, key technologies and future trends in the field of image amodal completion. Firstly, we present a comprehensive review of the latest literature in this emerging field, exploring three key tasks in image amodal completion, including amodal shape completion, amodal appearance completion, and order perception. Then we examine popular datasets related to image amodal completion along with their common data collection methods and evaluation metrics. Finally, we discuss real-world applications and future research directions for image amodal completion, facilitating the reader's understanding of the challenges of existing technologies and upcoming research trends.
    Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality. (arXiv:2207.02119v1 [cs.CV])
    Inserting an SVD meta-layer into neural networks is prone to making the covariance ill-conditioned, which can harm the model's training stability and generalization abilities. In this paper, we systematically study how to improve covariance conditioning by enforcing orthogonality on the Pre-SVD layer. Existing orthogonal treatments of the weights are first investigated; these techniques can improve the conditioning but hurt performance. To avoid this side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve covariance conditioning and generalization. Moreover, combinations with orthogonal weights can further boost performance.
    One-Shot Transfer Learning of Physics-Informed Neural Networks. (arXiv:2110.11286v2 [cs.LG] UPDATED)
    Solving differential equations efficiently and accurately sits at the heart of progress in many areas of scientific research, from classical dynamical systems to quantum mechanics. There is a surge of interest in using Physics-Informed Neural Networks (PINNs) to tackle such problems, as they provide numerous benefits over traditional numerical approaches. Despite their potential for solving differential equations, however, transfer learning has been underexplored. In this study, we present a general framework for transfer learning with PINNs that results in one-shot inference for linear systems of both ordinary and partial differential equations. This means that highly accurate solutions to many unknown differential equations can be obtained instantaneously without retraining an entire network. We demonstrate the efficacy of the proposed deep learning approach by solving several real-world problems, such as first- and second-order linear ordinary differential equations, the Poisson equation, and the time-dependent Schrodinger complex-valued partial differential equation.
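    For a linear ODE, the one-shot idea reduces to linear algebra: freeze the hidden features and re-solve only the output layer in closed form for each new right-hand side. Below is a minimal sketch for u'(x) = f(x), u(0) = u0, with random tanh features; this illustrates the principle under stated assumptions and is not the paper's implementation.

        import numpy as np

        rng = np.random.default_rng(0)
        M, n = 50, 200
        a, b = rng.normal(size=M), rng.normal(size=M)   # frozen hidden-layer parameters
        x = np.linspace(0, 2 * np.pi, n)

        H = np.tanh(np.outer(x, a) + b)                 # features phi_j(x)
        dH = (1 - H ** 2) * a                           # analytic derivatives phi_j'(x)

        def one_shot_solve(f_vals, u0):
            # Stack the ODE residual rows and one boundary-condition row,
            # then solve for the output-layer weights by least squares.
            A = np.vstack([dH, np.tanh(0 * a + b)[None, :]])
            rhs = np.concatenate([f_vals, [u0]])
            w, *_ = np.linalg.lstsq(A, rhs, rcond=None)
            return H @ w                                # approximate solution u(x)

        u = one_shot_solve(np.cos(x), u0=0.0)           # should approximate sin(x)
        print(np.max(np.abs(u - np.sin(x))))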
    A Boosting Algorithm for Positive-Unlabeled Learning. (arXiv:2205.09485v2 [cs.LG] UPDATED)
    Positive-unlabeled (PU) learning deals with binary classification problems when only positive (P) and unlabeled (U) data are available. Many PU methods based on linear models and neural networks have been proposed; however, there is still a lack of study on how theoretically sound boosting-style algorithms can work with P and U data. Considering that in some scenarios neural networks cannot perform as well as boosting algorithms, even with fully supervised data, we propose a novel boosting algorithm for PU learning, Ada-PU, and compare it against neural networks. Ada-PU follows the general procedure of AdaBoost while two different distributions of P data are maintained and updated. After a weak classifier is learned on the newly updated distribution, the corresponding combining weight for the final ensemble is estimated using only PU data. We demonstrate that, with a smaller set of base classifiers, the proposed method is guaranteed to keep the theoretical properties of boosting algorithms. In experiments, we show that Ada-PU outperforms neural networks on benchmark PU datasets. We also study UNSW-NB15, a real-world dataset in cyber security, and demonstrate that Ada-PU has superior performance for malicious activity detection.
    Degree-Based Random Walk Approach for Graph Embedding. (arXiv:2110.13627v2 [cs.SI] UPDATED)
    Graph embedding, which represents local and global neighborhood information by numerical vectors, is a crucial part of the mathematical modeling of a wide range of real-world systems. Among embedding algorithms, random walk-based algorithms have proven to be very successful. These algorithms collect information by creating numerous random walks with a predefined number of steps. Creating random walks is the most demanding part of the embedding process, and the computational demand increases with the size of the network. Moreover, for real-world networks, considering all nodes on the same footing, the abundance of low-degree nodes creates an imbalanced-data problem. In this work, a computationally less intensive and node-connectivity-aware uniform sampling method is proposed, in which the number of random walks created for a node is proportional to its degree. The advantages of the proposed algorithm become more pronounced when it is applied to large graphs. A comparative study using two networks, CORA and CiteSeer, is presented. Compared with the fixed-number-of-walks case, the proposed method requires 50% less computational effort to reach the same accuracy for node classification and link prediction calculations.
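    The degree-proportional sampling is straightforward to sketch with networkx; the walk length and the per-degree multiplier below are illustrative parameters.

        import random
        import networkx as nx

        def degree_proportional_walks(G, walk_len=10, walks_per_degree=1):
            # Create a number of walks per node proportional to its degree,
            # instead of a fixed number for every node.
            walks = []
            for node in G.nodes():
                n_walks = max(1, walks_per_degree * G.degree(node))
                for _ in range(n_walks):
                    walk, cur = [node], node
                    for _ in range(walk_len - 1):
                        nbrs = list(G.neighbors(cur))
                        if not nbrs:
                            break
                        cur = random.choice(nbrs)
                        walk.append(cur)
                    walks.append(walk)
            return walks

        G = nx.karate_club_graph()
        print(len(degree_proportional_walks(G)))   # more walks start at hub nodes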
    "Even if ..." -- Diverse Semifactual Explanations of Reject. (arXiv:2207.01898v1 [cs.LG])
    Machine learning based decision-making systems applied in safety-critical areas require reliable, high-certainty predictions. For this purpose, a system can be extended with a reject option, which allows it to reject inputs where only a prediction with unacceptably low certainty would be possible. While being able to reject uncertain samples is important, it is also important to be able to explain why a particular sample was rejected. With the ongoing rise of eXplainable AI (XAI), many explanation methodologies for machine learning based systems have been developed; explaining reject options, however, is still a novel field where only very little prior work exists. In this work, we propose to explain rejects by semifactual explanations, an instance of example-based explanation methods, which themselves have not yet been widely considered in the XAI community. We propose a conceptual modeling of semifactual explanations for arbitrary reject options and empirically evaluate a specific implementation on a conformal-prediction-based reject option.
    An Empirical Study of Language Model Integration for Transducer based Speech Recognition. (arXiv:2203.16776v3 [eess.AS] UPDATED)
    Utilizing text-only data with an external language model (ELM) in end-to-end RNN-Transducer (RNN-T) speech recognition is challenging. Recently, a class of methods such as density ratio (DR) and internal language model estimation (ILME) has been developed, outperforming the classic shallow fusion (SF) method. The basic idea behind these methods is that the RNN-T posterior should first subtract the implicitly learned internal language model (ILM) prior in order to integrate the ELM. While recent studies suggest that RNN-T only learns some low-order language model information, the DR method uses a well-trained neural language model with full context, which may be inappropriate for estimating the ILM and may deteriorate the integration performance. Based on the DR method, we propose a low-order density ratio method (LODR) that replaces this estimate with a low-order weak language model. Extensive empirical experiments are conducted on both in-domain and cross-domain scenarios with the English LibriSpeech & Tedlium-2 and Chinese WenetSpeech & AISHELL-1 datasets. LODR consistently outperforms SF in all tasks, while performing generally close to ILME and better than DR in most tests.
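    At decoding time, these integration schemes differ only in how per-token scores are combined. A minimal sketch of a LODR-style combination follows; the weights are illustrative tuning parameters, not values from the paper.

        def lodr_score(rnnt_logp, elm_logp, lowlm_logp, lam_elm=0.6, lam_ilm=0.4):
            """Hypothetical per-token score combination in the density-ratio family.

            rnnt_logp  : log-probability from the RNN-T
            elm_logp   : log-probability from the external LM
            lowlm_logp : log-probability from a low-order (e.g. bigram) LM
                         standing in for the internal LM, as in LODR
            """
            return rnnt_logp + lam_elm * elm_logp - lam_ilm * lowlm_logp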
    Bayesian approaches for Quantifying Clinicians' Variability in Medical Image Quantification. (arXiv:2207.01868v1 [eess.IV])
    Medical imaging, including MRI, CT, and ultrasound, plays a vital role in clinical decisions. Accurate segmentation is essential for measuring the structure of interest from an image. However, manual segmentation is highly operator-dependent, which leads to high inter- and intra-rater variability in quantitative measurements. In this paper, we explore the feasibility of Bayesian predictive distributions parameterized by deep neural networks capturing clinicians' inter- and intra-rater variability. By exploring and analyzing recently emerged approximate inference schemes, we evaluate whether approximate Bayesian deep learning, with a posterior over segmentations, can learn inter- and intra-rater variability in both segmentation and clinical measurements. Experiments are performed with two different imaging modalities: MRI and ultrasound. We empirically demonstrate that Bayesian predictive distributions parameterized by deep neural networks can approximate clinicians' inter- and intra-rater variability, offering a new perspective on quantitative medical image analysis by providing clinical measurement uncertainty.
    Application of multilayer perceptron with data augmentation in nuclear physics. (arXiv:2205.07953v2 [cs.LG] UPDATED)
    Neural networks have become popular in many fields of science, serving as promising, reliable and powerful tools. In this work, we study the effect of data augmentation on the predictive power of neural network models for nuclear physics data. We present two different data augmentation techniques and conduct a detailed analysis in terms of different depths, optimizers, activation functions and random seed values to show the success and robustness of the model. Using experimental uncertainties for data augmentation for the first time, the size of the training data set is artificially boosted, and the changes in the root-mean-square error between the model predictions on the test set and the experimental data are investigated. Our results show that data augmentation decreases the prediction errors, stabilizes the model and prevents overfitting. The extrapolation capabilities of the MLP models are also tested for newly measured nuclei in the AME2020 mass table, and it is shown that the predictions are significantly improved by data augmentation.
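    The augmentation idea is straightforward to sketch: replicate each training example and resample its target within the reported experimental uncertainty. A minimal NumPy version with hypothetical variable names, assuming Gaussian uncertainties:

        import numpy as np

        def augment_with_uncertainties(X, y, y_err, n_copies=5, seed=0):
            # Boost the training set by resampling each target within its
            # reported experimental uncertainty (assumed Gaussian here).
            rng = np.random.default_rng(seed)
            X_aug = np.repeat(X, n_copies, axis=0)
            y_aug = np.repeat(y, n_copies) + rng.normal(0.0, np.repeat(y_err, n_copies))
            return X_aug, y_aug

        # Usage sketch: X holds (Z, N) proton/neutron numbers, y binding energies,
        # y_err their experimental uncertainties (illustrative names only).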
    Fidelity of Ensemble Aggregation for Saliency Map Explanations using Bayesian Optimization Techniques. (arXiv:2207.01565v2 [cs.CV] UPDATED)
    In recent years, an abundance of feature attribution methods for explaining neural networks have been developed. In the field of computer vision especially, many methods exist for generating saliency maps that provide pixel attributions. However, their explanations often contradict each other, and it is not clear which explanation to trust. A natural solution to this problem is the aggregation of multiple explanations. We present and compare different pixel-based aggregation schemes with the goal of generating a new explanation whose fidelity to the model's decision is higher than that of each individual explanation. Using methods from the field of Bayesian optimization, we incorporate the variance between the individual explanations into the aggregation process. Additionally, we analyze the effect of multiple normalization techniques on ensemble aggregation.
    Resource Allocation in Multicore Elastic Optical Networks: A Deep Reinforcement Learning Approach. (arXiv:2207.02074v1 [cs.LG])
    A deep reinforcement learning approach is applied, for the first time, to solve the routing, modulation, spectrum and core allocation (RMSCA) problem in dynamic multicore fiber elastic optical networks (MCF-EONs). To do so, a new environment, compatible with OpenAI's Gym, was designed and implemented to emulate the operation of MCF-EONs. The new environment processes the agent's actions (selection of route, core and spectrum slot) by considering the network state and physical-layer-related aspects. The latter include the available modulation formats and their reach, and the inter-core crosstalk (XT), an MCF-related impairment. If the resulting signal quality is acceptable, the environment allocates the resources selected by the agent. After processing the agent's action, the environment gives the agent a numerical reward and information about the new network state. The blocking performance of four different agents was compared through simulation to three baseline heuristics used in MCF-EONs. Results obtained for the NSFNet and COST239 network topologies show that the best-performing agent achieves, on average, up to a four-fold decrease in blocking probability relative to the best-performing baseline heuristic.
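    A Gym-compatible environment for this kind of allocation task can be skeletonized as below. This is a hypothetical minimal version using the pre-0.26 gym API; the real environment's state, reward shaping and physical-layer checks (modulation reach, XT) are far richer and are reduced here to a single availability test.

        import gym
        import numpy as np
        from gym import spaces

        class ToyEONEnv(gym.Env):
            """Hypothetical minimal skeleton of an MCF-EON allocation environment.

            The action selects a (route, core, slot) combination; the state is
            the slot-occupancy grid.
            """
            def __init__(self, n_routes=3, n_cores=4, n_slots=10):
                super().__init__()
                self.shape = (n_routes, n_cores, n_slots)
                self.action_space = spaces.Discrete(int(np.prod(self.shape)))
                self.observation_space = spaces.MultiBinary(int(np.prod(self.shape)))
                self.reset()

            def reset(self):
                self.grid = np.zeros(self.shape, dtype=np.int8)
                return self.grid.flatten()

            def step(self, action):
                idx = np.unravel_index(action, self.shape)
                if self.grid[idx] == 0:          # resource free: allocate, reward +1
                    self.grid[idx] = 1
                    reward = 1.0
                else:                            # blocked request: penalize
                    reward = -1.0
                return self.grid.flatten(), reward, False, {}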
    Content Addressable Memory Without Catastrophic Forgetting by Heteroassociation with a Fixed Scaffold. (arXiv:2202.00159v3 [cs.AI] UPDATED)
    Content-addressable memory (CAM) networks, so-called because stored items can be recalled by partial or corrupted versions of the items, exhibit near-perfect recall of a small number of information-dense patterns below capacity and a 'memory cliff' beyond, such that inserting a single additional pattern results in catastrophic loss of all stored patterns. We propose a novel CAM architecture, Memory Scaffold with Heteroassociation (MESH), that factorizes the problems of internal attractor dynamics and association with external content to generate a CAM continuum without a memory cliff: Small numbers of patterns are stored with complete information recovery matching standard CAMs, while inserting more patterns still results in partial recall of every pattern, with a graceful trade-off between pattern number and pattern richness. Motivated by the architecture of the Entorhinal-Hippocampal memory circuit in the brain, MESH is a tripartite architecture with pairwise interactions that uses a predetermined set of internally stabilized states together with heteroassociation between the internal states and arbitrary external patterns. We show analytically and experimentally that for any number of stored patterns, MESH nearly saturates the total information bound (given by the number of synapses) for CAM networks, outperforming all existing CAM models.
    QuPeD: Quantized Personalization via Distillation with Applications to Federated Learning. (arXiv:2107.13892v2 [cs.LG] UPDATED)
    Traditionally, federated learning (FL) aims to train a single global model while collaboratively using multiple clients and a server. Two natural challenges that FL algorithms face are heterogeneity in data across clients and collaboration of clients with diverse resources. In this work, we introduce a quantized and personalized FL algorithm QuPeD that facilitates collective (personalized model compression) training via knowledge distillation (KD) among clients who have access to heterogeneous data and resources. For personalization, we allow clients to learn compressed personalized models with different quantization parameters and model dimensions/structures. Towards this, first we propose an algorithm for learning quantized models through a relaxed optimization problem, where quantization values are also optimized over. When each client participating in the (federated) learning process has different requirements for the compressed model (both in model dimension and precision), we formulate a compressed personalization framework by introducing knowledge distillation loss for local client objectives collaborating through a global model. We develop an alternating proximal gradient update for solving this compressed personalization problem, and analyze its convergence properties. Numerically, we validate that QuPeD outperforms competing personalized FL methods, FedAvg, and local training of clients in various heterogeneous settings.
    Local Multi-Label Explanations for Random Forest. (arXiv:2207.01994v1 [cs.LG])
    Multi-label classification is a challenging task, particularly in domains where the number of labels to be predicted is large. Deep neural networks are often effective at multi-label classification of images and textual data. When dealing with tabular data, however, conventional machine learning algorithms, such as tree ensembles, appear to outperform the competition. Random forest, being a popular ensemble algorithm, has found use in a wide range of real-world problems. Such problems include fraud detection in the financial domain, crime hotspot detection in the legal sector, and, in the biomedical field, disease probability prediction when patient records are accessible. Since they have an impact on people's lives, these domains usually require decision-making systems to be explainable. Random forest falls short on this property, especially when a large number of tree predictors are used. This issue was addressed in recent research named LionForests, for single-label classification and regression. In this work, we adapt this technique to multi-label classification problems, by employing three different strategies regarding the labels that the explanation covers. Finally, we provide a set of qualitative and quantitative experiments to assess the efficacy of this approach.
    Deterministic Decoupling of Global Features and its Application to Data Analysis. (arXiv:2207.02132v1 [cs.LG])
    We introduce a method for deterministic decoupling of global features and show its applicability to improve data analysis performance, as well as to open new avenues for feature transfer. We propose a new formalism that is based on defining transformations on submanifolds, by following trajectories along the features' gradients. Through these transformations we define a normalization that, we demonstrate, allows for decoupling differentiable features. By applying this to sample moments, we obtain a quasi-analytic solution for the orthokurtosis, a normalized version of the kurtosis that is decoupled not just from mean and variance, but also from skewness. We apply this method in the original data domain and at the output of a filter bank to regression and classification problems based on global descriptors, obtaining a consistent and significant improvement in performance as compared to using classical (non-decoupled) descriptors.  ( 2 min )
    Modeling and Correcting Bias in Sequential Evaluation. (arXiv:2205.01607v2 [stat.ML] UPDATED)
    We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm outperforms the de facto method of using the rankings induced by the reported scores.
    Entity Linking in Tabular Data Needs the Right Attention. (arXiv:2207.01937v1 [cs.CL])
    Understanding the semantic meaning of tabular data requires Entity Linking (EL), in order to associate each cell value to a real-world entity in a Knowledge Base (KB). In this work, we focus on end-to-end solutions for EL on tabular data that do not rely on fact lookup in the target KB. Tabular data contains heterogeneous and sparse context, including column headers, cell values and table captions. We experiment with various models to generate a vector representation for each cell value to be linked. Our results show that it is critical to apply an attention mechanism as well as an attention mask, so that the model can only attend to the most relevant context and avoid information dilution. The most relevant context includes: same-row cells, same-column cells, headers and caption. Computational complexity, however, grows quadratically with the size of tabular data for such a complex model. We achieve constant memory usage by introducing a Tabular Entity Linking Lite model (TELL) that generates vector representation for a cell based only on its value, the table headers and the table caption. TELL achieves 80.8% accuracy on Wikipedia tables, which is only 0.1% lower than the state-of-the-art model with quadratic memory usage.  ( 2 min )
    A Causal Approach for Business Optimization: Application on an Online Marketplace. (arXiv:2207.01722v1 [cs.LG])
    A common sales strategy involves having account executives (AEs) actively reach out and contact potential customers. However, not all contact attempts have a positive effect: some attempts do not change customer decisions, while others might even interfere with the desired outcome. In this work we propose using causal inference to estimate the effect of contacting each potential customer and setting the contact policy accordingly. We demonstrate this approach on data from Worthy.com, an online jewelry marketplace. We examined the Worthy business process to identify relevant decisions and outcomes, and formalized assumptions on how they were made. Using causal tools, we selected a decision point where improving AE contact activity appeared to be promising. We then generated a personalized policy and recommended reaching out only to customers for whom it would be beneficial. Finally, we validated the results in an A/B test over a 3-month period, resulting in a 22% increase in the item delivery rate of the targeted population (p-value=0.026). This policy is now being used on an ongoing basis.
    Insights into the origin of halo mass profiles from machine learning. (arXiv:2205.04474v2 [astro-ph.CO] UPDATED)
    The mass distribution of dark matter haloes is the result of the hierarchical growth of initial density perturbations through mass accretion and mergers. We use an interpretable machine-learning framework to provide physical insights into the origin of the spherically-averaged mass profile of dark matter haloes. We train a gradient-boosted-trees algorithm to predict the final mass profiles of cluster-sized haloes, and measure the importance of the different inputs provided to the algorithm. We find two primary scales in the initial conditions (ICs) that impact the final mass profile: the density at approximately the scale of the haloes' Lagrangian patch $R_L$ ($R\sim 0.7\, R_L$) and that in the large-scale environment ($R\sim 1.7~R_L$). The model also identifies three primary time-scales in the halo assembly history that affect the final profile: (i) the formation time of the virialized, collapsed material inside the halo, (ii) the dynamical time, which captures the dynamically unrelaxed, infalling component of the halo over its first orbit, (iii) a third, most recent time-scale, which captures the impact on the outer profile of recent massive merger events. While the inner profile retains memory of the ICs, this information alone is insufficient to yield accurate predictions for the outer profile. As we add information about the haloes' mass accretion history, we find a significant improvement in the predicted profiles at all radii. Our machine-learning framework provides novel insights into the role of the ICs and the mass assembly history in determining the final mass profile of cluster-sized haloes.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v1 [cs.LG])
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature of the environment and the lack of supervision make the problem challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
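    One standard building block for this setting is black-box label-shift estimation, which recovers the current label distribution from unlabeled predictions via a confusion matrix; a hedged sketch follows (the paper's unbiased risk estimator and online ensembles sit on top of estimates like this, not this exact construction).

        import numpy as np

        def estimate_label_dist(C, pred_hist):
            # C[i, j] = P(predict i | true j), measured on labeled offline data.
            # pred_hist[i] = frequency of predicted class i on the online stream.
            # Under label shift, pred_hist = C @ q, so solve for q and project.
            q = np.linalg.solve(C, pred_hist)
            q = np.clip(q, 0.0, None)
            return q / q.sum()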
    Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization. (arXiv:2207.01789v1 [math.OC])
    We consider minimizing a twice-differentiable, $L$-smooth, and $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=\phi(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a constant factor with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which is known to be sharp if $\phi$ is allowed to be nonsmooth and/or non-strongly convex, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $\phi$ is almost perfectly conditioned, with a condition number of $L/\mu<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.  ( 3 min )
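    A minimal sketch of the factorized approach for the quadratic objective phi(M) = 0.5*||M - A||_F^2, with the search rank overparameterized past the true rank; the problem sizes and step size are arbitrary assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        n, r_star, r = 30, 2, 5                    # true rank 2, search rank 5
        U = rng.normal(size=(n, r_star))
        A = U @ U.T                                # PSD target of rank r_star

        X = rng.normal(size=(n, r))                # factor variable, M = X @ X.T
        lr = 0.002
        for _ in range(5000):
            grad_phi = X @ X.T - A                 # gradient of phi at M
            X -= lr * 2.0 * grad_phi @ X           # chain rule: d f / d X
        print("residual:", np.linalg.norm(X @ X.T - A))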
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v4 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling GANs' training scheme, are subject to ill-defined discriminator gradients. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of GANs' training dynamics. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.  ( 3 min )
    Approximating Discontinuous Nash Equilibrial Values of Two-Player General-Sum Differential Games. (arXiv:2207.01773v1 [cs.LG])
    Finding Nash equilibrial policies for two-player differential games requires solving Hamilton-Jacobi-Isaacs PDEs. Recent studies achieved success in circumventing the curse of dimensionality in solving such PDEs with underlying applications to human-robot interactions (HRI), by adopting self-supervised (physics-informed) neural networks as universal value approximators. This paper extends previous SOTA work on zero-sum games with continuous values to general-sum games with discontinuous values, where the discontinuity is inherited from the players' losses. We show that due to its lack of convergence proof and generalization analysis on discontinuous losses, the existing self-supervised learning technique fails to generalize and raises safety concerns in an autonomous driving application. Our solution is to first pre-train the value network on supervised Nash equilibria, and then refine it by minimizing a loss that combines the supervised data with the PDE and boundary conditions. Importantly, the demonstrated advantage of the proposed learning method against purely supervised and self-supervised approaches requires careful choice of the neural activation function: among relu, sin, and tanh, we show that tanh is the only choice that achieves optimal generalization and safety performance. Our conjecture is that tanh (similar to sin) allows continuity of the value and its gradient, which is sufficient for the convergence of learning, and at the same time is expressive enough (similar to relu) at approximating discontinuous value landscapes. Lastly, we apply our method to approximating control policies for an incomplete-information interaction and demonstrate its contribution to safe interactions.  ( 3 min )
    Correlation between entropy and generalizability in a neural network. (arXiv:2207.01996v1 [cond-mat.stat-mech])
    Although neural networks can solve very complex machine-learning problems, the theoretical reason for their generalizability is still not fully understood. Here we use the Wang-Landau Monte Carlo algorithm to calculate the entropy (logarithm of the volume of a part of the parameter space) at a given test accuracy, and a given training loss function value or training accuracy. Our results show that entropic forces help generalizability. Although our study is on a very simple application of neural networks (a spiral dataset and a small, fully-connected neural network), our approach should be useful in explaining the generalizability of more complicated neural networks in future works.  ( 2 min )
    Unsupervised Crowdsourcing with Accuracy and Cost Guarantees. (arXiv:2207.01988v1 [cs.LG])
    We consider the problem of cost-optimal utilization of a crowdsourcing platform for binary, unsupervised classification of a collection of items, given a prescribed error threshold. Workers on the crowdsourcing platform are assumed to be divided into multiple classes, based on their skill, experience, and/or past performance. We model each worker class via an unknown confusion matrix, and a (known) price to be paid per label prediction. For this setting, we propose algorithms for acquiring label predictions from workers, and for inferring the true labels of items. We prove that if the number of (unlabeled) items available is large enough, our algorithms satisfy the prescribed error thresholds, incurring a cost that is near-optimal. Finally, we validate our algorithms, and some heuristics inspired by them, through an extensive case study.  ( 2 min )
    Learning to Accelerate Approximate Methods for Solving Integer Programming via Early Fixing. (arXiv:2207.02087v1 [cs.DM])
    Integer programming (IP) is an important and challenging problem. Approximate methods have shown promising performance on both effectiveness and efficiency for solving the IP problem. However, we observed that a large fraction of variables solved by some iterative approximate methods fluctuate around their final converged discrete states in very long iterations. Inspired by this observation, we aim to accelerate these approximate methods by early fixing these fluctuating variables to their converged states while not significantly harming the solution accuracy. To this end, we propose an early fixing framework along with the approximate method. We formulate the whole early fixing process as a Markov decision process, and train it using imitation learning. A policy network evaluates the posterior probability of each free variable concerning its discrete candidate states in each block of iterations. Specifically, we adopt the powerful multi-headed attention mechanism in the policy network. Extensive experiments on our proposed early fixing framework are conducted on three different IP applications: constrained linear programming, MRF energy minimization and sparse adversarial attack. The first is a linear IP problem, while the latter two are quadratic IP problems. We extend the problem scale from regular size to significantly large size. The extensive experiments reveal the competitiveness of our early fixing framework: the runtime speeds up significantly, while the solution quality does not degrade much; in some cases it even obtains better solutions. Our proposed early fixing framework can be regarded as an acceleration extension of ADMM methods for solving integer programming. The source codes are available at https://github.com/SCLBD/Accelerated-Lpbox-ADMM.  ( 3 min )
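    A hand-crafted stand-in for the early-fixing decision (the paper instead learns this decision with an attention-based policy trained by imitation learning): a variable is fixed once its relaxed value has stopped fluctuating near an integer over a window of iterations.

        import numpy as np

        def early_fix(history, eps=1e-3):
            # history: (window, n_vars) relaxed variable values over recent
            # iterations of the approximate IP solver.
            target = np.round(history[-1])
            settled = np.abs(history - target).max(axis=0) < eps
            return settled, target.astype(int)   # mask of fixable vars, values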
    Meta-Learning a Real-Time Tabular AutoML Method For Small Data. (arXiv:2207.01848v1 [cs.LG])
    We present TabPFN, an AutoML method that is competitive with the state of the art on small tabular datasets while being over 1,000$\times$ faster. Our method is very simple: it is fully entailed in the weights of a single neural network, and a single forward pass directly yields predictions for a new dataset. Our AutoML method is meta-learned using the Transformer-based Prior-Data Fitted Network (PFN) architecture and approximates Bayesian inference with a prior that is based on assumptions of simplicity and causal structures. The prior contains a large space of structural causal models and Bayesian neural networks with a bias for small architectures and thus low complexity. Furthermore, we extend the PFN approach to differentiably calibrate the prior's hyperparameters on real data. By doing so, we separate our abstract prior assumptions from their heuristic calibration on real data. Afterwards, the calibrated hyperparameters are fixed and TabPFN can be applied to any new tabular dataset at the push of a button. Finally, on 30 datasets from the OpenML-CC18 suite we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with predictions produced in less than a second. We provide all our code and our final trained TabPFN in the supplementary materials.  ( 2 min )
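    A hedged usage sketch, assuming the released tabpfn package exposes the scikit-learn-style interface implied by the abstract; the class name and constructor arguments may differ across versions.

        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import train_test_split
        from tabpfn import TabPFNClassifier  # assumed package interface

        X, y = load_breast_cancer(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        clf = TabPFNClassifier(device="cpu")
        clf.fit(X_tr, y_tr)            # no gradient training: a single forward
        print(clf.score(X_te, y_te))   # pass over the stored data yields predictions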
    Learning Matchable Image Transformations for Long-term Metric Visual Localization. (arXiv:1904.01080v5 [cs.CV] UPDATED)
    Long-term metric self-localization is an essential capability of autonomous mobile robots, but remains challenging for vision-based systems due to appearance changes caused by lighting, weather, or seasonal variations. While experience-based mapping has proven to be an effective technique for bridging the 'appearance gap,' the number of experiences required for reliable metric localization over days or months can be very large, and methods for reducing the necessary number of experiences are needed for this approach to scale. Taking inspiration from color constancy theory, we learn a nonlinear RGB-to-grayscale mapping that explicitly maximizes the number of inlier feature matches for images captured under different lighting and weather conditions, and use it as a pre-processing step in a conventional single-experience localization pipeline to improve its robustness to appearance change. We train this mapping by approximating the target non-differentiable localization pipeline with a deep neural network, and find that incorporating a learned low-dimensional context feature can further improve cross-appearance feature matching. Using synthetic and real-world datasets, we demonstrate substantial improvements in localization performance across day-night cycles, enabling continuous metric localization over a 30-hour period using a single mapping experience, and allowing experience-based localization to scale to long deployments with dramatically reduced data requirements.  ( 3 min )
    Recent Deep Semi-supervised Learning Approaches and Related Works. (arXiv:2106.11528v2 [cs.LG] UPDATED)
    The author of this work presents an overview of recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist a few formidable constraints, including the need for a large amount of labeled data. Therefore, semi-supervised learning, which is a learning scheme in which scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks), is becoming more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, the methods that use deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.  ( 2 min )
    Neural Networks and the Chomsky Hierarchy. (arXiv:2207.02098v1 [cs.LG])
    Reliable generalization lies at the heart of safe ML and AI. However, understanding when and how neural networks generalize remains one of the most important unsolved problems in the field. In this work, we conduct an extensive empirical study (2200 models, 16 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. We demonstrate that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. This includes negative results where even extensive amounts of data and training time never led to any non-trivial generalization, despite models having sufficient capacity to perfectly fit the training data. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.  ( 2 min )
    Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons. (arXiv:2107.02397v6 [cs.LG] UPDATED)
    This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the ReLU activation function with ours improves the experimental results.  ( 3 min )
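    The two ingredients of the activation are elementary; a sketch of plausible definitions follows (the exact composition into $\sigma$ is specified in the paper, so treat these particular forms as assumptions).

        import numpy as np

        def triangle_wave(x):
            # Period-2 triangular wave taking values in [0, 1].
            return np.abs(x - 2.0 * np.floor(x / 2.0) - 1.0)

        def softsign(x):
            return x / (1.0 + np.abs(x))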
    ICE-NODE: Integration of Clinical Embeddings with Neural Ordinary Differential Equations. (arXiv:2207.01873v1 [cs.LG])
    Early diagnosis of disease can result in improved health outcomes, such as higher survival rates and lower treatment costs. With the massive amount of information in electronic health records (EHRs), there is great potential to use machine learning (ML) methods to model disease progression aimed at early prediction of disease onset and other outcomes. In this work, we employ recent innovations in neural ODEs to harness the full temporal information of EHRs. We propose ICE-NODE (Integration of Clinical Embeddings with Neural Ordinary Differential Equations), an architecture that temporally integrates embeddings of clinical codes and neural ODEs to learn and predict patient trajectories in EHRs. We apply our method to the publicly available MIMIC-III and MIMIC-IV datasets, reporting improved prediction results compared to state-of-the-art methods, specifically for clinical codes that are not frequently observed in EHRs. We also show that ICE-NODE is more competent at predicting certain medical conditions, like acute renal failure and pulmonary heart disease, and is also able to produce patient risk trajectories over time that can be exploited for further predictions.  ( 2 min )
    ST-CoNAL: Consistency-Based Acquisition Criterion Using Temporal Self-Ensemble for Active Learning. (arXiv:2207.02182v1 [cs.CV])
    Modern deep learning has achieved great success in various fields. However, it requires the labeling of huge amounts of data, which is expensive and labor-intensive. Active learning (AL), which identifies the most informative samples to be labeled, is becoming increasingly important to maximize the efficiency of the training process. Existing AL methods mostly use only a single final fixed model for acquiring the samples to be labeled. This strategy may be insufficient, as it ignores the structural uncertainty of the model on the given training data when acquiring samples. In this study, we propose a novel acquisition criterion based on the temporal self-ensemble generated by conventional stochastic gradient descent (SGD) optimization. These self-ensemble models are obtained by capturing the intermediate network weights obtained through SGD iterations. Our acquisition function relies on a consistency measure between the student and teacher models. A fixed number of temporal self-ensemble models serve as the student models, and the teacher model is constructed by averaging the weights of the student models. Using the proposed acquisition criterion, we present an AL algorithm, namely student-teacher consistency-based AL (ST-CoNAL). Experiments conducted for image classification tasks on CIFAR-10, CIFAR-100, Caltech-256, and Tiny ImageNet datasets demonstrate that the proposed ST-CoNAL achieves significantly better performance than the existing acquisition methods. Furthermore, extensive experiments show the robustness and effectiveness of our methods.  ( 3 min )
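    A hedged sketch of a consistency-based acquisition score, approximating the teacher by averaging student outputs rather than weights (a simplification of the paper's weight averaging); the KL-based disagreement measure is also an assumption.

        import numpy as np

        def st_conal_scores(student_probs, eps=1e-12):
            # student_probs: (n_students, n_samples, n_classes) softmax outputs
            # from checkpoints captured along the SGD trajectory.
            teacher = student_probs.mean(axis=0)
            kl = (teacher[None] * (np.log(teacher[None] + eps)
                                   - np.log(student_probs + eps))).sum(-1)
            return kl.mean(axis=0)   # higher disagreement => acquire for labeling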
    CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations. (arXiv:2207.02185v1 [cs.CV])
    Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation task, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR  ( 3 min )
    Federated Phish Bowl: LSTM-Based Decentralized Phishing Email Detection. (arXiv:2110.06025v2 [cs.CR] UPDATED)
    With increasingly sophisticated phishing campaigns in recent years, phishing emails lure people using ever more legitimate-looking personal contexts. To tackle this problem, instead of traditional heuristics-based algorithms, more adaptive detection systems such as natural language processing (NLP)-powered approaches are essential to understanding phishing text representations. Nevertheless, concerns surrounding the collection of phishing data that might cover confidential information hinder the effectiveness of model learning. We propose a decentralized phishing email detection framework called Federated Phish Bowl (FedPB) which facilitates collaborative phishing detection with privacy. In particular, we devise a knowledge-sharing mechanism with federated learning (FL). Using long short-term memory (LSTM) for phishing detection, the framework adapts by sharing a global word embedding matrix across the clients, with each client running its local model on non-IID data. We collected the most recent phishing samples to study the effectiveness of the proposed method using different client numbers and data distributions. The results show that FedPB can attain performance competitive with a centralized phishing detector, generalizing to various FL settings while retaining a prediction accuracy of 83%.  ( 2 min )
    The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. (arXiv:2110.06296v2 [cs.LG] UPDATED)
    In this paper, we conjecture that if the permutation invariance of neural networks is taken into account, SGD solutions will likely have no barrier in the linear interpolation between them. Although it is a bold conjecture, we show how extensive empirical attempts fall short of refuting it. We further provide a preliminary theoretical result to support our conjecture. Our conjecture has implications for lottery ticket hypothesis, distributed training, and ensemble methods.  ( 2 min )
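    A minimal sketch of measuring the barrier along the linear path between two solutions; under the conjecture, a permutation-alignment step on one network's weights would precede this (weights flattened to vectors, loss_fn assumed given).

        import numpy as np

        def barrier(w_a, w_b, loss_fn, n=25):
            # Maximum loss along the linear interpolation, relative to the
            # endpoints; a value near zero indicates linear mode connectivity.
            ts = np.linspace(0.0, 1.0, n)
            losses = np.array([loss_fn((1 - t) * w_a + t * w_b) for t in ts])
            return losses.max() - max(losses[0], losses[-1])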
    Offline RL Policies Should be Trained to be Adaptive. (arXiv:2207.02200v1 [cs.LG])
    Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but on all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.  ( 2 min )
    An Intrusion Detection System based on Deep Belief Networks. (arXiv:2207.02117v1 [cs.CR])
    The rapid growth of connected devices has led to the proliferation of novel cyber-security threats known as zero-day attacks. Traditional behaviour-based intrusion detection systems (IDS) rely on deep neural networks (DNNs) to detect these attacks. The quality of the dataset used to train the DNN plays a critical role in the detection performance, with underrepresented samples causing poor performance. In this paper, we develop and evaluate the performance of deep belief networks (DBNs) on detecting cyber-attacks within a network of connected devices. The CICIDS2017 dataset was used to train and evaluate the performance of our proposed DBN approach. Several class balancing techniques were applied and evaluated. Lastly, we compare our approach against a conventional MLP model and the existing state-of-the-art. Our proposed DBN approach shows competitive and promising results, with significant performance improvement on the detection of attacks underrepresented in the training dataset.  ( 2 min )
    Investigating Why Contrastive Learning Benefits Robustness Against Label Noise. (arXiv:2201.12498v4 [cs.LG] UPDATED)
    Self-supervised Contrastive Learning (CL) has been recently shown to be very effective in preventing deep networks from overfitting noisy labels. Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision.  ( 3 min )
    Formalizing and Estimating Distribution Inference Risks. (arXiv:2109.06024v6 [cs.LG] UPDATED)
    Distribution inference, sometimes called property inference, infers statistical properties about a training set from access to a model trained on that data. Distribution inference attacks can pose serious risks when models are trained on private data, but are difficult to distinguish from the intrinsic purpose of statistical machine learning -- namely, to produce models that capture statistical properties about a distribution. Motivated by Yeom et al.'s membership inference framework, we propose a formal definition of distribution inference attacks that is general enough to describe a broad class of attacks distinguishing between possible training distributions. We show how our definition captures previous ratio-based property inference attacks as well as new kinds of attack including revealing the average node degree or clustering coefficient of a training graph. To understand distribution inference risks, we introduce a metric that quantifies observed leakage by relating it to the leakage that would occur if samples from the training distribution were provided directly to the adversary. We report on a series of experiments across a range of different distributions using both novel black-box attacks and improved versions of the state-of-the-art white-box attacks. Our results show that inexpensive attacks are often as effective as expensive meta-classifier attacks, and that there are surprising asymmetries in the effectiveness of attacks. Code is available at https://github.com/iamgroot42/FormEstDistRisks  ( 3 min )
    Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework. (arXiv:2111.01457v2 [cs.SD] UPDATED)
    Speech Neuroprostheses have the potential to enable communication for people with dysarthria or anarthria. Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface. Here, we investigate a less invasive measurement modality in three participants, namely stereotactic EEG (sEEG) that provides sparse sampling from multiple brain regions, including subcortical regions. To evaluate whether sEEG can also be used to synthesize high-quality audio from neural recordings, we employ a recurrent encoder-decoder model based on modern deep learning methods. We find that speech can indeed be reconstructed with correlations up to 0.8 from these minimally invasive recordings, despite limited amounts of training data.  ( 2 min )
    Creativity and Machine Learning: A Survey. (arXiv:2104.02726v3 [cs.LG] UPDATED)
    There is a growing interest in the area of machine learning and creativity. This survey presents an overview of the history and the state of the art of computational creativity theories, key machine learning techniques (including generative deep learning), and corresponding automatic evaluation methods. After presenting a critical discussion of the key contributions in this area, we outline the current research challenges and emerging opportunities in this field.  ( 2 min )
    Frustratingly Easy Transferability Estimation. (arXiv:2106.09362v3 [cs.LG] UPDATED)
    Transferability estimation has been an essential tool in transfer learning for selecting a pre-trained model and the layers in it to transfer, so as to maximize the performance on a target task and prevent negative transfer. Existing estimation algorithms either require intensive training on target tasks or have difficulties in evaluating the transferability between layers. To this end, we propose a simple, efficient, and effective transferability measure named TransRate. Through a single pass over examples of a target task, TransRate measures the transferability as the mutual information between features of target examples extracted by a pre-trained model and their labels. We overcome the challenge of efficient mutual information estimation by resorting to coding rate, which serves as an effective alternative to entropy. From the perspective of feature representation, the resulting TransRate evaluates both completeness (whether features contain sufficient information of a target task) and compactness (whether features of each class are compact enough for good generalization) of pre-trained features. Theoretically, we have analyzed the close connection of TransRate to the performance after transfer learning. Despite its extraordinary simplicity in 10 lines of code, TransRate performs remarkably well in extensive evaluations on 32 pre-trained models and 16 downstream tasks.  ( 3 min )
    Balancing Profit, Risk, and Sustainability for Portfolio Management. (arXiv:2207.02134v1 [q-fin.PM])
    Stock portfolio optimization is the process of continuous reallocation of funds to a selection of stocks. This is a particularly well-suited problem for reinforcement learning, as daily rewards are compounding and objective functions may include more than just profit, e.g., risk and sustainability. We developed a novel utility function with the Sharpe ratio representing risk and the environmental, social, and governance score (ESG) representing sustainability. We show that a state-of-the-art policy gradient method - multi-agent deep deterministic policy gradients (MADDPG) - fails to find the optimum policy due to flat policy gradients and we therefore replaced gradient descent with a genetic algorithm for parameter optimization. We show that our system outperforms MADDPG while improving on deep Q-learning approaches by allowing for continuous action spaces. Crucially, by incorporating risk and sustainability criteria in the utility function, we improve on the state-of-the-art in reinforcement learning for portfolio optimization; risk and sustainability are essential in any modern trading strategy and we propose a system that does not merely report these metrics, but that actively optimizes the portfolio to improve on them.  ( 2 min )
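    A hedged sketch of such a combined objective; the exact functional form and the trade-off weight lam are assumptions of ours, not the paper's calibrated utility.

        import numpy as np

        def utility(returns, esg_scores, weights, rf=0.0, lam=0.5):
            # returns: (T, n_assets) per-period asset returns;
            # esg_scores: (n_assets,); weights: (n_assets,) summing to 1.
            port = returns @ weights
            sharpe = (port.mean() - rf) / (port.std() + 1e-9)
            esg = esg_scores @ weights
            return sharpe + lam * esg   # candidate fitness for the genetic search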
    Continual 3D Convolutional Neural Networks for Real-time Processing of Videos. (arXiv:2106.00050v3 [cs.CV] UPDATED)
    We introduce Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip. In online tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames, which appear in overlapping clips. We show that Continual 3D CNNs can reuse preexisting 3D-CNN weights to reduce the per-prediction floating point operations (FLOPs) in proportion to the temporal receptive field while retaining similar memory requirements and accuracy. This is validated with multiple models on Kinetics-400 and Charades with remarkable results: CoX3D models attain state-of-the-art complexity/accuracy trade-offs on Kinetics-400 with 12.1-15.3x reductions of FLOPs and 2.3-3.8% improvements in accuracy compared to regular X3D models while reducing peak memory consumption by up to 48%. Moreover, we investigate the transient response of Co3D CNNs at start-up and perform extensive benchmarks of on-hardware processing characteristics for publicly available 3D CNNs.  ( 2 min )
    opPINN: Physics-Informed Neural Network with operator learning to approximate solutions to the Fokker-Planck-Landau equation. (arXiv:2207.01765v1 [math.NA])
    We propose a hybrid framework opPINN: physics-informed neural network (PINN) with operator learning for approximating the solution to the Fokker-Planck-Landau (FPL) equation. The opPINN framework is divided into two steps: Step 1 and Step 2. After the operator surrogate models are trained during Step 1, PINN can effectively approximate the solution to the FPL equation during Step 2 by using the pre-trained surrogate models. The operator surrogate models greatly reduce the computational cost and boost PINN by approximating the complex Landau collision integral in the FPL equation. The operator surrogate models can also be combined with traditional numerical schemes, providing high computational efficiency as the number of velocity modes grows. Using the opPINN framework, we provide neural network solutions for the FPL equation under various types of initial conditions and interaction models in two and three dimensions. Furthermore, based on the theoretical properties of the FPL equation, we show that the approximated neural network solution converges to the a priori classical solution of the FPL equation as the pre-defined loss function is reduced.  ( 2 min )
    Efficient Representation Learning via Adaptive Context Pooling. (arXiv:2207.01844v1 [cs.LG])
    Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.  ( 2 min )
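    A minimal PyTorch sketch of the pooling idea with a fixed window (the paper additionally learns adaptive support sizes); the module and parameter names are our own, not the released implementation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class ContextPool(nn.Module):
            def __init__(self, dim, window=5):
                super().__init__()
                self.window = window
                self.score = nn.Linear(dim, 1)   # learned pooling weights

            def forward(self, x):                # x: (batch, seq, dim)
                pad = self.window // 2
                s = self.score(x).squeeze(-1)    # per-token importance
                xp = F.pad(x, (0, 0, pad, pad))
                sp = F.pad(s, (pad, pad), value=-1e9)
                xw = xp.unfold(1, self.window, 1)            # (B, T, D, W)
                w = torch.softmax(sp.unfold(1, self.window, 1), dim=-1)
                return (xw * w.unsqueeze(2)).sum(-1)         # pooled features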
    A Unified Meta-Learning Framework for Dynamic Transfer Learning. (arXiv:2207.01784v1 [cs.LG])
    Transfer learning refers to the transfer of knowledge or information from a relevant source task to a target task. However, most existing works assume both tasks are sampled from a stationary task distribution, thereby leading to the sub-optimal performance for dynamic tasks drawn from a non-stationary task distribution in real scenarios. To bridge this gap, in this paper, we study a more realistic and challenging transfer learning setting with dynamic tasks, i.e., source and target tasks are continuously evolving over time. We theoretically show that the expected error on the dynamic target task can be tightly bounded in terms of source knowledge and consecutive distribution discrepancy across tasks. This result motivates us to propose a generic meta-learning framework L2E for modeling the knowledge transferability on dynamic tasks. It is centered around a task-guided meta-learning problem with a group of meta-pairs of tasks, based on which we are able to learn the prior model initialization for fast adaptation on the newest target task. L2E enjoys the following properties: (1) effective knowledge transferability across dynamic tasks; (2) fast adaptation to the new target task; (3) mitigation of catastrophic forgetting on historical target tasks; and (4) flexibility in incorporating any existing static transfer learning algorithms. Extensive experiments on various image data sets demonstrate the effectiveness of the proposed L2E framework.  ( 2 min )
    Vector Quantisation for Robust Segmentation. (arXiv:2207.01919v1 [eess.IV])
    The reliability of segmentation models in the medical domain depends on the model's robustness to perturbations in the input space. Robustness is a particular challenge in medical imaging exhibiting various sources of image noise, corruptions, and domain shifts. Obtaining robustness is often attempted via simulating heterogeneous environments, either heuristically in the form of data augmentation or by learning to generate specific perturbations in an adversarial manner. We propose and justify that learning a discrete representation in a low dimensional embedding space improves robustness of a segmentation model. This is achieved with a dictionary learning method called vector quantisation. We use a set of experiments designed to analyse robustness in both the latent and output space under domain shift and noise perturbations in the input space. We adapt the popular UNet architecture, inserting a quantisation block in the bottleneck. We demonstrate improved segmentation accuracy and better robustness on three segmentation tasks. Code is available at https://github.com/AinkaranSanthi/Vector-Quantisation-for-Robust-Segmentation  ( 2 min )
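    The quantisation block itself reduces to a nearest-neighbour codebook lookup; a minimal sketch follows (a straight-through gradient estimator, omitted here, is the usual addition for end-to-end training).

        import torch

        def vector_quantise(z, codebook):
            # z: (N, D) encoder outputs; codebook: (K, D) learned code vectors.
            idx = torch.cdist(z, codebook).argmin(dim=1)
            return codebook[idx], idx   # quantised vectors and code indices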
    Randomized-to-Canonical Model Predictive Control for Real-world Visual Robotic Manipulation. (arXiv:2207.01840v1 [cs.RO])
    Many works have recently explored sim-to-real transferable visual model predictive control (MPC). However, such works are limited to one-shot transfer, where real-world data must be collected once to perform the sim-to-real transfer, which remains a significant human effort in transferring the models learned in simulations to new domains in the real world. To alleviate this problem, we first propose a novel model-learning framework called Kalman Randomized-to-Canonical Model (KRC-model). This framework is capable of extracting task-relevant intrinsic features and their dynamics from randomized images. We then propose Kalman Randomized-to-Canonical Model Predictive Control (KRC-MPC) as a zero-shot sim-to-real transferable visual MPC using KRC-model. The effectiveness of our method is evaluated through a valve rotation task by a robot hand in both simulation and the real world, and a block mating task in simulation. The experimental results show that KRC-MPC can be applied to various real domains and tasks in a zero-shot manner.  ( 2 min )
    Machine Learning in Access Control: A Taxonomy and Survey. (arXiv:2207.01739v1 [cs.CR])
    An increasing body of work has recognized the importance of exploiting machine learning (ML) advancements to address the need for efficient automation in extracting access control attributes, policy mining, policy verification, access decisions, etc. In this work, we survey and summarize various ML approaches to solve different access control problems. We propose a novel taxonomy of the ML model's application in the access control domain. We highlight current limitations and open challenges such as lack of public real-world datasets, administration of ML-based access control systems, understanding a black-box ML model's decision, etc., and enumerate future research directions.  ( 2 min )
    Anomaly-aware multiple instance learning for rare anemia disorder classification. (arXiv:2207.01742v1 [cs.LG])
    Deep learning-based classification of rare anemia disorders is challenged by the lack of training data and instance-level annotations. Multiple Instance Learning (MIL) has been shown to be an effective solution, yet it suffers from low accuracy and limited explainability. Although the inclusion of attention mechanisms has addressed these issues, their effectiveness highly depends on the amount and diversity of cells in the training samples. Consequently, the poor machine learning performance on rare anemia disorder classification from blood samples remains unresolved. In this paper, we propose an interpretable pooling method for MIL to address these limitations. By benefiting from instance-level information of negative bags (i.e., homogeneous benign cells from healthy individuals), our approach increases the contribution of anomalous instances. We show that our strategy outperforms standard MIL classification algorithms and provides a meaningful explanation behind its decisions. Moreover, it can flag anomalous instances of rare blood diseases that are not seen during the training phase.  ( 2 min )
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v1 [cs.LG])
    In a preference learning setting, every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.  ( 2 min )
    Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery. (arXiv:2207.01822v1 [cs.LG])
    Healthcare datasets present many challenges to both machine learning and statistics as their data are typically heterogeneous, censored, high-dimensional and have missing information. Feature selection is often used to identify the important features but can produce unstable results when applied to high-dimensional data, selecting a different set of features on each iteration. The stability of feature selection can be improved with the use of feature selection ensembles, which aggregate the results of multiple base feature selectors. A threshold must be applied to the final aggregated feature set to separate the relevant features from the redundant ones. A fixed threshold, which is typically applied, offers no guarantee that the final set of selected features contains only relevant features. This work develops several data-driven thresholds to automatically identify the relevant features in an ensemble feature selector and evaluates their predictive accuracy and stability. To demonstrate the applicability of these methods to clinical data, they are applied to data from two real-world Alzheimer's disease (AD) studies. AD is a progressive neurodegenerative disease with no known cure, that begins at least 2-3 decades before overt symptoms appear, presenting an opportunity for researchers to identify early biomarkers that might identify patients at risk of developing AD. Features identified by applying these methods to both datasets reflect current findings in the AD literature.  ( 3 min )
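    A minimal sketch of rank aggregation with one simple data-driven threshold (keeping features above the mean aggregated score); the paper evaluates several such thresholds, so this particular rule is only illustrative.

        import numpy as np

        def ensemble_select(importances):
            # importances: (n_selectors, n_features) scores from base selectors.
            ranks = importances.argsort(axis=1).argsort(axis=1)  # high = important
            agg = ranks.mean(axis=0)
            return np.where(agg > agg.mean())[0]   # indices of retained features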
    Discrete Tree Flows via Tree-Structured Permutations. (arXiv:2207.01744v1 [cs.LG])
    While normalizing flows for continuous data have been extensively researched, flows for discrete data have only recently been explored. These prior models, however, suffer from limitations that are distinct from those of continuous flows. Most notably, discrete flow-based models cannot be straightforwardly optimized with conventional deep learning methods because gradients of discrete functions are undefined or zero. Previous works approximate pseudo-gradients of the discrete functions but do not solve the problem on a fundamental level. In addition to that, backpropagation can be computationally burdensome compared to alternative discrete algorithms such as decision tree algorithms. Our approach seeks to reduce computational burden and remove the need for pseudo-gradients by developing a discrete flow based on decision trees -- building upon the success of efficient tree-based methods for classification and regression for discrete data. We first define a tree-structured permutation (TSP) that compactly encodes a permutation of discrete data where the inverse is easy to compute; thus, we can efficiently compute the density value and sample new data. We then propose a decision tree algorithm to build TSPs that learns the tree structure and permutations at each node via novel criteria. We empirically demonstrate the feasibility of our method on multiple datasets.  ( 2 min )
    GSMFlow: Generation Shifts Mitigating Flow for Generalized Zero-Shot Learning. (arXiv:2207.01798v1 [cs.CV])
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both the seen and unseen classes by transferring semantic knowledge from seen to unseen classes. It is a promising solution to take advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes. However, due to the generation shifts, the samples synthesized by most existing methods may drift from the real distribution of the unseen data. To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation. Specifically, we discover and address three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder. First, to enhance the reflection of the semantic information in the generated samples, we explicitly embed the semantic information into the transformation in each conditional affine coupling layer. Second, to recover the intrinsic variance of the real unseen features, we introduce a boundary sample mining strategy with entropy maximization to discover more difficult visual variants of semantic prototypes and thereby adjust the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Extensive experimental results on four GZSL benchmark datasets demonstrate that GSMFlow achieves state-of-the-art performance on GZSL.  ( 3 min )
    How Much More Data Do I Need? Estimating Requirements for Downstream Tasks. (arXiv:2207.01725v1 [cs.CV])
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance? This question is of critical importance in applications such as autonomous driving or medical imaging, where collecting data is expensive and time-consuming. Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget. Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements. Finally, we show that incorporating a tuned correction factor and collecting over multiple rounds significantly improves the performance of the data estimators. Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.  ( 3 min )
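    As a concrete picture of the baseline the paper starts from, one can fit a saturating power law to a few learning-curve points and invert it for a target error. The functional form, the made-up curve, and the 1.2x safety margin below are illustrative assumptions, not the paper's tuned estimator.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # error ~ a * n^{-b} + c, where c is an irreducible error floor
    return a * np.power(n, -b) + c

# Observed validation errors at a few data set sizes (made-up numbers).
sizes = np.array([1_000, 2_000, 4_000, 8_000])
errors = np.array([0.30, 0.22, 0.16, 0.12])

(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.05),
                         maxfev=10_000)

target_error = 0.10
if target_error > c:
    n_needed = (a / (target_error - c)) ** (1.0 / b)
    # Multiplicative safety margin, in the spirit of the paper's tuned
    # correction factor (the value 1.2 is an arbitrary placeholder).
    print(f"estimated size: {n_needed:,.0f}, with margin: {1.2 * n_needed:,.0f}")
else:
    print("target lies below the fitted error floor; more data cannot reach it")
```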
    CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. (arXiv:2207.01780v1 [cs.LG])
Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.  ( 3 min )
    PoF: Post-Training of Feature Extractor for Improving Generalization. (arXiv:2207.01847v1 [cs.LG])
The local shape of the loss landscape near a minimum, especially its flatness, has been intensively investigated and plays an important role in the generalization of deep models. We developed a training algorithm called PoF: Post-Training of Feature Extractor, which updates the feature extractor part of an already-trained deep model to search for a flatter minimum. The characteristics are two-fold: 1) the feature extractor is trained under parameter perturbations in the higher-layer parameter space, based on observations that suggest flattening the higher-layer parameter space, and 2) the perturbation range is determined in a data-driven manner, aiming to reduce the part of the test loss caused by positive loss curvature. We provide a theoretical analysis that shows the proposed algorithm implicitly reduces the target Hessian components as well as the loss. Experimental results show that PoF improved model performance against baseline methods on both the CIFAR-10 and CIFAR-100 datasets with only 10 epochs of post-training, and on the SVHN dataset with 50 epochs of post-training. Source code is available at: \url{https://github.com/DensoITLab/PoF-v1}.  ( 2 min )
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v1 [cs.LG])
Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application to real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking for continuous control tasks. In this paper, we propose a new regularizer named $\textbf{U}$ncertainty $\textbf{S}$et $\textbf{R}$egularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in robust performance in perturbed testing environments.  ( 2 min )
    Do Not Take It for Granted: Comparing Open-Source Libraries for Software Development Effort Estimation. (arXiv:2207.01705v1 [cs.SE])
In the past two decades, several Machine Learning (ML) libraries have become freely available. Many studies have used such libraries to carry out empirical investigations on predictive Software Engineering (SE) tasks. However, the differences stemming from using one library over another have been overlooked, implicitly assuming that using any of these libraries would provide the user with the same or very similar results. This paper aims at raising awareness of the differences incurred when using different ML libraries for software development effort estimation (SEE), one of the most widely studied SE prediction tasks. To this end, we investigate 4 deterministic machine learners as provided by 3 of the most popular ML open-source libraries written in different languages (namely, Scikit-Learn, Caret and Weka). We carry out a thorough empirical study comparing the performance of the machine learners on 5 SEE datasets in the two most common SEE scenarios (i.e., out-of-the-box-ml and tuned-ml) as well as an in-depth analysis of the documentation and code of their APIs. The results of our study reveal that the predictions provided by the 3 libraries differ in 95% of the cases on average across a total of 105 cases studied. These differences are significantly large in most cases and yield misestimations of up to approx. 3,000 hours per project. Moreover, our API analysis reveals that these libraries provide the user with different levels of control on the parameters one can manipulate, and a lack of clarity and consistency, overall, which might mislead users. Our findings highlight that the ML library is an important design choice for SEE studies, which can lead to a difference in performance. However, such a difference is under-documented. We conclude by highlighting open challenges with suggestions for the developers of libraries as well as for the researchers and practitioners using them.  ( 3 min )
    The Deep Ritz Method for Parametric $p$-Dirichlet Problems. (arXiv:2207.01894v1 [math.NA])
    We establish error estimates for the approximation of parametric $p$-Dirichlet problems deploying the Deep Ritz Method. Parametric dependencies include, e.g., varying geometries and exponents $p\in (1,\infty)$. Combining the derived error estimates with quantitative approximation theorems yields error decay rates and establishes that the Deep Ritz Method retains the favorable approximation capabilities of neural networks in the approximation of high dimensional functions which makes the method attractive for parametric problems. Finally, we present numerical examples to illustrate potential applications.  ( 2 min )
    What Do Graph Convolutional Neural Networks Learn?. (arXiv:2207.01839v1 [cs.LG])
Graph neural networks (GNNs) have gained traction over the past few years for their superior performance in numerous machine learning tasks. Graph Convolutional Neural Networks (GCNs) are a common variant of GNNs, known for their high performance in semi-supervised node classification (SSNC), and work well under the assumption of homophily. Recent literature has highlighted that GCNs can achieve strong performance on heterophilous graphs under certain "special conditions". These arguments motivate us to understand why, and how, GCNs learn to perform SSNC. We find a positive correlation between the similarity of latent node embeddings of nodes within a class and the performance of a GCN. Our investigation of the underlying graph structures of a dataset finds that a GCN's SSNC performance is significantly influenced by the consistency and uniqueness of the neighborhood structure of nodes within a class.  ( 2 min )
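    The correlation the authors report can be probed with a simple diagnostic: the average pairwise cosine similarity of latent embeddings within each class. The sketch below assumes you already have node embeddings (say, a GCN's penultimate layer) and labels; the random arrays are placeholders.

```python
import numpy as np

def intra_class_similarity(embeddings, labels):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    out = {}
    for c in np.unique(labels):
        Z = normed[labels == c]
        n = len(Z)
        if n < 2:
            continue  # similarity undefined for singleton classes
        sim = Z @ Z.T  # pairwise cosine similarities
        # average over off-diagonal pairs only
        out[int(c)] = (sim.sum() - np.trace(sim)) / (n * (n - 1))
    return out

emb = np.random.randn(100, 16)           # placeholder embeddings
lab = np.random.randint(0, 3, size=100)  # placeholder labels
print(intra_class_similarity(emb, lab))
```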
    FACT: High-Dimensional Random Forests Inference. (arXiv:2207.01678v1 [stat.ML])
Random forests is one of the most widely used machine learning methods over the past decade thanks to its outstanding empirical performance. Yet, because of its black-box nature, the results produced by random forests can be hard to interpret in many big data applications. Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some popularly used feature importance measures for random forests suffer from the bias issue. In addition, comprehensive size and power analyses are lacking for most of these existing methods. In this paper, we approach the problem via hypothesis testing, and suggest a framework of the self-normalized feature-residual correlation test (FACT) for evaluating the significance of a given feature in the random forests model with a bias-resistance property, where our null hypothesis concerns whether the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. The vanilla version of our FACT test can suffer from the bias issue in the presence of feature dependency. We exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT can provide theoretically justified random forests feature p-values and enjoy appealing power through nonasymptotic analyses. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application in relation to COVID-19.  ( 3 min )
    Slice-by-slice deep learning aided oropharyngeal cancer segmentation with adaptive thresholding for spatial uncertainty on FDG PET and CT images. (arXiv:2207.01623v1 [eess.IV])
    Tumor segmentation is a fundamental step for radiotherapy treatment planning. To define an accurate segmentation of the primary tumor (GTVp) of oropharyngeal cancer patients (OPC), simultaneous assessment of different image modalities is needed, and each image volume is explored slice-by-slice from different orientations. Moreover, the manual fixed boundary of segmentation neglects the spatial uncertainty known to occur in tumor delineation. This study proposes a novel automatic deep learning (DL) model to assist radiation oncologists in a slice-by-slice adaptive GTVp segmentation on registered FDG PET/CT images. We included 138 OPC patients treated with (chemo)radiation in our institute. Our DL framework exploits both inter and intra-slice context. Sequences of 3 consecutive 2D slices of concatenated FDG PET/CT images and GTVp contours were used as input. A 3-fold cross validation was performed three times, training on sequences extracted from the Axial (A), Sagittal (S), and Coronal (C) plane of 113 patients. Since consecutive sequences in a volume contain overlapping slices, each slice resulted in three outcome predictions that were averaged. In the A, S, and C planes, the output shows areas with different probabilities of predicting the tumor. The performance of the models was assessed on 25 patients at different probability thresholds using the mean Dice Score Coefficient (DSC). Predictions were the closest to the ground truth at a probability threshold of 0.9 (DSC of 0.70 in the A, 0.77 in the S, and 0.80 in the C plane). The promising results of the proposed DL model show that the probability maps on registered FDG PET/CT images could guide radiation oncologists in a slice-by-slice adaptive GTVp segmentation.  ( 3 min )
  • Open

    Offline RL Policies Should be Trained to be Adaptive. (arXiv:2207.02200v1 [cs.LG])
Offline RL algorithms must account for the fact that the dataset they are provided may leave many facets of the environment unknown. The most common way to approach this challenge is to employ pessimistic or conservative methods, which avoid behaviors that are too dissimilar from those in the training dataset. However, relying exclusively on conservatism has drawbacks: performance is sensitive to the exact degree of conservatism, and conservative objectives can recover highly suboptimal policies. In this work, we propose that offline RL methods should instead be adaptive in the presence of uncertainty. We show that acting optimally in offline RL in a Bayesian sense involves solving an implicit POMDP. As a result, optimal policies for offline RL must be adaptive, depending not just on the current state but rather all the transitions seen so far during evaluation. We present a model-free algorithm for approximating this optimal adaptive policy, and demonstrate the efficacy of learning such adaptive policies in offline RL benchmarks.  ( 2 min )
    A Generative Framework for Personalized Learning and Estimation: Theory, Algorithms, and Privacy. (arXiv:2207.01771v1 [cs.LG])
A distinguishing characteristic of federated learning is that the (local) client data could have statistical heterogeneity. This heterogeneity has motivated the design of personalized learning, where individual (personalized) models are trained, through collaboration. There have been various personalization methods proposed in the literature, with seemingly very different forms and methods ranging from use of a single global model for local regularization and model interpolation, to use of multiple global models for personalized clustering, etc. In this work, we begin with a generative framework that could potentially unify several different algorithms as well as suggest new algorithms. We apply our generative framework to personalized estimation, and connect it to the classical empirical Bayes methodology. We develop private personalized estimation under this framework. We then use our generative framework for learning, which unifies several known personalized FL algorithms and also suggests new ones; we propose and study a new algorithm, AdaPeD, based on knowledge distillation, which numerically outperforms several known algorithms. We also develop privacy for personalized learning methods with guarantees for user-level privacy and composition. We numerically evaluate the performance as well as the privacy for both the estimation and learning problems, demonstrating the advantages of our proposed methods.
    On Effective Scheduling of Model-based Reinforcement Learning. (arXiv:2111.08550v3 [cs.LG] UPDATED)
    Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it is still unclear how to appropriately schedule the important hyperparameters to achieve adequate performance, such as the real data ratio for policy optimization in Dyna-style model-based algorithms. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by the analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio as well as other hyperparameters in training model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO can significantly surpass the original one, and the real data ratio schedule found by AutoMBPO shows consistency with our theoretical analysis.
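    The practical takeaway — gradually raise the share of real transitions in each policy-training batch — can be mimicked with a hand-written schedule. The linear ramp below is a stand-in for AutoMBPO's learned schedule; all names and values are illustrative.

```python
import random

def sample_batch(real_buffer, model_buffer, batch_size, step, total_steps,
                 start_ratio=0.1, end_ratio=0.9):
    # Linear schedule: the real-data ratio grows from start_ratio to
    # end_ratio over the course of training.
    ratio = start_ratio + (end_ratio - start_ratio) * min(step / total_steps, 1.0)
    n_real = int(batch_size * ratio)
    batch = random.sample(real_buffer, n_real)
    batch += random.sample(model_buffer, batch_size - n_real)
    return batch

real = [("s", "a", "r", "s2")] * 1000  # placeholder real transitions
fake = [("s", "a", "r", "s2")] * 1000  # placeholder model rollouts
print(len(sample_batch(real, fake, batch_size=64, step=500, total_steps=1000)))
```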
    DAS-PINNs: A deep adaptive sampling method for solving high-dimensional partial differential equations. (arXiv:2112.14038v2 [math.NA] UPDATED)
In this work, we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss on the collocation points in the training set and generating a new training set to further improve the accuracy of the current approximate solution. In particular, we treat the residual as a probability density function and approximate it with a deep generative model, called KRnet. The new samples from KRnet are consistent with the distribution induced by the residual, i.e., more samples are located in regions of large residual and fewer samples are located in regions of small residual. Analogous to classical adaptive methods such as the adaptive finite element method, KRnet acts as an error indicator that guides the refinement of the training set. Compared to the neural network approximation obtained with uniformly distributed collocation points, the developed algorithms can significantly improve the accuracy, especially for low-regularity and high-dimensional problems. We demonstrate the effectiveness of the proposed DAS method with numerical experiments.
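    The refinement loop can be approximated without any generative model: score a pool of candidate collocation points by their residual and resample proportionally. The paper trains KRnet to generate such points; the multinomial resampling below is a crude, illustrative stand-in, and the toy residual is a placeholder.

```python
import numpy as np

def refine_collocation(residual_fn, n_new, dim, pool_size=10_000, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    pool = rng.uniform(-1.0, 1.0, size=(pool_size, dim))  # candidate points
    r = np.abs(residual_fn(pool))
    p = r / r.sum()                                       # residual as density
    idx = rng.choice(pool_size, size=n_new, replace=True, p=p)
    return pool[idx]

# Toy residual that is large near the origin: new points concentrate there.
res = lambda x: np.exp(-10 * (x ** 2).sum(axis=1))
new_pts = refine_collocation(res, n_new=256, dim=2)
print(new_pts.mean(axis=0), new_pts.std(axis=0))
```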
    Learning Stochastic Shortest Path with Linear Function Approximation. (arXiv:2110.12727v3 [cs.LG] UPDATED)
We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning the linear mixture SSP, which can attain an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is the lower bound of the cost function. Our algorithm also applies to the case when $c_{\min} = 0$, and an $\tilde{\mathcal{O}}(K^{2/3})$ regret is guaranteed. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSP. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. In complement to the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star} \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
    An Approximation Method for Fitted Random Forests. (arXiv:2207.02184v1 [stat.ML])
Random Forests (RF) is a popular machine learning method for classification and regression problems. It involves a bagging application to decision tree models. One of the primary advantages of the Random Forests model is the reduction in the variance of the forecast. In large-scale applications of the model with millions of data points and hundreds of features, the size of the fitted objects can get very large and reach the limits on the available space in production setups, depending on the number and depth of the trees. This could be especially challenging when trained models need to be downloaded on-demand to small devices with limited memory. There is a need to approximate the trained RF models to significantly reduce the model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using the multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently, a generalized additive model (GAM) extension) to the output of each tree helps reduce the size while preserving the prediction quality.  ( 2 min )
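    The core idea — replace each tree by a compact model of its leaf allocation — can be prototyped directly: predict each tree's leaf index from the inputs with a multinomial logistic regression and check how often the surrogate agrees. A toy sketch under those assumptions (the dataset and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# One multinomial logistic surrogate per tree, predicting its leaf ids.
leaf_models = []
for tree in rf.estimators_:
    leaves = tree.apply(X)  # leaf id assigned to each sample
    leaf_models.append(LogisticRegression(max_iter=1000).fit(X, leaves))

# How faithfully does each surrogate reproduce its tree's leaf allocation?
for i, (tree, clf) in enumerate(zip(rf.estimators_, leaf_models)):
    acc = (clf.predict(X) == tree.apply(X)).mean()
    print(f"tree {i}: surrogate matches leaf assignment on {acc:.0%} of samples")
```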
    PRoA: A Probabilistic Robustness Assessment against Functional Perturbations. (arXiv:2207.02036v1 [cs.LG])
In safety-critical deep learning applications, robustness measurement is a vital pre-deployment phase. However, existing robustness verification methods are not sufficiently practical for deploying machine learning systems in the real world. On the one hand, these methods attempt to claim that no perturbations can ``fool'' deep neural networks (DNNs), which may be too stringent in practice. On the other hand, existing works rigorously consider $L_p$ bounded additive perturbations on the pixel space, although perturbations such as colour shifting and geometric transformations occur more practically and frequently in the real world. Thus, from the practical standpoint, we present a novel and general {\it probabilistic robustness assessment method} (PRoA) based on adaptive concentration, and it can measure the robustness of deep learning models against functional perturbations. PRoA can provide statistical guarantees on the probabilistic robustness of a model, \textit{i.e.}, the probability of failure encountered by the trained model after deployment. Our experiments demonstrate the effectiveness and flexibility of PRoA in terms of evaluating the probabilistic robustness against a broad range of functional perturbations, and PRoA can scale well to various large-scale deep neural networks compared to existing state-of-the-art baselines. For the purpose of reproducibility, we release our tool on GitHub: \url{https://github.com/TrustAI/PRoA}.
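    At its simplest, a probabilistic robustness estimate is a Monte Carlo failure rate with a concentration bound attached. PRoA's adaptive concentration tightens such bounds sequentially; the fixed-sample Hoeffding version below is only a simplified stand-in, and the toy model and perturbation are placeholders.

```python
import math
import random

def failure_probability(predict, x, true_label, perturb, n=1_000, delta=0.05):
    # Fraction of randomly perturbed inputs that flip the prediction,
    # plus a (1 - delta) Hoeffding confidence interval around it.
    fails = sum(predict(perturb(x)) != true_label for _ in range(n))
    p_hat = fails / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding half-width
    return p_hat, (max(0.0, p_hat - eps), min(1.0, p_hat + eps))

# Toy example: a "model" on scalars, perturbed by random brightness shifts.
predict = lambda v: int(v > 0.5)
perturb = lambda v: v + random.uniform(-0.2, 0.2)  # functional perturbation
print(failure_probability(predict, x=0.6, true_label=1, perturb=perturb))
```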
    Graph Clustering with Graph Neural Networks. (arXiv:2006.16904v2 [cs.LG] UPDATED)
Graph Neural Networks (GNNs) have achieved state-of-the-art results on many graph analysis tasks such as node classification and link prediction. However, important unsupervised problems on graphs, such as graph clustering, have proved more resistant to advances in GNNs. Graph clustering has the same overall goal as node pooling in GNNs - does this mean that GNN pooling methods do a good job at clustering graphs? Surprisingly, the answer is no - current GNN pooling methods often fail to recover the cluster structure in cases where simple baselines, such as k-means applied on learned representations, work well. We investigate further by carefully designing a set of experiments to study different signal-to-noise scenarios both in graph structure and attribute data. To address these methods' poor performance in clustering, we introduce Deep Modularity Networks (DMoN), an unsupervised pooling method inspired by the modularity measure of clustering quality, and show how it tackles recovery of the challenging clustering structure of real-world graphs. Similarly, on real-world data, we show that DMoN produces high-quality clusters which correlate strongly with ground truth labels, achieving state-of-the-art results with over 40% improvement over other pooling methods across different metrics.
    Variational Bayes for high-dimensional proportional hazards models with applications within gene expression. (arXiv:2112.10270v2 [stat.ME] UPDATED)
    Few Bayesian methods for analyzing high-dimensional sparse survival data provide scalable variable selection, effect estimation and uncertainty quantification. Such methods often either sacrifice uncertainty quantification by computing maximum a posteriori estimates, or quantify the uncertainty at high (unscalable) computational expense. We bridge this gap and develop an interpretable and scalable Bayesian proportional hazards model for prediction and variable selection, referred to as SVB. Our method, based on a mean-field variational approximation, overcomes the high computational cost of MCMC whilst retaining useful features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities. The performance of our proposed method is assessed via extensive simulations and compared against other state-of-the-art Bayesian variable selection methods, demonstrating comparable or better performance. Finally, we demonstrate how the proposed method can be used for variable selection on two transcriptomic datasets with censored survival outcomes, and how the uncertainty quantification offered by our method can be used to provide an interpretable assessment of patient risk.
    An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge. (arXiv:2205.14775v2 [stat.ML] UPDATED)
    We propose an algorithm for non-stationary kernel bandits that does not require prior knowledge of the degree of non-stationarity. The algorithm follows randomized strategies obtained by solving optimization problems that balance exploration and exploitation. It adapts to non-stationarity by restarting when a change in the reward function is detected. Our algorithm enjoys a tighter dynamic regret bound than previous work on the non-stationary kernel bandit setting. Moreover, when applied to the non-stationary linear bandit setting by using a linear kernel, our algorithm is nearly minimax optimal, solving an open problem in the non-stationary linear bandit literature. We extend our algorithm to use a neural network for dynamically adapting the feature mapping to observed data. We prove a dynamic regret bound of the extension using the neural tangent kernel theory. We demonstrate empirically that our algorithm and the extension can adapt to varying degrees of non-stationarity.
    A survey of multimodal deep generative models. (arXiv:2207.02127v1 [cs.LG])
    Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are the inference of shared representations from arbitrary modalities and cross-modal generation via these representations; however, achieving this requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models in which distributions are parameterized by deep neural networks, have attracted much attention, especially variational autoencoders, which are suitable for accomplishing the above challenges because they can consider heterogeneity and infer good representations of data. Therefore, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.
    $\pi$VAE: a stochastic process prior for Bayesian deep learning with MCMC. (arXiv:2002.06873v5 [cs.LG] UPDATED)
Stochastic processes provide a mathematically elegant way to model complex data. In theory, they provide flexible priors over function classes that can encode a wide range of interesting assumptions. In practice, however, efficient inference by optimisation or marginalisation is difficult, a problem further exacerbated with big data and high dimensional input spaces. We propose a novel variational autoencoder (VAE) called the prior encoding variational autoencoder ($\pi$VAE). The $\pi$VAE is finitely exchangeable and Kolmogorov consistent, and thus is a continuous stochastic process. We use $\pi$VAE to learn low dimensional embeddings of function classes. We show that our framework can accurately learn expressive function classes such as Gaussian processes, but also properties of functions to enable statistical inference (such as the integral of a log Gaussian process). For popular tasks, such as spatial interpolation, $\pi$VAE achieves state-of-the-art performance both in terms of accuracy and computational efficiency. Perhaps most usefully, we demonstrate that the low dimensional independently distributed latent space representation learnt provides an elegant and scalable means of performing Bayesian inference for stochastic processes within probabilistic programming languages such as Stan.
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v2 [stat.ML] UPDATED)
Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients and, at the same time, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible enough to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time-to-event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    Learning Optimal Transport Between two Empirical Distributions with Normalizing Flows. (arXiv:2207.01246v2 [cs.LG] UPDATED)
Optimal transport (OT) provides effective tools for comparing and mapping probability measures. We propose to leverage the flexibility of neural networks to learn an approximate optimal transport map. More precisely, we present a new and original method to address the problem of transporting a finite set of samples associated with a first underlying unknown distribution towards another finite set of samples drawn from another unknown distribution. We show that a particular instance of invertible neural networks, namely the normalizing flows, can be used to approximate the solution of this OT problem between a pair of empirical distributions. To this aim, we propose to relax the Monge formulation of OT by replacing the equality constraint on the push-forward measure with the minimization of the corresponding Wasserstein distance. The push-forward operator to be retrieved is then restricted to be a normalizing flow which is trained by optimizing the resulting cost function. This approach allows the transport map to be discretized as a composition of functions. Each of these functions is associated with one sub-flow of the network, whose output provides intermediate steps of the transport between the original and target measures. This discretization also yields a set of intermediate barycenters between the two measures of interest. Experiments conducted on toy examples as well as a challenging task of unsupervised translation demonstrate the merits of the proposed method. Finally, some experiments show that the proposed approach leads to a good approximation of the true OT.
    Meta-Learning a Real-Time Tabular AutoML Method For Small Data. (arXiv:2207.01848v1 [cs.LG])
    We present TabPFN, an AutoML method that is competitive with the state of the art on small tabular datasets while being over 1,000$\times$ faster. Our method is very simple: it is fully entailed in the weights of a single neural network, and a single forward pass directly yields predictions for a new dataset. Our AutoML method is meta-learned using the Transformer-based Prior-Data Fitted Network (PFN) architecture and approximates Bayesian inference with a prior that is based on assumptions of simplicity and causal structures. The prior contains a large space of structural causal models and Bayesian neural networks with a bias for small architectures and thus low complexity. Furthermore, we extend the PFN approach to differentiably calibrate the prior's hyperparameters on real data. By doing so, we separate our abstract prior assumptions from their heuristic calibration on real data. Afterwards, the calibrated hyperparameters are fixed and TabPFN can be applied to any new tabular dataset at the push of a button. Finally, on 30 datasets from the OpenML-CC18 suite we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with predictions produced in less than a second. We provide all our code and our final trained TabPFN in the supplementary materials.
    Making Sense of Dependence: Efficient Black-box Explanations Using Dependence Measure. (arXiv:2206.06219v2 [cs.CV] UPDATED)
    This paper presents a new efficient black-box attribution method based on Hilbert-Schmidt Independence Criterion (HSIC), a dependence measure based on Reproducing Kernel Hilbert Spaces (RKHS). HSIC measures the dependence between regions of an input image and the output of a model based on kernel embeddings of distributions. It thus provides explanations enriched by RKHS representation capabilities. HSIC can be estimated very efficiently, significantly reducing the computational cost compared to other black-box attribution methods. Our experiments show that HSIC is up to 8 times faster than the previous best black-box attribution methods while being as faithful. Indeed, we improve or match the state-of-the-art of both black-box and white-box attribution methods for several fidelity metrics on Imagenet with various recent model architectures. Importantly, we show that these advances can be transposed to efficiently and faithfully explain object detection models such as YOLOv4. Finally, we extend the traditional attribution methods by proposing a new kernel enabling an orthogonal decomposition of importance scores based on HSIC, allowing us to evaluate not only the importance of each image patch but also the importance of their pairwise interactions.
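    The dependence measure itself is compact enough to write out: the (biased) empirical HSIC between two samples under Gaussian kernels. The attribution loop around it — masking image regions and scoring their dependence with the model output — is omitted here, and the kernel bandwidths are arbitrary.

```python
import numpy as np

def rbf_gram(X, sigma):
    # Gaussian kernel Gram matrix for samples stacked as rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    # Biased empirical estimator: trace(K H L H) / (n - 1)^2.
    n = len(X)
    K, L = rbf_gram(X, sigma_x), rbf_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print("dependent:  ", hsic(x, x ** 2))                      # clearly nonzero
print("independent:", hsic(x, rng.normal(size=(200, 1))))   # near zero
```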
    Estimating means of bounded random variables by betting. (arXiv:2010.09686v6 [math.ST] UPDATED)
    This paper derives confidence intervals (CI) and time-uniform confidence sequences (CS) for the classical problem of estimating an unknown mean from bounded observations. We present a general approach for deriving concentration bounds, that can be seen as a generalization (and improvement) of the celebrated Chernoff method. At its heart, it is based on deriving a new class of composite nonnegative martingales, with strong connections to testing by betting and the method of mixtures. We show how to extend these ideas to sampling without replacement, another heavily studied problem. In all cases, our bounds are adaptive to the unknown variance, and empirically vastly outperform existing approaches based on Hoeffding or empirical Bernstein inequalities and their recent supermartingale generalizations. In short, we establish a new state-of-the-art for four fundamental problems: CSs and CIs for bounded means, when sampling with and without replacement.
    A Probabilistic State Space Model for Joint Inference from Differential Equations and Data. (arXiv:2103.10153v3 [stat.ML] UPDATED)
    Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
    DMS, AE, DAA: methods and applications of adaptive time series model selection, ensemble, and financial evaluation. (arXiv:2110.11156v3 [stat.AP] UPDATED)
    We introduce three adaptive time series learning methods, called Dynamic Model Selection (DMS), Adaptive Ensemble (AE), and Dynamic Asset Allocation (DAA). The methods respectively handle model selection, ensembling, and contextual evaluation in financial time series. Empirically, we use the methods to forecast the returns of four key indices in the US market, incorporating information from the VIX and Yield curves. We present financial applications of the learning results, including fully-automated portfolios and dynamic hedging strategies. The strategies strongly outperform long-only benchmarks over our testing period, spanning from Q4 2015 to the end of 2021. The key outputs of the learning methods are interpreted during the 2020 market crash.
    VisRuler: Visual Analytics for Extracting Decision Rules from Bagged and Boosted Decision Trees. (arXiv:2112.00334v3 [cs.LG] UPDATED)
Bagging and boosting are two popular ensemble methods in machine learning (ML) that produce many individual decision trees. Due to the inherent ensemble characteristic of these methods, they typically outperform single decision trees or other ML models in predictive performance. However, numerous decision paths are generated for each decision tree, increasing the overall complexity of the model and hindering its use in domains that require trustworthy and explainable decisions, such as finance, social care, and health care. Thus, the interpretability of bagging and boosting algorithms, such as random forest and adaptive boosting, decreases as the number of decisions rises. In this paper, we propose a visual analytics tool that aims to assist users in extracting decisions from such ML models via a thorough visual inspection workflow that includes selecting a set of robust and diverse models (originating from different ensemble learning algorithms), choosing important features according to their global contribution, and deciding which decisions are essential for global explanation (or locally, for specific cases). The outcome is a final decision based on the class agreement of several models and the explored manual decisions exported by users. We evaluated the applicability and effectiveness of VisRuler via a use case, a usage scenario, and a user study. The evaluation revealed that most users managed to successfully use our system to explore decision rules visually, performing the proposed tasks and answering the given questions in a satisfying way.
    Regret analysis of the Piyavskii-Shubert algorithm for global Lipschitz optimization. (arXiv:2002.02390v4 [cs.LG] UPDATED)
    We consider the problem of maximizing a non-concave Lipschitz multivariate function over a compact domain by sequentially querying its (possibly perturbed) values. We study a natural algorithm designed originally by Piyavskii and Shubert in 1972, for which we prove new bounds on the number of evaluations of the function needed to reach or certify a given optimization accuracy. Our analysis uses a bandit-optimization viewpoint and solves an open problem from Hansen et al.\ (1991) by bounding the number of evaluations to certify a given accuracy with a near-optimal sum of packing numbers.
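    For reference, the algorithm under analysis fits in a few lines in one dimension: maintain the piecewise-linear upper envelope implied by the Lipschitz constant and always evaluate where the envelope peaks. A textbook sketch (the test function and Lipschitz bound are arbitrary choices):

```python
import numpy as np

def piyavskii_shubert(f, a, b, lipschitz, n_evals=30):
    xs, ys = [a, b], [f(a), f(b)]
    for _ in range(n_evals - 2):
        order = np.argsort(xs)
        best_val, best_x = -np.inf, None
        # Between neighbours x_i < x_j, the two Lipschitz cones intersect at
        # x* = (x_i + x_j)/2 + (y_j - y_i)/(2L), with envelope value
        # (y_i + y_j)/2 + L (x_j - x_i)/2.
        for i, j in zip(order[:-1], order[1:]):
            x_new = 0.5 * (xs[i] + xs[j]) + (ys[j] - ys[i]) / (2 * lipschitz)
            ub = 0.5 * (ys[i] + ys[j]) + 0.5 * lipschitz * (xs[j] - xs[i])
            if ub > best_val:
                best_val, best_x = ub, x_new
        xs.append(best_x)
        ys.append(f(best_x))
    k = int(np.argmax(ys))
    return xs[k], ys[k]

# Lipschitz constant 6 covers |d/dx| <= 1 + 5 for this test function.
print(piyavskii_shubert(lambda x: -abs(x - 0.3) + np.sin(5 * x), 0.0, 1.0,
                        lipschitz=6.0))
```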
    Minimax Estimation of Linear Functions of Eigenvectors in the Face of Small Eigen-Gaps. (arXiv:2104.03298v2 [math.ST] UPDATED)
    Eigenvector perturbation analysis plays a vital role in various data science applications. A large body of prior works, however, focused on establishing $\ell_{2}$ eigenvector perturbation bounds, which are often highly inadequate in addressing tasks that rely on fine-grained behavior of an eigenvector. This paper makes progress on this by studying the perturbation of linear functions of an unknown eigenvector. Focusing on two fundamental problems -- matrix denoising and principal component analysis -- in the presence of Gaussian noise, we develop a suite of statistical theory that characterizes the perturbation of arbitrary linear functions of an unknown eigenvector. In order to mitigate a non-negligible bias issue inherent to the natural ``plug-in'' estimator, we develop de-biased estimators that (1) achieve minimax lower bounds for a family of scenarios (modulo some logarithmic factor), and (2) can be computed in a data-driven manner without sample splitting. Noteworthily, the proposed estimators are nearly minimax optimal even when the associated eigen-gap is {\em substantially smaller} than what is required in prior statistical theory.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v1 [cs.LG])
The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature and the lack of supervision make the problem challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
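    One concrete way to use unlabeled data under label shift — a BBSE-style confusion-matrix inversion, named here as a stand-in rather than the paper's estimator — is to recover the new label distribution from the model's predictions on the online stream and reweight the offline risk accordingly:

```python
import numpy as np

def estimate_label_dist(confusion, online_pred_dist):
    # confusion[i, j] = P(predict j | true label i), from offline validation.
    # The prediction distribution satisfies mu_pred = confusion^T @ q, so we
    # solve for the online label distribution q and renormalize.
    q = np.linalg.solve(confusion.T, online_pred_dist)
    q = np.clip(q, 0, None)
    return q / q.sum()

offline_prior = np.array([0.5, 0.3, 0.2])          # label dist in offline data
confusion = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.2, 0.1, 0.7]])
online_pred_dist = np.array([0.3, 0.25, 0.45])     # from unlabeled online data
q = estimate_label_dist(confusion, online_pred_dist)
weights = q / offline_prior                        # per-class importance weights
print("estimated online label dist:", q, "weights:", weights)
```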
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v4 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling GANs' training scheme, are subject to ill-defined discriminator gradients. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of GANs' training dynamics. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.
    Best Subset Selection with Efficient Primal-Dual Algorithm. (arXiv:2207.02058v1 [stat.ME])
    Best subset selection is considered the `gold standard' for many sparse learning problems. A variety of optimization techniques have been proposed to attack this non-convex and NP-hard problem. In this paper, we investigate the dual forms of a family of $\ell_0$-regularized problems. An efficient primal-dual method has been developed based on the primal and dual problem structures. By leveraging the dual range estimation along with the incremental strategy, our algorithm potentially reduces redundant computation and improves the solutions of best subset selection. Theoretical analysis and experiments on synthetic and real-world datasets validate the efficiency and statistical properties of the proposed solutions.  ( 2 min )
    Predicting Out-of-Domain Generalization with Local Manifold Smoothness. (arXiv:2207.02093v1 [cs.LG])
    Understanding how machine learning models generalize to new environments is a critical part of their safe deployment. Recent work has proposed a variety of complexity measures that directly predict or theoretically bound the generalization capacity of a model. However, these methods rely on a strong set of assumptions that in practice are not always satisfied. Motivated by the limited settings in which existing measures can be applied, we propose a novel complexity measure based on the local manifold smoothness of a classifier. We define local manifold smoothness as a classifier's output sensitivity to perturbations in the manifold neighborhood around a given test point. Intuitively, a classifier that is less sensitive to these perturbations should generalize better. To estimate smoothness we sample points using data augmentation and measure the fraction of these points classified into the majority class. Our method only requires selecting a data augmentation method and makes no other assumptions about the model or data distributions, meaning it can be applied even in out-of-domain (OOD) settings where existing methods cannot. In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our manifold smoothness measure and actual OOD generalization on over 3,000 models evaluated on over 100 train/test domain pairs.  ( 3 min )
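    The estimator described is easy to prototype: augment a test point repeatedly and record the fraction of augmentations assigned the majority predicted class. The toy classifier and Gaussian jitter below are placeholders for a real model and augmentation policy.

```python
import numpy as np

def manifold_smoothness(predict, x, augment, n=100, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    preds = np.array([predict(augment(x, rng)) for _ in range(n)])
    _, counts = np.unique(preds, return_counts=True)
    return counts.max() / n  # 1.0 = perfectly smooth neighbourhood

predict = lambda v: int(v.sum() > 0)                          # toy classifier
augment = lambda v, rng: v + rng.normal(scale=0.1, size=v.shape)
x = np.array([0.05, -0.02, 0.04])                             # near the boundary
print(manifold_smoothness(predict, x, augment))
```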
    Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons. (arXiv:2107.02397v6 [cs.LG] UPDATED)
This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and the softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the continuous function space $C([a,b]^d)$ and therefore dense in the Lebesgue spaces $L^p([a,b]^d)$ for $p\in [1,\infty)$. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint bounded closed subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset. Finally, we use numerical experimentation to show that replacing the ReLU activation function with ours improves the experimental results.  ( 3 min )
    Improved Global Guarantees for the Nonconvex Burer--Monteiro Factorization via Rank Overparameterization. (arXiv:2207.01789v1 [math.OC])
    We consider minimizing a twice-differentiable, $L$-smooth, and $\mu$-strongly convex objective $\phi$ over an $n\times n$ positive semidefinite matrix $M\succeq0$, under the assumption that the minimizer $M^{\star}$ has low rank $r^{\star}\ll n$. Following the Burer--Monteiro approach, we instead minimize the nonconvex objective $f(X)=\phi(XX^{T})$ over a factor matrix $X$ of size $n\times r$. This substantially reduces the number of variables from $O(n^{2})$ to as few as $O(n)$ and also enforces positive semidefiniteness for free, but at the cost of giving up the convexity of the original problem. In this paper, we prove that if the search rank $r\ge r^{\star}$ is overparameterized by a constant factor with respect to the true rank $r^{\star}$, namely as in $r>\frac{1}{4}(L/\mu-1)^{2}r^{\star}$, then despite nonconvexity, local optimization is guaranteed to globally converge from any initial point to the global optimum. This significantly improves upon a previous rank overparameterization threshold of $r\ge n$, which is known to be sharp if $\phi$ is allowed to be nonsmooth and/or non-strongly convex, but would increase the number of variables back up to $O(n^{2})$. Conversely, without rank overparameterization, we prove that such a global guarantee is possible if and only if $\phi$ is almost perfectly conditioned, with a condition number of $L/\mu<3$. Therefore, we conclude that a small amount of overparameterization can lead to large improvements in theoretical guarantees for the nonconvex Burer--Monteiro factorization.
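    The substitution at the heart of the approach is short enough to demo: run gradient descent on $f(X)=\phi(XX^{T})$ with a search rank above the true rank. The quadratic $\phi$, dimensions, and step size below are toy choices, echoing the overparameterization the paper analyzes rather than reproducing its exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r = 30, 2, 5                  # overparameterized search rank r > r_true
U = rng.normal(size=(n, r_true))
M_star = U @ U.T                         # low-rank PSD target

phi = lambda M: 0.5 * np.linalg.norm(M - M_star) ** 2
grad_phi = lambda M: M - M_star          # gradient of this toy phi

X = rng.normal(size=(n, r))              # random initialization
lr = 0.005
for _ in range(5000):
    # Chain rule: d/dX phi(X X^T) = 2 * grad_phi(X X^T) @ X.
    X -= lr * 2 * grad_phi(X @ X.T) @ X
print("final objective:", phi(X @ X.T))  # should be near zero at a global optimum
```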
    On the Nash equilibrium of moment-matching GANs for stationary Gaussian processes. (arXiv:2203.07136v2 [stat.ML] UPDATED)
    Generative Adversarial Networks (GANs) learn an implicit generative model from data samples through a two-player game. In this paper, we study the existence of Nash equilibrium of the game which is consistent as the number of data samples grows to infinity. In a realizable setting where the goal is to estimate the ground-truth generator of a stationary Gaussian process, we show that the existence of consistent Nash equilibrium depends crucially on the choice of the discriminator family. The discriminator defined from second-order statistical moments can result in non-existence of Nash equilibrium, existence of consistent non-Nash equilibrium, or existence and uniqueness of consistent Nash equilibrium, depending on whether symmetry properties of the generator family are respected. We further study the local stability and global convergence of gradient descent-ascent methods towards consistent equilibrium.
    Modeling and Correcting Bias in Sequential Evaluation. (arXiv:2205.01607v2 [stat.ML] UPDATED)
    We consider the problem of sequential evaluation, in which an evaluator observes candidates in a sequence and assigns scores to these candidates in an online, irrevocable fashion. Motivated by the psychology literature that has studied sequential bias in such settings -- namely, dependencies between the evaluation outcome and the order in which the candidates appear -- we propose a natural model for the evaluator's rating process that captures the lack of calibration inherent to such a task. We conduct crowdsourcing experiments to demonstrate various facets of our model. We then proceed to study how to correct sequential bias under our model by posing this as a statistical inference problem. We propose a near-linear time, online algorithm for this task and prove guarantees in terms of two canonical ranking metrics. We also prove that our algorithm is information theoretically optimal, by establishing matching lower bounds in both metrics. Finally, we show that our algorithm outperforms the de facto method of using the rankings induced by the reported scores.
    What Do Graph Convolutional Neural Networks Learn?. (arXiv:2207.01839v1 [cs.LG])
    Graph neural networks (GNNs) have gained traction over the past few years for their superior performance in numerous machine learning tasks. Graph Convolutional Neural Networks (GCN) are a common variant of GNNs that are known to have high performance in semi-supervised node classification (SSNC), and work well under the assumption of homophily. Recent literature has highlighted that GCNs can achieve strong performance on heterophilous graphs under certain "special conditions". These arguments motivate us to understand why, and how, GCNs learn to perform SSNC. We find a positive correlation between similarity of latent node embeddings of nodes within a class and the performance of a GCN. Our investigation on underlying graph structures of a dataset finds that a GCN's SSNC performance is significantly influenced by the consistency and uniqueness in neighborhood structure of nodes within a class.

  • Open

    [R] Automated Taxonomic Identification of Insects with Expert-Level Accuracy Using Effective Feature Transfer from Convolutional Networks
    An interesting article in the Systematic Biology journal about identifying insects: https://academic.oup.com/sysbio/article/68/6/876/5368535 See as well: Deep learning and computer vision will transform entomology submitted by /u/1_like_science [link] [comments]  ( 85 min )
    [D] Extracting predicate to apply formal logic rules in autonomous driving dataset or CARLA simulator
In formal-logic-based autonomous driving datasets, we have a set of rules, usually written in first-order logic or temporal logic. But to apply the rules, we need to extract the predicates from the perception system. For example, how do we attach a predicate like standing_at_intersection to the perception scene obtained from an AD dataset like Lyft or Argoverse, or from the CARLA simulator, so that rules can be applied to those specific scenarios? I could not find any papers or explanations of how to connect the predicates in formal logic with the dataset's scene interpretation. Any help or links to resources are appreciated. submitted by /u/projekt_treadstone [link] [comments]  ( 86 min )
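One common pattern (sketched below with hypothetical field names - you would adapt them to the actual Lyft/Argoverse/CARLA schemas) is to write one small grounding function per predicate that maps the perception/map state of a frame to a boolean, and feed the resulting true atoms to the rule checker.

```python
from shapely.geometry import Point, Polygon  # geometry helper for the map check

def standing_at_intersection(scene) -> bool:
    """Grounds the FOL predicate standing_at_intersection(ego) for one frame.

    `scene` is assumed to expose an ego pose/speed and HD-map intersection
    polygons -- these field names are hypothetical placeholders."""
    ego = Point(scene.ego_x, scene.ego_y)
    stopped = abs(scene.ego_speed) < 0.1  # m/s threshold, an assumption
    in_intersection = any(
        Polygon(poly).contains(ego) for poly in scene.intersection_polygons
    )
    return stopped and in_intersection

# The set of true atoms per frame can then be handed to the logic engine:
# atoms = {p.__name__ for p in [standing_at_intersection, ...] if p(scene)}
```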
    [P] Reward function as a way to represent multiple targets
I've been assigned a problem at work with multiple targets, and I've been thinking about the best way to design a model that would optimize towards all of them. An idea that occurred to me is to create a reward function that "encapsulates" all these targets in such a way that the higher the reward, the better the outcome is for all the targets. In my case, it's a task distribution system where workers have the option to decline a task if for whatever reason it doesn't suit them, and one of my targets is to minimize the number of declines. But we also need to make sure the workload is balanced, and that we are not overwhelming someone while under-utilizing the rest of the team; that would be my second target, and we can use the standard deviation as a way to measure workload balance (the closer the std is to 0, the better). Essentially, the targets we want to optimize towards are: reduce the number of declines, and also reduce the std of the overall task distribution. So, my reward function could be: score 0 if the task is declined; if the task is accepted, take the delta of the std before and after. The bigger the delta, the more the std was reduced, so the more even the distribution became. That way, the reward score would represent both my targets (and would be the labels), and then it's simply a matter of training a regression model. Then for a new task, I predict the reward score for each task and worker, and finally assign the tasks by taking the argmax of the predicted scores (see the sketch below). I know that rewards are popular in the RL field, but this wouldn't necessarily be an RL problem. In fact, I googled this idea, but the vast majority of articles and papers covering reward functions are RL-related. I'm wondering if anyone has tried anything like this before, or has any thoughts. All comments are appreciated. submitted by /u/Travolta1984 [link] [comments]  ( 87 min )
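A minimal sketch of the reward labeling described above; the workload lists here are hypothetical per-worker open-task counts.

```python
import numpy as np

def reward(accepted: bool, workloads_before, workloads_after) -> float:
    """Reward for one (task, worker) assignment, exactly as described:
    0 if declined; otherwise the drop in workload std (bigger = more even)."""
    if not accepted:
        return 0.0
    return float(np.std(workloads_before) - np.std(workloads_after))

# Example: assigning to the least-loaded worker reduces the std.
before = [5, 2, 8]          # open tasks per worker
after  = [5, 3, 8]          # task accepted by worker 1
print(reward(True, before, after))  # positive: distribution got more even
```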
    [P] No, we don't have to choose batch sizes as powers of 2
Prompted by a recent discussion on social media, I did some benchmarks and wrote down my thoughts on why it doesn't really make a difference whether we choose batch sizes as powers of 2: https://sebastianraschka.com/blog/2022/batch-size-2.html What is your experience: do you stick to batch sizes as powers of 2, or do you choose batch sizes more freely? Do you notice a substantial difference when you choose batch sizes as powers of 2 (or multiples of 8)? submitted by /u/seraschka [link] [comments]  ( 92 min )
    [D] How do you share a server for multiple training jobs?
First of all, using the cloud is not a cost-effective solution for us. We have an absolute beast of a server, though everything grinds to a halt when some training sessions are going on - some libraries just ignore the num_cpu settings and use all the CPUs (and even when more cores are free, everything seems to get much slower). Here's the build: 2x AMD EPYC 7763 (64 cores, 2 threads each) 2TB memory 8 RTX A6000 4TB SSD (NVMe) How do you all share a single computer resource among co-workers? We have this expensive machine, but when someone runs their training, others have a hard time running basic pandas operations (starting other training jobs just slows down ALL training jobs). To me, it seems like the hardware should be more than enough to run multiple training jobs concurrently. Any tips on how to use it efficiently? One solution I've been considering is to use Docker for each training job and to put hard limits on CPU/memory usage - is this closer to best practice? submitted by /u/tadf2 [link] [comments]  ( 87 min )
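Hard limits via Docker (--cpus and --memory, plus --gpus for device isolation) are indeed a common approach. Independently of that, it's worth checking that each job actually caps its own thread pools: many BLAS/OpenMP-backed libraries size them to all logical cores unless told otherwise before import, which matches the "ignores num_cpu" symptom. A minimal sketch (the "8" is an arbitrary per-job budget):

```python
import os

# These must be set BEFORE importing numpy/scipy/torch, otherwise the
# BLAS/OpenMP thread pools are already sized to all 256 logical cores.
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["MKL_NUM_THREADS"] = "8"
os.environ["OPENBLAS_NUM_THREADS"] = "8"

import torch

torch.set_num_threads(8)          # intra-op parallelism
torch.set_num_interop_threads(8)  # inter-op parallelism
```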
    [P] Concrete dropout implementation for tensorflow 2.0
Hello everyone! I updated the concrete dropout implementation from the original authors to work with TensorFlow 2.0, tweaked the code a bit, and turned it into a pip package! If you are interested, you can find it on PyPI by searching "concretedropout". There is also a link in the comments. For those of you who don't know what concrete dropout is, it's a technique that allows the dropout probability of a layer to be trained, which may save a lot of time since it removes the need to grid-search for the best dropout parameters. For more information, see the original paper: arXiv:1705.07832 submitted by /u/TrPhantom8 [link] [comments]  ( 88 min )
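For readers new to the idea, the heart of it is the concrete (relaxed Bernoulli) drop mask from Gal et al. (2017), which makes the dropout rate p differentiable. A rough numpy sketch of just that relaxation, written from my reading of the paper - not the package's API:

```python
import numpy as np

def concrete_dropout_mask(x, p, temperature=0.1, eps=1e-7):
    """Relaxed (differentiable-in-p) dropout, following Gal et al. 2017.
    In the package this lives inside a Keras layer; this is just the math."""
    u = np.random.uniform(size=x.shape)         # noise, one sample per unit
    logit = (np.log(p + eps) - np.log(1 - p + eps)
             + np.log(u + eps) - np.log(1 - u + eps))
    z = 1.0 / (1.0 + np.exp(-logit / temperature))   # soft "drop" indicator
    keep = 1.0 - z
    return x * keep / (1.0 - p)                 # rescale like inverted dropout

x = np.ones(5)
print(concrete_dropout_mask(x, p=0.3))
```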
    [D] Looking for a fast OCR repo
Currently, we use Google as our OCR service provider, but we've already had some serious issues with them, and their customer support is terrible. Therefore we would like to change, and move away from third-party providers in general. By now we have a sufficient amount of data to train our own OCR model, so I am looking for a custom fine-tunable model that is fast and accurate. I've found PaddleOCR and mmocr, but their inference speed for documents like invoices on CPU is quite slow (10s/page on my computer). I'm looking for something in the 1s/page range, similar to Google's OCR. We probably don't need all the power and language knowledge these libraries provide, as we only operate on documents in mainly 4 Latin languages. Does anybody know a good starting point? submitted by /u/mkeySeraSera [link] [comments]  ( 87 min )
    [P] Using transformers for time-series forecasting
I'm currently using different machine learning techniques on a time series and testing their forecast performance. The dataset has both a target variable and explanatory variables. I've used LSTMs in Python to forecast and, while searching for more recent techniques, found transformers. They seem to have been developed for NLP but have also been used for time-series forecasting. How well do these transformers perform, and are there any resources/libraries I should look into? EDIT: the data I'll be using has daily periodicity without weekends. It will have 2+ years of observations (currently working with 3 years; some other datasets have longer periods but "worse" information) submitted by /u/DoruSonic [link] [comments]  ( 90 min )
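As a starting point, here is a minimal PyTorch sketch of an encoder-only transformer regressor over windowed daily data. Dimensions are placeholders, and a real model would add positional encodings (the encoder is otherwise permutation-equivariant over time steps).

```python
import torch
import torch.nn as nn

class TSTransformer(nn.Module):
    """Encoder-only transformer mapping a window of past observations
    (target + explanatory variables) to a one-step-ahead forecast."""
    def __init__(self, n_features=4, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):            # x: (batch, window, n_features)
        h = self.encoder(self.proj(x))
        return self.head(h[:, -1])   # predict from the last time step

model = TSTransformer()
y_hat = model(torch.randn(32, 60, 4))   # 60-day windows, 4 variables
print(y_hat.shape)                       # torch.Size([32, 1])
```

On the library side, pytorch-forecasting (Temporal Fusion Transformer) and darts both ship transformer-based forecasters worth benchmarking against your LSTM.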
    WACV 2023 Paper Registration. [R]
    Does anyone know how to register for the WACV 2023 conference? submitted by /u/jeryyjohnson [link] [comments]  ( 85 min )
  • Open

    Tom Cruise without the power of Scientology.
    submitted by /u/cganimater [link] [comments]  ( 83 min )
    AI Dream 58 - Incredible Stellar Trip - vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 84 min )
    GitHub Copilot is the first real product of large language models
    submitted by /u/bendee983 [link] [comments]  ( 83 min )
    The four big misconceptions of AI research
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 84 min )
    Closest majors/fields to AI
I have just graduated from school and I wanted to major in AI engineering; unfortunately, I found out that I can't, even for close majors like computer science. The only two options I have now are Software Engineering and Computer Engineering, which are both far from my interests, such as AI, machine learning, simulations, and 3D engines. My second plan is now to get a master's degree in AI after finishing my bachelor's degree in one of these two majors, which I could later afford on my own given their job prospects. My problem with Software Engineering is that it's too restrictive and I'm not a big fan of making software, apps and websites, but I know two friends who have already majored in it. As for Computer Engineering, it does seem more interesting - the making of hardware components - but it includes Electrical Engineering, which I'm NOT a fan of. My view of majors other than AI is pretty superficial, so I'm not claiming to have an educated opinion on what to pick. Should I go with Software Engineering or Computer Engineering, given my plan and interests? I'm open to opinions, even about majors other than these two! submitted by /u/CATEXEBRAIN [link] [comments]  ( 88 min )
    Introducing the RAVEN MVP - a general purpose AI companion (with a live DEMO)
    submitted by /u/DavidKShapiro [link] [comments]  ( 84 min )
    The Shortest Guide to Launch Your Career The AI Way (Infographic)
This infographic shows the many ways in which AI is transforming business, as well as the leading job roles created by this change and the top skills needed to ride the AI wave. submitted by /u/Emily-joe [link] [comments]  ( 84 min )
    Brain-Supervised Image Editing
    Brain-Supervised Image Editing Despite recent advances in deep neural models for semantic image editing, present approaches are dependent on explicit human input. Previous work assumes the availability of manually curated datasets for supervised learning, while for unsupervised approaches the human inspection of discovered components is required to identify those which modify worthwhile semantic features. Here, we present a novel alternative: the utilization of brain responses as a supervision signal for learning semantic feature representations. Participants (N=30) in a neurophysiological experiment were shown artificially generated faces and instructed to look for a particular semantic feature, such as "old" or "smiling", while their brain responses were recorded via electroencephalog…  ( 92 min )
    Why is open ended conversational AI not more popular?
    Up until recently with projects like LaMDA and BlenderBot, the area of open ended conversational AI has either been completely untouched or kept purely as research. Very few of these projects have actually been used in applications for just having a two way conversation with a user. For people in the field, why is this the case? Do companies not see a path forward with open ended dialogue systems? submitted by /u/holamyeung [link] [comments]  ( 89 min )
    still plugging Starryai
    submitted by /u/rikusorasephiroth [link] [comments]  ( 83 min )
    Hi all, every week I host AI sessions on DPhi. While all our resources are free, we create them with passion & quality. Happy to share this upcoming session on Tesla Autopilot. Would love to see you join. Link for it - https://dphi.tech/live-sessions/tesla-autopilot-ml?utm_source=reddit. 😃
    submitted by /u/muditjps [link] [comments]  ( 84 min )
  • Open

    "Watch and Match: Supercharging Imitation with Regularized Optimal Transport (ROT)", Haldar et al 2022
    submitted by /u/gwern [link] [comments]  ( 84 min )
    TRPO Practical Implementation vs Lagrangian
    Hi all, so TRPO enforces a constraint on the approximated KL divergence (which is clear to me). However, I was wondering why they solve such a constrained optimization problem using the "hard way" (i.e., a linear approx. on the objective and the quadratic one on the constraint), when they could have used a simpler Lagrangian dual. Is there any advantage in doing that over using a Lagrangian? Thanks! submitted by /u/Beautiful_Zebra_198 [link] [comments]  ( 84 min )
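For context on why the "hard" route is attractive (a sketch of my reading of the paper, not a definitive answer): with the linear/quadratic approximations, the subproblem has a closed-form solution, and the trust-region radius $\delta$ is far easier to fix across tasks than a Lagrange multiplier would be.

```latex
% TRPO's approximated subproblem and its closed-form (natural-gradient) step:
\max_{x}\; g^{\top}x
\quad\text{s.t.}\quad \tfrac{1}{2}\,x^{\top}Hx \le \delta
\qquad\Longrightarrow\qquad
x^{\star} \;=\; \sqrt{\frac{2\delta}{g^{\top}H^{-1}g}}\; H^{-1}g,
% where g is the policy gradient and H the KL Hessian (Fisher matrix);
% H^{-1}g is computed with conjugate gradients plus a line search.
% The penalty/Lagrangian form \max_x\, g^{\top}x - \tfrac{\beta}{2}\,x^{\top}Hx
% yields x = \tfrac{1}{\beta}H^{-1}g: the same direction, but a single fixed
% \beta that behaves well across problems is hard to choose, whereas \delta
% is an interpretable KL radius.
```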
    Could someone help me with this question or point me to some helpful resources?
Check if, and (if so) where, a static environment is implicitly assumed in the Monte Carlo algorithm, Temporal-Difference learning (TD(0)), the Dyna-Q architecture, and R-Max. How could you modify the relevant learning methods so that they can, in principle, adapt to changing environments? It can be assumed that ε is sufficiently large. submitted by /u/Garbage-Shoddy [link] [comments]  ( 84 min )
Goal-Conditioned policy on Ant Maze
Hi, I'm trying to learn a goal-conditioned policy (a policy that makes an agent reach a goal that changes at each episode) on Ant Maze (the MuJoCo one). This task looks really tough: I easily learned that kind of policy on a grid world, but my agent (tried with SAC and DDPG) failed to find a goal-conditioned policy in D4RL. Even the DDPG code given by the D4RL authors fails to do it. Did anybody here already manage it? Do you have any git repository to share with me, or any tips? Thanks a lot in advance. Links: D4RL environments: https://github.com/rail-berkeley/d4rl D4RL DDPG implementation I'm talking about: https://github.com/rail-berkeley/d4rl_evaluations/tree/master/bcq/continuous_bcq submitted by /u/hbonnavaud [link] [comments]  ( 85 min )
    Learning the CartPole so fast
That you do not have time to get bored watching it in real time. So it sounds like a challenge: does any of you know a faster learning algorithm for Gym's CartPole? Sorry, the repository is messy; cartpole_play.py is the main file. Its local dependencies are sdr_util.py and sdr_value_map.py - these are all that is needed. Its global dependencies are numpy, numba, gym, and pygame if you want rendering. A short explanation of the algorithm: after each fall, two bit-pair correlation value maps are updated to chart dangerous states in the environment; the agent then picks the least dangerous action at every step. Somewhat like a Q-table, yet quite efficient, since it highlights the specific value correlations between different state parameters that are most significant. submitted by /u/blimpyway [link] [comments]  ( 84 min )
Are there any human-level chess AIs that don't use MCTS (Monte-Carlo Tree Search)?
From the papers I've read, it seems like all the existing methods (AlphaGo, AlphaGo Zero, MuZero, EfficientZero) use MCTS at some point. Are there methods that don't perform search (i.e., directly predict the action from the state only, maybe with something like policy gradients) that have been shown to reach human-level performance at chess? submitted by /u/Lairv [link] [comments]  ( 90 min )
  • Open

    DSC Weekly 05 July 2022: Standardizing a Metaverse
    Facebook’s announcement last year about creating the Metaverse (and subsequent rebranding to Meta) kicked off a great deal of PR from the tech industry as everyone from established game companies to decentralized finance wildcats raced to plant their flag in the ground. Roughly a year has passed and in the interim the initial fervor has… Read More »DSC Weekly 05 July 2022: Standardizing a Metaverse The post DSC Weekly 05 July 2022: Standardizing a Metaverse appeared first on Data Science Central.  ( 20 min )
    Wanna become Value-driven? Time for a Culture Shift!
    I am honored to collaborate on this week’s blog with Fran Willis White, an industry expert on the role of change leadership and employee empowerment to drive cultural transformation.  In collaborating on this blog, I discovered many similarities in the role of empowerment in the data science development process to optimize business outcomes, as well… Read More »Wanna become Value-driven? Time for a Culture Shift! The post Wanna become Value-driven? Time for a Culture Shift! appeared first on Data Science Central.  ( 19 min )
    Education Trends 2022: Data Science in schools
    Data Science is a growing field that has emerged in many key areas of our world. Data Science has become a global phenomenon and has significantly improved the performance of many industries. Data Science has even incorporated education under its umbrella. Today we will be discussing the importance of data science for education & some… Read More »Education Trends 2022: Data Science in schools The post Education Trends 2022: Data Science in schools appeared first on Data Science Central.  ( 20 min )
    Databricks open sourcing delta lake is good news for AI
    Last week, Databricks open sourced all of Delta Lake (Delta Lake 2.0) to the Linux Foundation.  There is also a new release of MLflow (MLflow 2.0), which is a machine learning operations platform for management of ML pipelines.  In Databricks parlance, a Delta Lake represents a data architecture that has both storage and analytics capabilities; … Read More »Databricks open sourcing delta lake is good news for AI  The post Databricks open sourcing delta lake is good news for AI  appeared first on Data Science Central.  ( 17 min )
  • Open

    Binary Classification Tutorial with the Keras Deep Learning Library
    Keras is a Python library for deep learning that wraps the efficient numerical libraries TensorFlow and Theano. Keras allows you to quickly and simply design and train neural network and deep learning models. In this post you will discover how to effectively use the Keras library in your machine learning project by working through a […] The post Binary Classification Tutorial with the Keras Deep Learning Library appeared first on Machine Learning Mastery.  ( 54 min )
    Dropout Regularization in Deep Learning Models With Keras
    A simple and powerful regularization technique for neural networks and deep learning models is dropout. In this post you will discover the dropout regularization technique and how to apply it to your models in Python with Keras. After reading this post you will know: How the dropout regularization technique works. How to use dropout on […] The post Dropout Regularization in Deep Learning Models With Keras appeared first on Machine Learning Mastery.  ( 37 min )
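For a taste of the technique before reading the full post, a minimal Keras sketch (layer sizes and input shape are placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dropout randomly zeroes 20% of the previous layer's activations at each
# training step, which discourages co-adapted features.
model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),
    layers.Dropout(0.2),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```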
  • Open

    Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration
    If you use the default lifecycle configuration for your domain or user profile in Amazon SageMaker Studio and use Amazon SageMaker Data Wrangler for data preparation, then this post is for you. In this post, we show how you can create a Data Wrangler flow and use it for data preparation in a Studio environment […]  ( 8 min )
  • Open

    Watching the Watchers: Democratizing AI To Audit The State
    Socially disadvantaged communities have often raised legitimate concerns about being over-policed and under-protected. Now, the rise of AI…  ( 12 min )
    How Much Does an AI Solution Cost?
    Since a customized AI solution is always individual, no one can give you a general cost estimate.  ( 8 min )
  • Open

    Computer Graphics Artist Xueguo Yang Shares Fractal Art Series This Week ‘In the NVIDIA Studio’
    Putting art, mathematics and computers together in the mid-1980s created a new genre of digital media: fractal art. In the NVIDIA Studio this week, computer graphics (CG) artist, educator and curator Xueguo Yang shares his insights behind fractal art — which uses algorithms to artistically represent calculations derived from geometric objects as digital images and animations. The post Computer Graphics Artist Xueguo Yang Shares Fractal Art Series This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 7 min )

  • Open

    “Japanese Samurai”
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Researchers from George Mason and Emory University Develop ‘RES’: a Robust Python Framework for Learning to Explain DNNs (Deep Neural Networks) with Explanation Supervision
The study of explainability, or explainable AI, is currently receiving a lot of attention as DNNs become accessible in a variety of application domains. In an effort to open the black box of DNNs, many explainability techniques have been proposed that attempt to provide a local explanation of a DNN's prediction for a particular instance, such as saliency maps for understanding which sub-parts of an instance are most responsible for the model's prediction. While local explanation techniques have seen rapid growth in research in recent years, the majority of attention has been placed on generating explanations rather than on understanding whether the explanations are accurate or reasonable, what to do if they are not, and how to modify the model to produce more accurate or reasonable explanations. Continue reading | Checkout the paper and github submitted by /u/ai-lover [link] [comments]  ( 84 min )
    Researchers at Stanford have developed an Artificial Intelligence (AI) model, EG3D, that can generate random images of faces and other objects with high resolution together with underlying geometric structures
Artificially intelligent models have recently advanced to the point that users will soon be able to utilize them to immediately construct and alter nearly photorealistic three-dimensional scenes from the comfort of their laptops. Since these technologies make it simple to generate hyperrealistic avatars, they will revolutionize the way artists working on video games and CGI for movies approach their work. For quite some time, AIs have been able to create realistic 2D images. However, 3D scenes have proven more challenging due to the enormous computing power needed. The AI model EG3D, created by a team of Stanford academics, can be used to produce random high-resolution images of faces and other objects with an underlying geometric structure. This model is one of the first 3D models in use to reach rendering quality close to photorealism. Continue reading | Checkout the paper, github submitted by /u/ai-lover [link] [comments]  ( 84 min )
    AI Researchers deserve a Nobel Prize!
Why is there a Nobel Prize? The Nobel Prize was set up when businessman and entrepreneur Alfred Nobel died and left the majority of his fortune to the establishment of prizes in physics, chemistry, physiology or medicine, literature, and peace. His will stated that the prizes should be awarded to "those who, during the preceding year, shall have conferred the greatest benefit to humankind." [source: https://www.nobelprize.org ] I think AI technologies are everywhere: in physics, chemistry, physiology, medicine, literature, peace, etc. The Nobel Foundation should dedicate a prize to AI researchers who invent technologies that change human life. Who agrees? submitted by /u/aymenSekhri [link] [comments]  ( 83 min )
    Is General Intelligence "Compact"? | LessWrong
    submitted by /u/DragonGod2718 [link] [comments]  ( 83 min )
    Bringing Python to Browser for Doing Image Processing
Ever wondered whether we could learn Python in the browser and run machine learning apps? Recently I came to know about PyScript, which can be used to run Python in the browser. Still, I couldn't find a single example or post demonstrating image processing using PyScript, so I decided to figure it out myself, create one, and share it with the community. The article can be found at the link below: https://blog.devgenius.io/bringing-python-to-browser-for-doing-image-processing-c34f5bba9c1d submitted by /u/VikasOjha666 [link] [comments]  ( 83 min )
    DARK TEMPLES ESCAPADE | 4K DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
The first really "scary" Windows bot will be one that just automatically clicks any X in the top right corner of any new window.
    submitted by /u/OmitsWordsByAccident [link] [comments]  ( 83 min )
    Implementing Simple Neural Network in C#
    submitted by /u/RubiksCodeNMZ [link] [comments]  ( 83 min )
    NBA Teams Mug Rugs, (Coasters) Which one is the best?
    submitted by /u/aysheshandmade [link] [comments]  ( 82 min )
    6 Best Artificial Intelligence courses for Healthcare You should learn 2022
    https://codingvidya.com/best-artificial-intelligence-courses-for-healthcare/ submitted by /u/Lakshmireddys [link] [comments]  ( 82 min )
  • Open

    [P] Poniard: a companion library for scikit-learn that helps with model evaluation and comparison
TL;DR: Check out Poniard, a new Python library that helps with machine learning model evaluation. You can go ahead and install it with pip. Links to source code and documentation at the end of this post. ----- For the past few months I've been working on Poniard, a Python library that streamlines ML model evaluation and comparison, built on top of scikit-learn. In a nutshell: load some data, select some models, some metrics and a cross-validation strategy, and go to town. Poniard tries to have a small footprint, a simple API and sane defaults. But above all it strives to have the user stay in control of their modeling experience; you should always know what's going on. This is deliberately NOT an AutoML tool. When I started this project I was trying to speed up a very uninteresting process, i.e., looping through multiple estimators and arriving at a list of metrics for comparison. On the way I included easy hyperparameter tuning, plotting, an extensible plugin framework (out of the box it includes Weights and Biases and Pandas Profiling) and as much as I could to make the experience simple and transparent. Poniard is not exactly groundbreaking, and there are projects in a similar vein that do so much more. In contrast, they tend to have a more complicated API and more dependencies, which are some of the things I actively tried to avoid. Github Example notebooks (including Colab links) Documentation PyPI submitted by /u/rafa10pj [link] [comments]  ( 85 min )
    [P] Bulk AI Text Generation (No/Low code)
https://textgenerator.app.nz/bulk-text-generator You can upload a CSV and get lots of text generated; it works in many languages, and for code too. There's also an API. The main selling points (vs OpenAI, who is the main competitor): works faster; Curie/Babbage quality, but also works across languages/code without needing to specify which model; massively cheaper pricing/huge cost savings :) ; easier to control (you can specify max_sentences to make it generate up to a specific number of sentences, and min_probability to make it generate the next few likely words, for code/writing autocomplete). I originally created https://textgenerator.app.nz/ primarily as an API for developers, but the bulk generator now allows non-technical types to pre-generate a lot of variety too: branching stories/games/marketing content/code/summaries/etc. There's actually a massive number of use cases that one will never be able to anticipate, which is exciting too. submitted by /u/leepenkman [link] [comments]  ( 86 min )
    [D] Backpropagating from GPT-2's output
    I am working on a research project about controllable generation with GPT. I am stuck, so I hope you are able to help me out. I will try to explain the issue as clear as possible, so bear with me. The approach I am pursuing right now is adding a frozen classifier on top of gpt that should steer the model in generating the right class, which is a grammatical property of the generated output. However, the autoregressive nature of GPT complicates things a bit. I cannot simply backpropagate through the generation process (greedy / beam search). I tried adding the classifier on the last input token to avoid the generation process but unsurprisingly this does not yield sufficient performance. How would you tackle this? Is it even feasible? submitted by /u/_Arsenie_Boca_ [link] [comments]  ( 85 min )
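One common workaround (which may or may not suit your grammatical-property setting) is to replace the hard argmax at each decoding step with a Gumbel-Softmax relaxation, so a (nearly) one-hot token choice still carries gradients back into GPT while the frozen classifier consumes the resulting soft embeddings. A sketch with hypothetical lm/classifier wrappers around your actual models:

```python
import torch
import torch.nn.functional as F

def differentiable_decode(lm, classifier, input_embeds, steps, tau=1.0):
    """Sketch: decode `steps` tokens with a Gumbel-Softmax relaxation so the
    frozen classifier's loss can backpropagate into the LM. `lm` is assumed
    to take input embeddings and return per-position vocabulary logits, and
    `classifier` to score a sequence of embeddings -- both are hypothetical
    wrappers, not the real GPT-2 API."""
    emb_matrix = lm.token_embedding.weight          # (vocab, dim), assumed name
    embeds = input_embeds
    for _ in range(steps):
        logits = lm(embeds)[:, -1]                  # next-token logits
        # hard=True yields a one-hot sample, but gradients flow through
        # the underlying softmax (straight-through estimator).
        y = F.gumbel_softmax(logits, tau=tau, hard=True)
        next_emb = y @ emb_matrix                   # soft token embedding
        embeds = torch.cat([embeds, next_emb[:, None]], dim=1)
    return classifier(embeds)                       # class logits / loss input
```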
    [D] Does anyone here use Google's seqio library?
In my research I recently came across this library from Google: seqio - task-based datasets, preprocessing, and evaluation for sequence models. From the citation it seems it was released jointly with another Google library, t5x. From the paper and the docs, it sounds quite similar to Hugging Face's datasets library, albeit perhaps slightly more opinionated. I was hoping to find a more thorough comparison with pre-existing data loading/processing libraries but couldn't find one (they mostly focus on t5x in the paper). Has anyone here used it? What was your experience? To me it seems a bit redundant, but I haven't been able to take a deeper dive. Thanks :) submitted by /u/thesofakillers [link] [comments]  ( 84 min )
    [D] How do you share big datasets with your team and others?
    Looking for a bit of a discussion. I'm wondering how you collaborate on data... i.e. how do you share big datasets with data scientists/engineers, within and outside of your team? Do you just push it into a simple DB, do you upload it to Kaggle (if non-sensitive) or via Google Drive/OneDrive? What if the dataset gets updated frequently? I'm working with a customer and sharing data is a bit of a pain. submitted by /u/dmart89 [link] [comments]  ( 91 min )
    [R] Masking for Representation Learning in Vision
    A blog about representation learning from masked images, what makes a good mask, and how to learn such masks: https://akosiorek.github.io/ml/2022/07/04/masking_repr_learning_vision.html. Based on a recent ICML paper: Shi et. al, "Adversarial Masking for Self-Supervised Learning", ICML 2022. submitted by /u/ErrorDry4380 [link] [comments]  ( 84 min )
    [P] Feathr - An Open-Source, Enterprise-Grade and High-Performance Feature Store
Hi everyone! We are engineers from Microsoft/LinkedIn, and we released an open-source feature store called Feathr a few weeks ago (https://github.com/linkedin/feathr). It has many highlights, listed below. Feel free to check out the repository and let us know if there are any questions! We also have a few blog posts and recordings in case folks want to learn a bit more about it: Open Sourcing Feathr Feathr on Azure. Tech talks on Feathr And its highlights include (more highlights are here): Battle-tested in production for more than 6 years: LinkedIn has been using Feathr in production for over 6 years and has a dedicated team improving it. Scalable with built-in optimizations: for example, based on some internal use cases, Feathr can process billions of rows and PB-scale data with built-in optimizations such as bloom filters and salted joins. Rich support for point-in-time joins and aggregations: Feathr has high-performance built-in operators designed for feature stores, including time-based aggregation, sliding-window joins, and look-up features, all with point-in-time correctness. Derived features and a centralized feature registry, which encourage feature consumers to build features on top of existing features, encouraging feature reuse. Screenshots of the Feathr UI: https://preview.redd.it/3fri2r3qoi991.png?width=3584&format=png&auto=webp&s=5dfe14233b2a8805c50bedd5bfed4bbb31bd0654 submitted by /u/zxzxy1988 [link] [comments]  ( 86 min )
    [D] Is there any deep learning algorithm based on divide and conquer?
When dealing with very large data, e.g. very long video datasets, the main problem is long training time. Most techniques use distributed deep learning to solve the problem robustly. I have an idea: divide the dataset into small subsets and train a model on each. After that, use those models' predictions as features, put them into another model, and train a second model to predict the output. Like divide and conquer, except here we divide the dataset, train a model per piece, and combine the prediction results into one (see the sketch below). I have done some research on the internet about deep learning algorithms based on divide and conquer, but there don't seem to be many articles about it. Is it correct to think this way? Does anyone know any papers about this? Thank you so much. submitted by /u/tmclouisluk [link] [comments]  ( 91 min )
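What's described is close to stacked generalization (Wolpert, 1992): train one base model per shard, then a second-level model on their predictions. A minimal scikit-learn sketch of the pattern, with synthetic data standing in for the video models; note that proper stacking uses out-of-fold predictions for the meta-model to avoid leakage, which this sketch skips for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((6000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Divide": one base model per data shard.
shards = np.array_split(np.arange(len(X_tr)), 3)
base = [LogisticRegression().fit(X_tr[idx], y_tr[idx]) for idx in shards]

# "Conquer": second-level model trained on the base models' predictions.
meta_feats = np.column_stack([m.predict_proba(X_tr)[:, 1] for m in base])
meta = LogisticRegression().fit(meta_feats, y_tr)

test_feats = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base])
print("stacked accuracy:", meta.score(test_feats, y_te))
```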
    [D] Which U.S. universities are actively studying generative models?
Although there are university rankings such as U.S. News, it is difficult to find the universities that are good at a specific field one is interested in. We all know that Stanford and Berkeley are good at generative models, but who else? Please give me the name of the university (+ the name of the professor if possible) and the paper they published. It would be especially meaningful if the university is not very famous and their paper is outstanding. submitted by /u/SnooPandas3529 [link] [comments]  ( 85 min )
  • Open

    "Remaking EfficientZero (as best I can)", Hoagy (experiences implementing Muzero)
    submitted by /u/gwern [link] [comments]  ( 83 min )
[Re-Reading Reinforcement Learning by Sutton and Barto] Chapter 2 - Multi-armed Bandits
Here's the update for week 2 of reading the book! This week's reading is also quite short at 17 pages - that's barely 2.5 pages per day! The chapter covers the first basic concepts and the gradient bandit algorithms. As far as live discussions go, a Discord server has been created just for this purpose. See https://discord.gg/Juafpk23 (Thanks u/duh619 for creating the channel). Use Discord at your own discretion, though. (https://spyware.neocities.org/articles/discord.html) The plan is to have weekly discussions - the search for a common time slot is ongoing until tomorrow night. To supplement your reading, you can find summaries of the chapters on YouTube: https://www.youtube.com/watch?v=4SLGEq_HZxk&list=PLnn6VZp3hqNvRrdnMOVtgV64F_O-61C1D&index=1 (Thanks to u/taplik_to_rehvani for pointing this out). Happy reading! Hope to see some comments discussing questions and ideas from this week's chapter! submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 83 min )
    RL with differentiable environment
So bear with me here: I have some experience in other types of ML, but I don't really know much about RL. I have a problem where I want a neural network to see some history of inputs, choose a set of parameters, and then have that set of parameters modify a simulation that eventually spits back a loss. This is all a time series, so those losses can either be viewed per sample or be batched up in some way. Anyway, it seems to me that RL in general has to deal with interacting with some big unknown external system (the "environment"). However, in my scenario, that simulation is actually a relatively straightforward algorithm that I've already implemented in PyTorch, and it is differentiable. Does this buy me anything that "normal RL" has to hack its way around? Any insights here are greatly appreciated. Thanks in advance. submitted by /u/saw79 [link] [comments]  ( 85 min )
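If the whole chain (network → parameters → simulator → loss) is differentiable, it does buy you something: you can often skip policy-gradient machinery entirely and backpropagate through the simulator (sometimes called differentiable simulation / analytic gradients). A toy sketch with a stand-in simulator:

```python
import torch
import torch.nn as nn

def simulate(params, inputs):
    # Stand-in for a differentiable PyTorch simulator: any chain of
    # differentiable ops from chosen parameters (+ inputs) to a loss.
    return ((inputs[:, :4] * params).sum(dim=1) ** 2).mean()

policy = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 4))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(1000):
    history = torch.randn(8, 16)      # batch of input histories
    params = policy(history)          # network picks simulation parameters
    loss = simulate(params, history)  # gradients flow end to end
    opt.zero_grad()
    loss.backward()                   # no policy gradients needed
    opt.step()
```

One known caveat: over long or chaotic rollouts the simulator's gradients can become ill-conditioned, in which case RL-style gradient estimators become attractive again.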
    Add noise in State Space
I am wondering whether, and how, it makes sense to add noise in the state space at random times during training. submitted by /u/Mariam_Dundua [link] [comments]  ( 83 min )
  • Open

    Using Activation Functions in Neural Networks
    Activation functions play an integral role in neural networks by introducing non-linearity. This nonlinearity allows neural networks to develop complex representations and functions based on the inputs that would not be possible with a simple linear regression model. There have been many different non-linear activation functions proposed throughout the history of neural networks. In this […] The post Using Activation Functions in Neural Networks appeared first on Machine Learning Mastery.  ( 17 min )
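The non-linearity point can be seen in a few lines: without an activation in between, stacked layers collapse to a single linear map (a small numpy illustration, not taken from the post itself).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 8)), rng.standard_normal((8, 3))
x = rng.standard_normal(4)

# Two stacked linear layers are exactly one linear layer.
linear_stack = x @ W1 @ W2
collapsed = x @ (W1 @ W2)
print(np.allclose(linear_stack, collapsed))   # True

# A nonlinearity in between breaks the collapse:
relu_stack = np.maximum(x @ W1, 0.0) @ W2
print(np.allclose(relu_stack, collapsed))     # False (in general)
```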
  • Open

    Measuring Forgetting of Memorized Training Examples. (arXiv:2207.00099v1 [cs.LG])
Machine learning models exhibit two seemingly contradictory phenomena: training data memorization and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what extent models "forget" the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convexity can prevent forgetting from happening in the worst-case, standard image and speech models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets -- for instance those examples used to pre-train a model -- may observe privacy benefits at the expense of examples seen later.  ( 2 min )
    Community detection and percolation of information in a geometric setting. (arXiv:2006.15574v2 [stat.ML] UPDATED)
    We make the first steps towards generalizing the theory of stochastic block models, in the sparse regime, towards a model where the discrete community structure is replaced by an underlying geometry. We consider a geometric random graph over a homogeneous metric space where the probability of two vertices to be connected is an arbitrary function of the distance. We give sufficient conditions under which the locations can be recovered (up to an isomorphism of the space) in the sparse regime. Moreover, we define a geometric counterpart of the model of flow of information on trees, due to Mossel and Peres, in which one considers a branching random walk on a sphere and the goal is to recover the location of the root based on the locations of leaves. We give some sufficient conditions for percolation and for non-percolation of information in this model.  ( 2 min )
    PROTOtypical Logic Tensor Networks (PROTO-LTN) for Zero Shot Learning. (arXiv:2207.00433v1 [cs.CV])
    Semantic image interpretation can vastly benefit from approaches that combine sub-symbolic distributed representation learning with the capability to reason at a higher level of abstraction. Logic Tensor Networks (LTNs) are a class of neuro-symbolic systems based on a differentiable, first-order logic grounded into a deep neural network. LTNs replace the classical concept of training set with a knowledge base of fuzzy logical axioms. By defining a set of differentiable operators to approximate the role of connectives, predicates, functions and quantifiers, a loss function is automatically specified so that LTNs can learn to satisfy the knowledge base. We focus here on the subsumption or \texttt{isOfClass} predicate, which is fundamental to encode most semantic image interpretation tasks. Unlike conventional LTNs, which rely on a separate predicate for each class (e.g., dog, cat), each with its own set of learnable weights, we propose a common \texttt{isOfClass} predicate, whose level of truth is a function of the distance between an object embedding and the corresponding class prototype. The PROTOtypical Logic Tensor Networks (PROTO-LTN) extend the current formulation by grounding abstract concepts as parametrized class prototypes in a high-dimensional embedding space, while reducing the number of parameters required to ground the knowledge base. We show how this architecture can be effectively trained in the few and zero-shot learning scenarios. Experiments on Generalized Zero Shot Learning benchmarks validate the proposed implementation as a competitive alternative to traditional embedding-based approaches. The proposed formulation opens up new opportunities in zero shot learning settings, as the LTN formalism allows to integrate background knowledge in the form of logical axioms to compensate for the lack of labelled examples.  ( 3 min )
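One plausible grounding of such a distance-based \texttt{isOfClass} truth value, sketched for illustration (the paper's exact grounding may differ):

```python
import numpy as np

def is_of_class(embedding, prototype, alpha=1.0):
    """Fuzzy truth value in (0, 1] for isOfClass(x, C): the closer the
    object embedding is to the class prototype, the truer the atom.
    exp(-alpha * d^2) is one plausible choice, not the paper's exact one."""
    d2 = np.sum((embedding - prototype) ** 2)
    return float(np.exp(-alpha * d2))

proto_dog = np.array([1.0, 0.0])
print(is_of_class(np.array([0.9, 0.1]), proto_dog))  # ~0.98, "mostly true"
```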
    Distributed Influence-Augmented Local Simulators for Parallel MARL in Large Networked Systems. (arXiv:2207.00288v1 [cs.LG])
    Due to its high sample complexity, simulation is, as of today, critical for the successful application of reinforcement learning. Many real-world problems, however, exhibit overly complex dynamics, which makes their full-scale simulation computationally slow. In this paper, we show how to decompose large networked systems of many agents into multiple local components such that we can build separate simulators that run independently and in parallel. To monitor the influence that the different local components exert on one another, each of these simulators is equipped with a learned model that is periodically trained on real trajectories. Our empirical results reveal that distributing the simulation among different processes not only makes it possible to train large multi-agent systems in just a few hours but also helps mitigate the negative effects of simultaneous learning.  ( 2 min )
    A Deep-Learning-Aided Pipeline for Efficient Post-Silicon Tuning. (arXiv:2207.00336v1 [cs.LG])
In post-silicon validation, tuning is to find the values for the tuning knobs, potentially as a function of process parameters and/or known operating conditions. In this sense, a more efficient tuning requires identifying the most critical tuning knobs and process parameters in terms of a given figure-of-merit for a Device Under Test (DUT). This is often conducted manually by experienced experts. However, with increasingly complex chips, manual inspection of a large number of raw variables has become more challenging. In this work, we leverage neural networks to efficiently select the most relevant variables and present a corresponding deep-learning-aided pipeline for efficient tuning.  ( 2 min )
    Improving Disease Classification Performance and Explainability of Deep Learning Models in Radiology with Heatmap Generators. (arXiv:2207.00157v1 [eess.IV])
    As deep learning is widely used in the radiology field, the explainability of such models is increasingly becoming essential to gain clinicians' trust when using the models for diagnosis. In this research, three experiment sets were conducted with a U-Net architecture to improve the classification performance while enhancing the heatmaps corresponding to the model's focus through incorporating heatmap generators during training. All of the experiments used the dataset that contained chest radiographs, associated labels from one of the three conditions ("normal", "congestive heart failure (CHF)", and "pneumonia"), and numerical information regarding a radiologist's eye-gaze coordinates on the images. The paper (A. Karargyris and Moradi, 2021) that introduced this dataset developed a U-Net model, which was treated as the baseline model for this research, to show how the eye-gaze data can be used in multi-modal training for explainability improvement. To compare the classification performances, the 95% confidence intervals (CI) of the area under the receiver operating characteristic curve (AUC) were measured. The best method achieved an AUC of 0.913 (CI: 0.860-0.966). The greatest improvements were for the "pneumonia" and "CHF" classes, which the baseline model struggled most to classify, resulting in AUCs of 0.859 (CI: 0.732-0.957) and 0.962 (CI: 0.933-0.989), respectively. The proposed method's decoder was also able to produce probability masks that highlight the determining image parts in model classifications, similarly as the radiologist's eye-gaze data. Hence, this work showed that incorporating heatmap generators and eye-gaze information into training can simultaneously improve disease classification and provide explainable visuals that align well with how the radiologist viewed the chest radiographs when making diagnosis.  ( 3 min )
    Agent with Tangent-based Formulation and Anatomical Perception for Standard Plane Localization in 3D Ultrasound. (arXiv:2207.00475v1 [cs.CV])
    Standard plane (SP) localization is essential in routine clinical ultrasound (US) diagnosis. Compared to 2D US, 3D US can acquire multiple view planes in one scan and provide complete anatomy with the addition of coronal plane. However, manually navigating SPs in 3D US is laborious and biased due to the orientation variability and huge search space. In this study, we introduce a novel reinforcement learning (RL) framework for automatic SP localization in 3D US. Our contribution is three-fold. First, we formulate SP localization in 3D US as a tangent-point-based problem in RL to restructure the action space and significantly reduce the search space. Second, we design an auxiliary task learning strategy to enhance the model's ability to recognize subtle differences crossing Non-SPs and SPs in plane search. Finally, we propose a spatial-anatomical reward to effectively guide learning trajectories by exploiting spatial and anatomical information simultaneously. We explore the efficacy of our approach on localizing four SPs on uterus and fetal brain datasets. The experiments indicate that our approach achieves a high localization accuracy as well as robust performance.  ( 3 min )
    Using Machine Learning to Anticipate Tipping Points and Extrapolate to Post-Tipping Dynamics of Non-Stationary Dynamical Systems. (arXiv:2207.00521v1 [cs.LG])
In this paper we consider the machine learning (ML) task of predicting tipping point transitions and long-term post-tipping-point behavior associated with the time evolution of an unknown (or partially unknown), non-stationary, potentially noisy and chaotic, dynamical system. We focus on the particularly challenging situation where the past dynamical state time series that is available for ML training predominantly lies in a restricted region of the state space, while the behavior to be predicted evolves on a larger state space set not fully observed by the ML model during training. In this situation, it is required that the ML prediction system have the ability to extrapolate to different dynamics past that which is observed during training. We investigate the extent to which ML methods are capable of accomplishing useful results for this task, as well as conditions under which they fail. In general, we found that the ML methods were surprisingly effective even in situations that were extremely challenging, but do (as one would expect) fail when "too much" extrapolation is required. For the latter case, we investigate the effectiveness of combining the ML approach with conventional modeling based on scientific knowledge, thus forming a hybrid prediction system which we find can enable useful prediction even when its ML-based and knowledge-based components fail when acting alone. We also found that achieving useful results may require using very carefully selected ML hyperparameters and we propose a hyperparameter optimization strategy to address this problem. The main conclusion of this paper is that ML-based approaches are promising tools for predicting the behavior of non-stationary dynamical systems even in the case where the future evolution (perhaps due to the crossing of a tipping point) includes dynamics on a set outside of that explored by the training data.  ( 3 min )
    Continual Learning for Human State Monitoring. (arXiv:2207.00010v1 [cs.LG])
    Continual Learning (CL) on time series data represents a promising but under-studied avenue for real-world applications. We propose two new CL benchmarks for Human State Monitoring. We carefully designed the benchmarks to mirror real-world environments in which new subjects are continuously added. We conducted an empirical evaluation to assess the ability of popular CL strategies to mitigate forgetting in our benchmarks. Our results show that, possibly due to the domain-incremental properties of our benchmarks, forgetting can be easily tackled even with a simple finetuning and that existing strategies struggle in accumulating knowledge over a fixed, held-out, test subject.  ( 2 min )
    e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce. (arXiv:2207.00208v1 [cs.LG])
    Understanding vision and language representations of product content is vital for search and recommendation applications in e-commerce. As a backbone for online shopping platforms and inspired by the recent success in representation learning research, we propose a contrastive learning framework that aligns language and visual models using unlabeled raw product text and images. We present techniques we used to train large-scale representation learning models and share solutions that address domain-specific challenges. We study the performance using our pre-trained model as backbones for diverse downstream tasks, including category classification, attribute extraction, product matching, product clustering, and adult product recognition. Experimental results show that our proposed method outperforms the baseline in each downstream task regarding both single modality and multiple modalities.  ( 2 min )
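For reference, the generic symmetric contrastive (CLIP-style) objective that such alignment frameworks build on - a sketch, not e-CLIP's exact loss:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings:
    matched pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = clip_style_loss(torch.randn(32, 256), torch.randn(32, 256))
```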
    Online Reflective Learning for Robust Medical Image Segmentation. (arXiv:2207.00476v1 [cs.CV])
    Deep segmentation models often face the failure risks when the testing image presents unseen distributions. Improving model robustness against these risks is crucial for the large-scale clinical application of deep models. In this study, inspired by human learning cycle, we propose a novel online reflective learning framework (RefSeg) to improve segmentation robustness. Based on the reflection-on-action conception, our RefSeg firstly drives the deep model to take action to obtain semantic segmentation. Then, RefSeg triggers the model to reflect itself. Because making deep models realize their segmentation failures during testing is challenging, RefSeg synthesizes a realistic proxy image from the semantic mask to help deep models build intuitive and effective reflections. This proxy translates and emphasizes the segmentation flaws. By maximizing the structural similarity between the raw input and the proxy, the reflection-on-action loop is closed with segmentation robustness improved. RefSeg runs in the testing phase and is general for segmentation models. Extensive validation on three medical image segmentation tasks with a public cardiac MR dataset and two in-house large ultrasound datasets show that our RefSeg remarkably improves model robustness and reports state-of-the-art performance over strong competitors.  ( 2 min )
    Video + CLIP Baseline for Ego4D Long-term Action Anticipation. (arXiv:2207.00579v1 [cs.CV])
    In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pre-trained paired image-text model: CLIP and a video encoder Slowfast network. The CLIP embedding provides fine-grained understanding of objects relevant for an action whereas the slowfast network is responsible for modeling temporal information within a video clip of few frames. We show that the features obtained from both encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.  ( 2 min )
    Autonomous Intraluminal Navigation of a Soft Robot using Deep-Learning-based Visual Servoing. (arXiv:2207.00401v1 [cs.RO])
    Navigation inside luminal organs is an arduous task that requires non-intuitive coordination between the movement of the operator's hand and the information obtained from the endoscopic video. The development of tools to automate certain tasks could alleviate the physical and mental load of doctors during interventions, allowing them to focus on diagnosis and decision-making tasks. In this paper, we present a synergic solution for intraluminal navigation consisting of a 3D printed endoscopic soft robot that can move safely inside luminal structures. Visual servoing, based on Convolutional Neural Networks (CNNs) is used to achieve the autonomous navigation task. The CNN is trained with phantoms and in-vivo data to segment the lumen, and a model-less approach is presented to control the movement in constrained environments. The proposed robot is validated in anatomical phantoms in different path configurations. We analyze the movement of the robot using different metrics such as task completion time, smoothness, error in the steady-state, and mean and maximum error. We show that our method is suitable to navigate safely in hollow environments and conditions which are different than the ones the network was originally trained on.  ( 3 min )
    Automatic Evaluation of Speaker Similarity. (arXiv:2207.00344v1 [cs.SD])
    We introduce a new automatic evaluation method for speaker similarity assessment, that is consistent with human perceptual scores. Modern neural text-to-speech models require a vast amount of clean training data, which is why many solutions switch from single speaker models to solutions trained on examples from many different speakers. Multi-speaker models bring new possibilities, such as a faster creation of new voices, but also a new problem - speaker leakage, where the speaker identity of a synthesized example might not match those of the target speaker. Currently, the only way to discover this issue is through costly perceptual evaluations. In this work, we propose an automatic method for assessment of speaker similarity. For that purpose, we extend the recent work on speaker verification systems and evaluate how different metrics and speaker embeddings models reflect Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and significant correlation up to 0.78 Pearson score at the utterance level.  ( 2 min )
    A Multi-stage Framework with Mean Subspace Computation and Recursive Feedback for Online Unsupervised Domain Adaptation. (arXiv:2207.00003v1 [cs.LG])
    In this paper, we address the Online Unsupervised Domain Adaptation (OUDA) problem and propose a novel multi-stage framework to solve real-world situations when the target data are unlabeled and arriving online sequentially in batches. To project the data from the source and the target domains to a common subspace and manipulate the projected data in real-time, our proposed framework institutes a novel method, called an Incremental Computation of Mean-Subspace (ICMS) technique, which computes an approximation of mean-target subspace on a Grassmann manifold and is proven to be a close approximate to the Karcher mean. Furthermore, the transformation matrix computed from the mean-target subspace is applied to the next target data in the recursive-feedback stage, aligning the target data closer to the source domain. The computation of transformation matrix and the prediction of next-target subspace leverage the performance of the recursive-feedback stage by considering the cumulative temporal dependency among the flow of the target subspace on the Grassmann manifold. The labels of the transformed target data are predicted by the pre-trained source classifier, then the classifier is updated by the transformed data and predicted labels. Extensive experiments on six datasets were conducted to investigate in depth the effect and contribution of each stage in our proposed framework and its performance over previous approaches in terms of classification accuracy and computational speed. In addition, the experiments on traditional manifold-based learning models and neural-network-based learning models demonstrated the applicability of our proposed framework for various types of learning models.  ( 3 min )
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v1 [cs.LG])
Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients evaluated near a local optimum are mostly controlled by a few principal components. This behavior is similar to the conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning.  ( 2 min )
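The subspace diagnostic described above can be illustrated with a small numerical sketch; the gradients here are synthetic stand-ins, not gradients of a fine-tuned language model.

```python
# Minimal sketch: stack per-step gradients, take their principal components,
# and measure how much gradient magnitude falls inside the top-k subspace.
import numpy as np

rng = np.random.default_rng(0)
d, steps, k = 1000, 50, 10
# Synthetic gradients dominated by a few directions plus small noise.
basis = rng.normal(size=(k, d))
grads = rng.normal(size=(steps, k)) @ basis + 0.01 * rng.normal(size=(steps, d))

# Principal components of the observed gradients (via SVD).
_, _, vt = np.linalg.svd(grads, full_matrices=False)
top_k = vt[:k]                              # top-k principal directions

g = grads[-1]                               # gradient near the "optimum"
proj = top_k.T @ (top_k @ g)                # projection onto the subspace
ratio = np.linalg.norm(proj) / np.linalg.norm(g)
print(f"Fraction of gradient norm in top-{k} subspace: {ratio:.3f}")
```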
    Studying the impact of magnitude pruning on contrastive learning methods. (arXiv:2207.00200v1 [cs.LG])
We study the impact of different pruning techniques on the representations learned by deep neural networks trained with contrastive loss functions. Our work finds that at high sparsity levels, contrastive learning results in a higher number of misclassified examples relative to models trained with traditional cross-entropy loss. To understand this pronounced difference, we use metrics such as the number of PIEs (Hooker et al., 2019), Q-Score (Kalibhat et al., 2022), and PD-Score (Baldock et al., 2021) to measure the impact of pruning on the quality of the learned representation. Our analysis suggests that the point in training at which pruning is introduced matters: the negative impact of sparsity on the quality of the learned representation is highest when pruning is introduced early on in the training phase.  ( 2 min )
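For readers unfamiliar with the technique being studied, a minimal sketch of global magnitude pruning follows; the model and the 90% sparsity level are arbitrary placeholders.

```python
# Sketch of global magnitude pruning: zero out the fraction `sparsity` of
# weights with the smallest absolute values, across all weight matrices.
import torch

def magnitude_prune_(model: torch.nn.Module, sparsity: float) -> None:
    """In-place global magnitude pruning over all weight matrices."""
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_vals = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_vals, sparsity)
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())   # zero small weights

net = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 10))
magnitude_prune_(net, sparsity=0.9)   # e.g., applied late rather than early
zeros = sum((p == 0).sum().item() for p in net.parameters() if p.dim() > 1)
total = sum(p.numel() for p in net.parameters() if p.dim() > 1)
print(f"Sparsity achieved: {zeros / total:.2f}")
```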
    DP$^2$-NILM: A Distributed and Privacy-preserving Framework for Non-intrusive Load Monitoring. (arXiv:2207.00041v1 [cs.LG])
Non-intrusive load monitoring (NILM), which usually utilizes machine learning methods and is effective in disaggregating smart meter readings from the household level into appliance-level consumption, can help analyze the electricity consumption behaviour of users and enable practical smart energy and smart grid applications. Recent studies have proposed many novel NILM frameworks based on federated deep learning (FL). However, there is a lack of comprehensive research exploring utility optimization schemes and privacy-preserving schemes in different FL-based NILM application scenarios. In this paper, we make the first attempt at FL-based NILM focusing on both utility optimization and privacy preservation by developing a distributed and privacy-preserving NILM (DP2-NILM) framework and carrying out comparative experiments on practical NILM scenarios based on real-world smart meter datasets. Specifically, two alternative federated learning strategies are examined in the utility optimization schemes, i.e., FedAvg and FedProx. Moreover, different levels of privacy guarantees, i.e., local differential privacy federated learning and global differential privacy federated learning, are provided in DP2-NILM. Extensive comparison experiments are conducted on three real-world datasets to evaluate the proposed framework.  ( 2 min )
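A hedged sketch of the two federated strategies named above: FedAvg aggregates client parameters weighted by local data size, while FedProx adds a proximal term $\frac{\mu}{2}\lVert w - w_{global}\rVert^2$ to each client's local objective. The toy clients and weights below are placeholders, not the DP2-NILM code.

```python
# Minimal sketches of the FedAvg aggregation rule and the FedProx
# proximal gradient correction, on toy linear-model weights.
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of client model parameters (FedAvg)."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

def fedprox_local_grad(w, w_global, data_grad, mu=0.01):
    """Local gradient with FedProx's proximal pull toward the global model."""
    return data_grad + mu * (w - w_global)

clients = [np.array([1.0, 2.0]), np.array([2.0, 0.0]), np.array([0.0, 1.0])]
sizes = [100, 50, 50]
print("FedAvg global model:", fedavg(clients, sizes))
```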
    Analysis of Kinetic Models for Label Switching and Stochastic Gradient Descent. (arXiv:2207.00389v1 [math.AP])
In this paper we provide a novel approach to the analysis of kinetic models for label switching, which are used for particle systems that can randomly switch between gradient flows in different energy landscapes. Besides problems in biology and physics, we also demonstrate that stochastic gradient descent, the most popular technique in machine learning, can be understood in this setting when considering a time-continuous variant. Our analysis focuses on the case of evolution in a collection of external potentials, for which we provide analytical and numerical results about the evolution as well as the stationary problem.  ( 2 min )
    More is Better (Mostly): On the Backdoor Attacks in Federated Graph Neural Networks. (arXiv:2202.03195v3 [cs.CR] UPDATED)
Graph Neural Networks (GNNs) are a class of deep learning-based methods for processing graph domain information. GNNs have recently become a widely used graph analysis method due to their superior ability to learn representations for complex graph data. However, due to privacy concerns and regulation restrictions, centralized GNNs can be difficult to apply to data-sensitive scenarios. Federated learning (FL) is an emerging technology developed for privacy-preserving settings when several parties need to train a shared global model collaboratively. Although several research works have applied FL to train GNNs (Federated GNNs), there is no research on their robustness to backdoor attacks. This paper bridges this gap by conducting two types of backdoor attacks in Federated GNNs: centralized backdoor attacks (CBA) and distributed backdoor attacks (DBA). Our experiments show that the DBA attack success rate is higher than that of CBA in almost all evaluated cases. For CBA, the attack success rate of all local triggers is similar to that of the global trigger, even if the training set of the adversarial party is embedded with the global trigger. To further explore the properties of the two backdoor attacks in Federated GNNs, we evaluate the attack performance for different numbers of clients, trigger sizes, poisoning intensities, and trigger densities. Moreover, we explore the robustness of DBA and CBA against two state-of-the-art defenses. We find that both attacks are robust against the investigated defenses, showing the need to consider backdoor attacks in Federated GNNs as a novel threat that requires custom defenses.  ( 3 min )
    Lifelong Inverse Reinforcement Learning. (arXiv:2207.00461v1 [cs.LG])
Methods for learning from demonstration (LfD) have shown success in acquiring behavior policies by imitating a user. However, even for a single task, LfD may require numerous demonstrations. For versatile agents that must learn many tasks via demonstration, this process would substantially burden the user if each task were learned in isolation. To address this challenge, we introduce the novel problem of lifelong learning from demonstration, which allows the agent to continually build upon knowledge learned from previously demonstrated tasks to accelerate the learning of new tasks, reducing the number of demonstrations required. As one solution to this problem, we propose the first lifelong learning approach to inverse reinforcement learning, which learns consecutive tasks via demonstration, continually transferring knowledge between tasks to improve performance.  ( 2 min )
    Watermarking Graph Neural Networks based on Backdoor Attacks. (arXiv:2110.11024v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. Building a powerful GNN model is not a trivial task, as it requires a large amount of training data, powerful computing resources, and human expertise in fine-tuning the model. What is more, with the development of adversarial attacks, e.g., model stealing attacks, GNNs raise challenges to model authentication. To avoid copyright infringement on GNNs, it is necessary to verify the ownership of the GNN models. In this paper, we present a watermarking framework for GNNs for both graph and node classification tasks. We 1) design two strategies to generate watermarked data for the graph classification task and one for the node classification task, 2) embed the watermark into the host model through training to obtain the watermarked GNN model, and 3) verify the ownership of the suspicious model in a black-box setting. The experiments show that our framework can verify the ownership of GNN models with a very high probability (around $95\%$) for both tasks. Finally, we experimentally show that our watermarking approach is robust against two model modifications and an input reformation defense against backdoor attacks.  ( 3 min )
    Characterizing the Effect of Class Imbalance on the Learning Dynamics. (arXiv:2207.00391v1 [stat.ML])
    Data imbalance is a common problem in the machine learning literature that can have a critical effect on the performance of a model. Various solutions exist - such as the ones that focus on resampling or data generation - but their impact on the convergence of gradient-based optimizers used in deep learning is not understood. We here elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. The reason is not only that the gradient signal neglects the minority classes, but also that the minority classes are subject to a larger directional noise, which slows their learning by an amount related to the imbalance ratio. To address this problem, we propose a new algorithmic solution, for which we provide a detailed analysis of its convergence behavior. We show both theoretically and empirically that this new algorithm exhibits a better behavior with more stable learning curves for each class, as well as a better generalization performance.  ( 2 min )
    A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks. (arXiv:2111.04949v2 [cs.LG] UPDATED)
    The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.  ( 2 min )
    A Random Persistence Diagram Generator. (arXiv:2104.07737v3 [stat.ML] UPDATED)
    Topological data analysis (TDA) studies the shape patterns of data. Persistent homology is a widely used method in TDA that summarizes homological features of data at multiple scales and stores them in persistence diagrams (PDs). In this paper, we propose a random persistence diagram generator (RPDG) method that generates a sequence of random PDs from the ones produced by the data. RPDG is underpinned by a model based on pairwise interacting point processes, and a reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm. A first example, which is based on a synthetic dataset, demonstrates the efficacy of RPDG and provides a comparison with another method for sampling PDs. A second example demonstrates the utility of RPDG to solve a materials science problem given a real dataset of small sample size.  ( 2 min )
    LBDMIDS: LSTM Based Deep Learning Model for Intrusion Detection Systems for IoT Networks. (arXiv:2207.00424v1 [cs.CR])
In recent years, we have witnessed a huge growth in the number of Internet of Things (IoT) and edge devices being used in our everyday activities. This demands that the security of these devices from cyber attacks be improved to protect their users. For years, Machine Learning (ML) techniques have been used to develop Network Intrusion Detection Systems (NIDS) with the aim of increasing their reliability and robustness. Among the earlier ML techniques, decision trees (DT) performed well. In recent years, Deep Learning (DL) techniques have been used in an attempt to build more reliable systems. In this paper, a Deep Learning enabled Long Short Term Memory (LSTM) Autoencoder and a 13-feature Deep Neural Network (DNN) model were developed, which performed considerably better in terms of accuracy on the UNSW-NB15 and BoT-IoT datasets. Hence, we propose LBDMIDS, where we develop NIDS models based on variants of LSTMs, namely stacked LSTM and bidirectional LSTM, and validate their performance on the UNSW-NB15 and BoT-IoT datasets. This paper concludes that these variants in LBDMIDS outperform classic ML techniques and perform similarly to the DNN models that have been suggested in the past.  ( 2 min )
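As an illustration of the two LSTM variants mentioned, here are minimal Keras sketches; layer widths, window length, and feature count are placeholders rather than the paper's settings.

```python
# Illustrative Keras definitions of a stacked LSTM and a bidirectional LSTM
# classifier over windows of network-flow features.
from tensorflow.keras import layers, models

timesteps, n_features, n_classes = 10, 13, 2

stacked = models.Sequential([
    layers.LSTM(64, return_sequences=True,
                input_shape=(timesteps, n_features)),  # first LSTM layer
    layers.LSTM(32),                                   # second (stacked) layer
    layers.Dense(n_classes, activation="softmax"),
])

bidirectional = models.Sequential([
    layers.Bidirectional(layers.LSTM(64),
                         input_shape=(timesteps, n_features)),
    layers.Dense(n_classes, activation="softmax"),
])

for m in (stacked, bidirectional):
    m.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```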
    On Leave-One-Out Conditional Mutual Information For Generalization. (arXiv:2207.00581v1 [cs.LG])
We derive information theoretic generalization bounds for supervised learning algorithms based on a new measure of leave-one-out conditional mutual information (loo-CMI). Contrary to other CMI bounds, which are black-box bounds that do not exploit the structure of the problem and may be hard to evaluate in practice, our loo-CMI bounds can be computed easily and can be interpreted in connection to other notions such as classical leave-one-out cross-validation, stability of the optimization algorithm, and the geometry of the loss-landscape. It applies both to the outputs of training algorithms and to their predictions. We empirically validate the quality of the bound by evaluating its predicted generalization gap in scenarios for deep learning. In particular, our bounds are non-vacuous on large-scale image-classification tasks.  ( 2 min )
    Class-wise Thresholding for Robust Out-of-Distribution Detection. (arXiv:2110.15292v3 [cs.LG] UPDATED)
We consider the problem of detecting out-of-distribution (OoD) input data when using deep neural networks, and we propose a simple yet effective way to improve the robustness of several popular OoD detection methods against label shift. Our work is motivated by the observation that most existing OoD detection algorithms consider all training/test data as a whole, regardless of which class entry each input activates (inter-class differences). Through extensive experimentation, we have found that such practice leads to a detector whose performance is sensitive and vulnerable to label shift. To address this issue, we propose a class-wise thresholding scheme that can apply to most existing OoD detection algorithms and can maintain similar OoD detection performance even in the presence of label shift in the test distribution.  ( 2 min )
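A minimal sketch of the class-wise thresholding idea, assuming a max-softmax-style OoD score and a held-out in-distribution validation set (the per-class 95% TPR calibration below is a common convention, not necessarily the paper's exact recipe):

```python
# Instead of a single global cutoff on the OoD score, calibrate one
# threshold per predicted class on held-out in-distribution data.
import numpy as np

def fit_classwise_thresholds(scores, predicted_classes, n_classes, tpr=0.95):
    """Per-class score threshold keeping `tpr` of in-distribution samples."""
    thresholds = np.empty(n_classes)
    for c in range(n_classes):
        class_scores = scores[predicted_classes == c]
        thresholds[c] = np.quantile(class_scores, 1.0 - tpr)
    return thresholds

def is_ood(score, predicted_class, thresholds):
    """Flag an input as OoD if its score falls below its class's threshold."""
    return score < thresholds[predicted_class]

# Toy usage with synthetic max-softmax style scores.
rng = np.random.default_rng(0)
val_scores = rng.beta(8, 2, size=1000)      # placeholder in-distribution scores
val_preds = rng.integers(0, 10, size=1000)
th = fit_classwise_thresholds(val_scores, val_preds, n_classes=10)
print(is_ood(0.35, predicted_class=3, thresholds=th))
```

Because each threshold adapts to its own class's score distribution, a shift in the test-time class proportions no longer moves a single global operating point.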
    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v2 [cs.LG] UPDATED)
Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy- and difficult-to-classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance in a test set separately. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also look at how useful our system is based on feedback we received from ML experts.
    AdaSparse: Learning Adaptively Sparse Structures for Multi-Domain Click-Through Rate Prediction. (arXiv:2206.13108v2 [cs.IR] UPDATED)
Click-through rate (CTR) prediction is a fundamental technique in recommendation and advertising systems. Recent studies have proved that learning a unified model to serve multiple domains is effective for improving overall performance. However, it remains challenging to improve generalization across domains under limited training data, and current solutions are hard to deploy due to their computational complexity. In this paper, we propose a simple yet effective framework, AdaSparse, for multi-domain CTR prediction, which learns an adaptively sparse structure for each domain, achieving better generalization across domains with lower computational cost. In AdaSparse, we introduce domain-aware neuron-level weighting factors to measure the importance of neurons, with which our model can prune redundant neurons for each domain to improve generalization. We further add flexible sparsity regularizations to control the sparsity ratio of the learned structures. Offline and online experiments show that AdaSparse outperforms previous multi-domain CTR models significantly.
    Expected Scalarised Returns Dominance: A New Solution Concept for Multi-Objective Decision Making. (arXiv:2106.01048v3 [cs.LG] UPDATED)
    In many real-world scenarios, the utility of a user is derived from the single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a set of optimal solutions has yet to be defined. In this paper we address this challenge by proposing first-order stochastic dominance as a criterion to build solution sets to maximise expected utility. We also propose a new dominance criterion, known as expected scalarised returns (ESR) dominance, that extends first-order stochastic dominance to allow a set of optimal policies to be learned in practice. We then define a new solution concept called the ESR set, which is a set of policies that are ESR dominant. Finally, we define a new multi-objective distributional tabular reinforcement learning (MOT-DRL) algorithm to learn the ESR set in a multi-objective multi-armed bandit setting.
    From Kepler to Newton: Explainable AI for Science Discovery. (arXiv:2111.12210v5 [cs.AI] UPDATED)
The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data exploding at both mega-scale and milli-scale in scientific research, it has sometimes become very difficult to manually analyze the data and propose new hypotheses to drive the cycle of scientific discovery. In this paper, we discuss the role of Explainable AI in the scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data, work that led the scientific revolution of the 16th-17th centuries. This work also highlights the important role of Explainable AI (as compared to black-box AI) in science discovery, helping humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know-how, but also the know-why.
    Prioritized training on points that are learnable, worth learning, and not yet learned (workshop version). (arXiv:2107.02565v3 [cs.LG] UPDATED)
    We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard" (e.g. high loss) points usually selected in the optimization literature are typically noisy, while the "easy" (e.g. low noise) samples often prioritized for curriculum learning confer less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, Goldilocks Selection chooses points that are "just right" and empirically outperforms the above approaches. Moreover, the selected sequence can transfer to other architectures; practitioners can share and reuse it without the need to recreate it.
    Few-Shot Document-Level Relation Extraction. (arXiv:2205.02048v2 [cs.CL] UPDATED)
    We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and construct a benchmark based on two existing supervised learning data sets, DocRED and sciERC. We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation. We find FSDLRE to be a challenging setting with interesting new characteristics such as the ability to sample NOTA instances from the support set. The data, code, and trained models are available online (https://github.com/nicpopovic/FREDo).
    Learning Symmetric Embeddings for Equivariant World Models. (arXiv:2204.11371v2 [cs.LG] UPDATED)
    Incorporating symmetries can lead to highly data-efficient and generalizable models by defining equivalence classes of data samples related by transformations. However, characterizing how transformations act on input data is often difficult, limiting the applicability of equivariant models. We propose learning symmetric embedding networks (SENs) that encode an input space (e.g. images), where we do not know the effect of transformations (e.g. rotations), to a feature space that transforms in a known manner under these operations. This network can be trained end-to-end with an equivariant task network to learn an explicitly symmetric representation. We validate this approach in the context of equivariant transition models with 3 distinct forms of symmetry. Our experiments demonstrate that SENs facilitate the application of equivariant networks to data with complex symmetry representations. Moreover, doing so can yield improvements in accuracy and generalization relative to both fully-equivariant and non-equivariant baselines.
    Graph Neural Networks for Graph Drawing. (arXiv:2109.10061v3 [cs.LG] UPDATED)
Graph Drawing techniques have been developed in the last few years with the purpose of producing aesthetically pleasing node-link layouts. Recently, the employment of differentiable loss functions has paved the way for the massive usage of Gradient Descent and related optimization algorithms. In this paper, we propose a novel framework for the development of Graph Neural Drawers (GND), machines that rely on neural computation for constructing efficient and complex maps. GNDs are Graph Neural Networks (GNNs) whose learning process can be driven by any provided loss function, such as the ones commonly employed in Graph Drawing. Moreover, we prove that this mechanism can be guided by loss functions computed by means of Feedforward Neural Networks, on the basis of supervision hints that express beauty properties, like the minimization of crossing edges. In this context, we show that GNNs can nicely be enriched by positional features to deal also with unlabelled vertices. We provide a proof-of-concept by constructing a loss function for the edge-crossing and provide quantitative and qualitative comparisons among different GNN models working under the proposed framework.
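To illustrate the core mechanism of driving a layout with a differentiable Graph Drawing loss, here is a sketch that optimizes 2D node coordinates directly under the classic stress loss; the paper's GNDs replace the raw coordinates with a GNN and can also use learned loss functions.

```python
# Sketch: optimize 2D coordinates of a 4-node toy graph (a square with one
# diagonal) by gradient descent on the differentiable stress loss.
import torch

n = 4
# Graph-theoretic target distances: shortest-path lengths of the toy graph
# with edges (0,1), (1,2), (2,3), (3,0), (0,2).
d = torch.tensor([[0., 1, 1, 1],
                  [1., 0, 1, 2],
                  [1., 1, 0, 1],
                  [1., 2, 1, 0]])

pos = torch.randn(n, 2, requires_grad=True)          # layout to optimize
opt = torch.optim.Adam([pos], lr=0.05)

for step in range(500):
    diff = pos.unsqueeze(0) - pos.unsqueeze(1)       # pairwise displacements
    dist = (diff.pow(2).sum(dim=-1) + 1e-9).sqrt()   # distances (safe at 0)
    stress = ((dist - d) ** 2 / d.clamp(min=1.0) ** 2).triu(1).sum()
    opt.zero_grad()
    stress.backward()
    opt.step()

print("Final stress:", stress.item())
```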
    TGL: A General Framework for Temporal GNN Training on Billion-Scale Graphs. (arXiv:2203.14883v2 [cs.LG] UPDATED)
Many real-world graphs contain time domain information. Temporal Graph Neural Networks capture temporal information as well as structural and contextual information in the generated dynamic node embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many different tasks. In this work, we propose TGL, a unified framework for large-scale offline Temporal Graph Neural Network training where users can compose various Temporal Graph Neural Networks with simple configuration files. TGL comprises five main components: a temporal sampler, a mailbox, a node memory module, a memory updater, and a message passing engine. We design a Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem of obsolete node memory when training with a large batch size. To address the limitations of current TGNNs only being evaluated on small-scale datasets, we introduce two large-scale real-world datasets with 0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a single GPU and the two large datasets with multiple GPUs for both link prediction and node classification tasks. We compare TGL with the open-sourced code of five methods and show that TGL achieves similar or better accuracy with an average of 13x speedup. Our temporal parallel sampler achieves an average of 173x speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one epoch of more than one billion temporal edges within 1-10 hours. To the best of our knowledge, this is the first work that proposes a general framework for large-scale Temporal Graph Neural Network training on multiple GPUs.
    EvoVGM: a Deep Variational Generative Model for Evolutionary Parameter Estimation. (arXiv:2205.13034v2 [cs.LG] UPDATED)
    Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as it is performed within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80 and GTR. We train the model via a low-variance stochastic estimator and a gradient ascent algorithm. Here, we analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated with several evolutionary scenarios and different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v2 [math.OC] UPDATED)
In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) :=G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, however, at the expense of increased communication. To avoid this, we propose a more efficient variant GT-GDA-Lite that does not incur the additional communication and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.
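A centralized, single-node sketch of one gradient descent-ascent loop on the saddle objective above, with quadratic $G$ and $H$ for illustration (GT-GDA itself adds distributed gradient tracking over a network, which is omitted here):

```python
# Centralized sketch: gradient descent-ascent for
# min_x max_y  G(x) + <y, P x> - H(y), with G(x) = 0.5||x||^2, H(y) = 0.5||y||^2.
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3
P = rng.normal(size=(m, n))                  # coupling matrix (placeholder)

x, y = rng.normal(size=n), rng.normal(size=m)
lr = 0.05
for _ in range(3000):
    grad_x = x + P.T @ y                     # d/dx [G(x) + y^T P x]
    grad_y = P @ x - y                       # d/dy [y^T P x - H(y)]
    x -= lr * grad_x                         # descent in x
    y += lr * grad_y                         # ascent in y

# For this choice of G and H the unique saddle point is (0, 0).
print("Distance to saddle point:", np.linalg.norm(x), np.linalg.norm(y))
```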
    Causal Reasoning Meets Visual Representation Learning: A Prospective Study. (arXiv:2204.12037v5 [cs.CV] UPDATED)
Visual representation learning is ubiquitous in various real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization is becoming a challenge for existing visual models. The majority of existing methods tend to fit the original data/variable distributions and ignore the essential causal relations behind the multi-modal knowledge, lacking a unified guidance and analysis of why modern visual representation learning methods easily collapse into data bias and have limited generalization and cognitive abilities. Inspired by the strong inference ability of human-level agents, recent years have therefore witnessed great effort in developing causal reasoning paradigms to realize robust representation and model learning with good cognitive ability. In this paper, we conduct a comprehensive review of existing causal reasoning methods for visual representation learning, covering fundamental theories, models, and datasets. The limitations of current methods and datasets are also discussed. Moreover, we propose some prospective challenges, opportunities, and future research directions for benchmarking causal reasoning algorithms in visual representation learning. This paper aims to provide a comprehensive overview of this emerging field, attract attention, encourage discussions, bring to the forefront the urgency of developing novel causal reasoning methods, publicly available benchmarks, and consensus-building standards for reliable visual representation learning and related real-world applications more efficiently.
    ML4ML: Automated Invariance Testing for Machine Learning Models. (arXiv:2109.12926v2 [cs.LG] UPDATED)
    In machine learning (ML) workflows, determining the invariance qualities of an ML model is a common testing procedure. Traditionally, invariance qualities are evaluated using simple formula-based scores, e.g., accuracy. In this paper, we show that testing the invariance qualities of ML models may result in complex visual patterns that cannot be classified using simple formulas. In order to test ML models by analyzing such visual patterns automatically using other ML models, we propose a systematic framework that is applicable to a variety of invariance qualities. We demonstrate the effectiveness and feasibility of the framework by developing ML4ML models (assessors) for determining rotation-, brightness-, and size-variances of a collection of neural networks. Our testing results show that the trained ML4ML assessors can perform such analytical tasks with sufficient accuracy.
    Enhancing Computational Fluid Dynamics with Machine Learning. (arXiv:2110.02085v2 [physics.flu-dyn] UPDATED)
    Machine learning is rapidly becoming a core technology for scientific computing, with numerous opportunities to advance the field of computational fluid dynamics. In this Perspective, we highlight some of the areas of highest potential impact, including to accelerate direct numerical simulations, to improve turbulence closure modeling, and to develop enhanced reduced-order models. We also discuss emerging areas of machine learning that are promising for computational fluid dynamics, as well as some potential limitations that should be taken into account.
    Topology-Aware Network Pruning using Multi-stage Graph Embedding and Reinforcement Learning. (arXiv:2102.03214v2 [cs.CV] UPDATED)
    Model compression is an essential technique for deploying deep neural networks (DNNs) on power and memory-constrained resources. However, existing model-compression methods often rely on human expertise and focus on parameters' local importance, ignoring the rich topology information within DNNs. In this paper, we propose a novel multi-stage graph embedding technique based on graph neural networks (GNNs) to identify DNN topologies and use reinforcement learning (RL) to find a suitable compression policy. We performed resource-constrained (i.e., FLOPs) channel pruning and compared our approach with state-of-the-art model compression methods. We evaluated our method on various models from typical to mobile-friendly networks, such as ResNet family, VGG-16, MobileNet-v1/v2, and ShuffleNet. Results show that our method can achieve higher compression ratios with a minimal fine-tuning cost yet yields outstanding and competitive performance.
    Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2207.00486v1 [cs.LG])
    A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for NDPPs restricted to size-$k$ subsets (called $k$-NDPPs). However, the runtime of this approach is quadratic in $n$, making it infeasible for large-scale settings. In this work, we develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels, thus enabling runtime that is sublinear in $n$. Our method is based on a state-of-the-art NDPP rejection sampling algorithm, which we enhance with a novel approach for efficiently constructing the proposal distribution. Furthermore, we extend our scalable $k$-NDPP sampling algorithm to NDPPs without size constraints. Our resulting sampling method has polynomial time complexity in the rank of the kernel, while the existing approach has runtime that is exponential in the rank. With both a theoretical analysis and experiments on real-world datasets, we verify that our scalable approximate sampling algorithms are orders of magnitude faster than existing sampling approaches for $k$-NDPPs and NDPPs.
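For background, a (N)DPP assigns each subset $S$ a probability proportional to $\det(L_S)$, the principal minor of the kernel indexed by $S$. The sketch below builds a low-rank nonsymmetric kernel of the form $L = VV^\top + B(C - C^\top)B^\top$, following a common NDPP parameterization, and evaluates one subset score; it is illustrative, not the paper's sampler.

```python
# Sketch: low-rank nonsymmetric DPP kernel (PSD part plus skew-symmetric
# part, so all principal minors are nonnegative) and one subset score.
import numpy as np

rng = np.random.default_rng(0)
n, r = 100, 5
V = rng.normal(size=(n, r))
B = rng.normal(size=(n, r))
C = rng.normal(size=(r, r))
L = V @ V.T + B @ (C - C.T) @ B.T            # nonsymmetric low-rank kernel

def subset_score(L, S):
    """Unnormalized NDPP probability of subset S: det of the principal minor."""
    return float(np.linalg.det(L[np.ix_(S, S)]))

print(subset_score(L, [3, 17, 42]))
```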
    Enhancing cluster analysis via topological manifold learning. (arXiv:2207.00510v1 [cs.LG])
We discuss topological aspects of cluster analysis and show that inferring the topological structure of a dataset before clustering it can considerably enhance cluster detection: theoretical arguments and empirical evidence show that clustering embedding vectors, representing the structure of a data manifold instead of the observed feature vectors themselves, is highly beneficial. To demonstrate, we combine the manifold learning method UMAP for inferring the topological structure with the density-based clustering method DBSCAN. Synthetic and real data results show that this both simplifies and improves clustering in a diverse set of low- and high-dimensional problems including clusters of varying density and/or entangled shapes. Our approach simplifies clustering because topological pre-processing consistently reduces parameter sensitivity of DBSCAN. Clustering the resulting embeddings with DBSCAN can then even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering does not appear to be the nominal dimension of the data or how many irrelevant features it contains, but rather how separable the clusters are in the ambient observation space they are embedded in, which is usually the (high-dimensional) Euclidean space defined by the features of the data. Our approach is successful because we perform the cluster analysis after projecting the data into a more suitable space that is optimized for separability, in some sense.
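A minimal sketch of the described pipeline, embedding with UMAP and then clustering with DBSCAN (parameter values are illustrative defaults, not the paper's tuned settings; requires the umap-learn and scikit-learn packages):

```python
# Step 1: infer the topological/manifold structure with UMAP.
# Step 2: density-based clustering in the embedded space with DBSCAN.
import umap
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

embedding = umap.UMAP(n_neighbors=15, min_dist=0.0, n_components=2,
                      random_state=0).fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(embedding)
print("Clusters found (label -1 = noise):", set(labels))
```

Setting `min_dist=0.0` lets UMAP pack points within a cluster tightly, which tends to make the subsequent density-based step less sensitive to the choice of `eps`.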
    Transfer learning of phase transitions in percolation and directed percolation. (arXiv:2112.15516v5 [cond-mat.stat-mech] UPDATED)
The latest advances in statistical physics have shown remarkable performance of machine learning in identifying phase transitions. In this paper, we apply a domain adversarial neural network (DANN) based on transfer learning to studying non-equilibrium and equilibrium phase transition models, namely the directed percolation (DP) model and the percolation model, respectively. With the DANN, only a small, automatically chosen fraction of input configurations (2D images) needs to be labeled in order to capture the critical point. To learn the DP model, the method is refined by an iterative procedure in determining the critical point, which is a prerequisite for the data collapse in calculating the critical exponent $\nu_{\perp}$. We then apply the DANN to a two-dimensional site percolation with configurations filtered to include only the largest cluster, which may contain the information related to the order parameter. The DANN learning of both models yields reliable results which are comparable to the ones from Monte Carlo simulations. Our study also shows that the DANN can achieve quite high accuracy at much lower cost, compared to supervised learning.
    CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data. (arXiv:2006.04942v2 [cs.SI] UPDATED)
    We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop both a Monte Carlo EM as well as a message passing algorithm to infer contact-channel specific infection transmission probabilities. Our Monte Carlo algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the block-Gibbs sampling algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at https://github.com/zalandoresearch/CRISP.  ( 3 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v2 [stat.ML] UPDATED)
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.
    On Optimal Control and Expectation-Maximisation: Theory and an Outlook Towards Algorithms. (arXiv:2205.03279v2 [cs.LG] UPDATED)
In this work we demonstrate how both the Stochastic and the Risk-Sensitive Optimal Control problem can be treated by means of the Expectation-Maximisation algorithm. We show how such a treatment materialises into two separate iterative programs that each generate a unique but closely related sequence of density functions. We motivate interpreting these density functions as beliefs, that is, as probabilistic proxies for the deterministic optimal policy. More formally, two fixed point iteration schemes are derived, with the stationary point coinciding with the deterministic optimal policies, by virtue of the proven convergence of Expectation-Maximisation methods. We point out that our results are intimately related to the paradigm of Control as Inference. Control as Inference here refers to a collection of approaches whose aim is also to recast optimal control as an instance of probabilistic inference. Although said paradigm has already resulted in the development of several powerful Reinforcement Learning algorithms, the fundamental problem statement is usually introduced by teleological arguments. We argue that the present results demonstrate that earlier established Control as Inference frameworks in fact isolate a single step from either of the proposed iterative programs. In any case, the present treatment provides them with a deontological argument of validity. By exposing the underlying technical mechanism we aim to contribute to the general acceptance of Control as Inference as a framework superseding the present Optimal Control paradigm. In order to motivate the general relevance of the presented treatment we further discuss parallels with Path Integral Control and other areas of research before sketching the outlines of future algorithmic development.
    auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Event Data. (arXiv:2204.07276v3 [cs.LG] UPDATED)
    Applications of machine learning in healthcare often require working with time-to-event prediction tasks including prognostication of an adverse event, re-hospitalization or death. Such outcomes are typically subject to censoring due to loss of follow up. Standard machine learning methods cannot be applied in a straightforward manner to datasets with censored outcomes. In this paper, we present auton-survival, an open-source repository of tools to streamline working with censored time-to-event or survival data. auton-survival includes tools for survival regression, adjustment in the presence of domain shift, counterfactual estimation, phenotyping for risk stratification, evaluation, as well as estimation of treatment effects. Through real world case studies employing a large subset of the SEER oncology incidence data, we demonstrate the ability of auton-survival to rapidly support data scientists in answering complex health and epidemiological questions.
    Learning to correct spectral methods for simulating turbulent flows. (arXiv:2207.00556v1 [cs.LG])
    Despite their ubiquity throughout science and engineering, only a handful of partial differential equations (PDEs) have analytical, or closed-form solutions. This motivates a vast amount of classical work on numerical simulation of PDEs and more recently, a whirlwind of research into data-driven techniques leveraging machine learning (ML). A recent line of work indicates that a hybrid of classical numerical techniques with machine learning can offer significant improvements over either approach alone. In this work, we show that the choice of the numerical scheme is crucial when incorporating physics-based priors. We build upon Fourier-based spectral methods, which are considerably more efficient than other numerical schemes for simulating PDEs with smooth and periodic solutions. Specifically, we develop ML-augmented spectral solvers for three model PDEs of fluid dynamics, which improve upon the accuracy of standard spectral solvers at the same resolution. We also demonstrate a handful of key design principles for combining machine learning and numerical methods for solving PDEs.
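To show where a learned correction slots into a Fourier spectral solver, here is a sketch for 1D viscous Burgers with explicit Euler time stepping; the correction term is a placeholder zero function standing in for a trained model, and all numerical settings are illustrative.

```python
# Sketch: Fourier spectral step for 1D viscous Burgers u_t = -u u_x + nu u_xx,
# with a slot where a learned correction would be added.
import numpy as np

N, L, dt, nu = 128, 2 * np.pi, 1e-4, 0.01
x = np.linspace(0, L, N, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)   # angular wavenumbers

def learned_correction(u):
    """Placeholder for an ML model trained to correct coarse-grid error."""
    return np.zeros_like(u)

u = np.sin(x)                                # initial condition
for _ in range(1000):                        # explicit Euler time stepping
    u_hat = np.fft.fft(u)
    u_x = np.real(np.fft.ifft(1j * k * u_hat))       # spectral du/dx
    u_xx = np.real(np.fft.ifft(-(k ** 2) * u_hat))   # spectral d2u/dx2
    u = u + dt * (-u * u_x + nu * u_xx + learned_correction(u))

print("Solution range after integration:", float(u.min()), float(u.max()))
```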
    SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition. (arXiv:2202.04849v2 [cs.LG] UPDATED)
Methods that extract policy primitives from offline demonstrations using deep generative models have shown promise at accelerating reinforcement learning (RL) for new tasks. Intuitively, these methods should also help to train safe RL agents because they enforce useful skills. However, we identify that these techniques are not well equipped for safe policy learning because they ignore negative experiences (e.g., unsafe or unsuccessful), focusing only on positive experiences, which harms their ability to generalize to new tasks safely. Rather, we model the latent safety context using principled contrastive training on an offline dataset of demonstrations from many tasks, including both negative and positive experiences. Using this latent variable, our RL framework, SAFEty skill pRiors (SAFER), extracts task-specific safe primitive skills to safely and successfully generalize to new tasks. In the inference stage, policies trained with SAFER learn to compose safe skills into successful policies. We theoretically characterize why SAFER can enforce safe policy learning and demonstrate its effectiveness on several complex safety-critical robotic grasping tasks inspired by the game Operation, in which SAFER outperforms state-of-the-art primitive learning methods in success and safety.
    Shai-am: A Machine Learning Platform for Investment Strategies. (arXiv:2207.00436v1 [q-fin.GN])
    The finance industry has adopted machine learning (ML) as a form of quantitative research to support better investment decisions, yet there are several challenges often overlooked in practice. (1) ML code tends to be unstructured and ad hoc, which hinders cooperation with others. (2) Resource requirements and dependencies vary depending on which algorithm is used, so a flexible and scalable system is needed. (3) It is difficult for domain experts in traditional finance to apply their experience and knowledge in ML-based strategies unless they acquire expertise in recent technologies. This paper presents Shai-am, an ML platform integrated with our own Python framework. The platform leverages existing modern open-source technologies, managing containerized pipelines for ML-based strategies with unified interfaces to solve the aforementioned issues. Each strategy implements the interface defined in the core framework. The framework is designed to enhance reusability and readability, facilitating collaborative work in quantitative research. Shai-am aims to be a pure AI asset manager for solving various tasks in financial markets.  ( 2 min )
    Reinforcement Learning of Multi-Domain Dialog Policies Via Action Embeddings. (arXiv:2207.00468v1 [cs.CL])
    Learning task-oriented dialog policies via reinforcement learning typically requires large amounts of interaction with users, which in practice renders such methods unusable for real-world applications. In order to reduce the data requirements, we propose to leverage data from across different dialog domains, thereby reducing the amount of data required from each given domain. In particular, we propose to learn domain-agnostic action embeddings, which capture general-purpose structure that informs the system how to act given the current dialog context, and are then specialized to a specific domain. We show how this approach is capable of learning with significantly less interaction with users, with a reduction of 35% in the number of dialogs required to learn, and to a higher level of proficiency than training separate policies for each domain on a set of simulated domains.
    Generative Adversarial Networks and Image-Based Malware Classification. (arXiv:2207.00421v1 [cs.CR])
    For efficient malware removal, determination of malware threat levels, and damage estimation, malware family classification plays a critical role. In this paper, we extract features from malware executable files and represent them as images using various approaches. We then focus on Generative Adversarial Networks (GAN) for multiclass classification and compare our GAN results to other popular machine learning techniques, including Support Vector Machine (SVM), XGBoost, and Restricted Boltzmann Machines (RBM). We find that the AC-GAN discriminator is generally competitive with other machine learning techniques. We also evaluate the utility of the GAN generative model for adversarial attacks on image-based malware detection. While AC-GAN generated images are visually impressive, we find that they are easily distinguished from real malware images using any of several learning techniques. This result indicates that our GAN generated images would be of little value in adversarial attacks.  ( 2 min )
    The "AI+R"-tree: An Instance-optimized R-tree. (arXiv:2207.00550v1 [cs.DB])
The emerging class of instance-optimized systems has shown potential to achieve high performance by specializing to specific data and query workloads. In particular, Machine Learning (ML) techniques have been applied successfully to build various instance-optimized components (e.g., learned indexes). This paper investigates leveraging ML techniques to enhance the performance of spatial indexes, particularly the R-tree, for given data and query workloads. As the areas covered by the R-tree index nodes overlap in space, upon searching for a specific point in space, multiple paths from root to leaf may potentially be explored. In the worst case, the entire R-tree could be searched. In this paper, we define and use the overlap ratio to quantify the degree of extraneous leaf node accesses required by a range query. The goal is to enhance the query performance of a traditional R-tree for high-overlap range queries, as they tend to incur long running times. We introduce a new AI-tree that transforms the search operation of an R-tree into a multi-label classification task to exclude the extraneous leaf node accesses. Then, we combine the AI-tree with a traditional R-tree to form a hybrid "AI+R"-tree. The "AI+R"-tree can automatically differentiate between high- and low-overlap queries using a learned model. Thus, the "AI+R"-tree processes high-overlap queries using the AI-tree, and low-overlap queries using the R-tree. Experiments on real datasets demonstrate that the "AI+R"-tree can enhance the query performance over a traditional R-tree by up to 500%.  ( 3 min )
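A small, self-contained sketch of the overlap-ratio diagnostic described above, using plain axis-aligned rectangles as stand-ins for R-tree leaf MBRs (this is an illustration of the definition, not the paper's implementation):

```python
# Overlap ratio of a range query: the fraction of intersected (accessed)
# leaves that contribute no answers to the query.
def intersects(a, b):
    """Axis-aligned rectangles given as (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def overlap_ratio(query, leaf_mbrs, leaf_contents):
    """Fraction of accessed leaves that hold no records matching the query."""
    accessed = [i for i, mbr in enumerate(leaf_mbrs) if intersects(query, mbr)]
    answering = [i for i in accessed
                 if any(intersects(query, rec) for rec in leaf_contents[i])]
    return 1.0 - len(answering) / max(len(accessed), 1)

# Toy usage: two leaves intersect the query region but only one holds a match.
leaves = [(0, 0, 5, 5), (4, 4, 9, 9)]
contents = [[(1, 1, 2, 2)], [(5, 5, 6, 6)]]
print(overlap_ratio((3, 3, 6, 6), leaves, contents))  # -> 0.5
```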
    KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. (arXiv:1805.05071v3 [stat.ML] UPDATED)
    We consider $K$-armed stochastic bandits and consider cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). M\'enard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Capp\'e et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.
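As a concrete reference point, here is a sketch of the MOSS index mentioned above, run on a toy Bernoulli bandit; the instance and horizon are arbitrary, and the full KL-UCB-switch policy (which interleaves MOSS- and KL-UCB-style indices) is not reproduced here.

```python
# MOSS index: empirical mean plus the bonus sqrt(max(ln(T/(K n)), 0) / n).
import numpy as np

def moss_index(mean, n_pulls, T, K):
    """MOSS upper-confidence index for one arm."""
    bonus = np.sqrt(max(np.log(T / (K * n_pulls)), 0.0) / n_pulls)
    return mean + bonus

rng = np.random.default_rng(0)
means, T = [0.4, 0.5, 0.6], 10000
K = len(means)
# Initialize by pulling each arm once.
counts = np.ones(K)
sums = np.array([rng.binomial(1, m) for m in means], dtype=float)

for t in range(K, T):
    idx = [moss_index(sums[i] / counts[i], counts[i], T, K) for i in range(K)]
    a = int(np.argmax(idx))
    sums[a] += rng.binomial(1, means[a])
    counts[a] += 1

print("Pull counts (best arm should dominate):", counts)
```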
    Personalized Diagnostic Tool for Thyroid Cancer Classification using Multi-view Ultrasound. (arXiv:2207.00496v1 [cs.CV])
    Over the past decades, the incidence of thyroid cancer has been increasing globally. Accurate and early diagnosis allows timely treatment and helps to avoid over-diagnosis. Clinically, a nodule is commonly evaluated from both transverse and longitudinal views using thyroid ultrasound. However, the appearance of the thyroid gland and lesions can vary dramatically across individuals. Identifying key diagnostic information from both views requires specialized expertise. Furthermore, finding an optimal way to integrate multi-view information also relies on the experience of clinicians and adds further difficulty to accurate diagnosis. To address these, we propose a personalized diagnostic tool that can customize its decision-making process for different patients. It consists of a multi-view classification module for feature extraction and a personalized weighting allocation network that generates optimal weighting for different views. It is also equipped with a self-supervised view-aware contrastive loss to further improve the model robustness towards different patient groups. Experimental results show that the proposed framework can better utilize multi-view information and outperform the competing methods.  ( 2 min )
    Implicit adaptation of mesh model of transient heat conduction problem. (arXiv:2207.00444v1 [eess.SY])
Considering high-temperature heating, the equations of the transient heat conduction model require adaptation, i.e. the dependence of the thermophysical parameters of the model on temperature has to be identified for each specific material to be heated. This problem is most often solved by approximating tabular data on measurements of the required parameters, which can be found in the literature, by means of regression equations. But, for example, for the steel heating process, this approach is difficult to implement due to the lack of tabular discrete measurements for many grades of steel, such as alloyed ones. In this paper, a new approach is proposed, which is based on the solution of a related variational problem. Its main idea is to substitute the adaptation process in the classical sense (i.e., finding the dependencies of thermophysical parameters on temperature) with 'supervised learning' of a mesh model on the basis of technological data received from the plant. The equations to adjust the parameters of the transient heat conduction model, which are related to the thermophysical coefficients, have been derived. A numerical experiment is conducted for steel of a particular group of grades, for which enough technological as well as tabular data are available. As a result, the 'trained' mesh model, which has not received explicitly any information about the physical and chemical properties of the heated substance, demonstrated an average error of 18.82 °C, which is quite close to the average error of the model adapted classically on the basis of the tabular data (18.1 °C).
    How can spherical CNNs benefit ML-based diffusion MRI parameter estimation?. (arXiv:2207.00572v1 [eess.IV])
This paper demonstrates that spherical convolutional neural networks (S-CNN) offer distinct advantages over conventional fully-connected networks (FCN) at estimating scalar parameters of tissue microstructure from diffusion MRI (dMRI). Such microstructure parameters are valuable for identifying pathology and quantifying its extent. However, current clinical practice commonly acquires dMRI data consisting of only 6 diffusion weighted images (DWIs), limiting the accuracy and precision of estimated microstructure indices. Machine learning (ML) has been proposed to address this challenge. However, existing ML-based methods are not robust to differing dMRI gradient sampling schemes, nor are they rotation equivariant. Lack of robustness to sampling schemes requires a new network to be trained for each scheme, complicating the analysis of data from multiple sources. A possible consequence of the lack of rotational equivariance is that the training dataset must contain a diverse range of microstructure orientations. Here, we show spherical CNNs represent a compelling alternative that is robust to new sampling schemes as well as offering rotational equivariance. We show the latter can be leveraged to decrease the number of training datapoints required.
    A Shallow Ritz Method for Elliptic Problems with Singular Sources. (arXiv:2107.12013v3 [math.NA] UPDATED)
In this paper, a shallow Ritz-type neural network for solving elliptic equations with delta function singular sources on an interface is developed. There are three novel features in the present work: (i) the delta function singularity is naturally removed, (ii) the level set function is introduced as a feature input, (iii) the network is completely shallow, comprising only one hidden layer. We first introduce the energy functional of the problem and then transform the contribution of singular sources to a regular surface integral along the interface. In such a way, the delta function singularity can be naturally removed without introducing a discrete one that is commonly used in traditional regularization methods, such as the well-known immersed boundary method. The original problem is then reformulated as a minimization problem. We propose a shallow Ritz-type neural network with one hidden layer to approximate the global minimizer of the energy functional. As a result, the network is trained by minimizing the loss function that is a discrete version of the energy. In addition, we include the level set function of the interface as a feature input of the network and find that it significantly improves the training efficiency and accuracy. We perform a series of numerical tests to show the accuracy of the present method and its capability for problems in irregular domains and higher dimensions.
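A minimal PyTorch sketch of the idea, assuming a 2-D problem: one hidden layer, the level set value appended as a third input feature, and the energy discretised by Monte-Carlo sampling. The problem data (`f_dom`, interface samples `x_ifc`, source strengths `q_ifc`, and `phi`) are placeholders, and the surface term omits the interface measure weighting for brevity.

```python
# Sketch only: shallow Ritz network with the level set value as a feature.
import torch

net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))   # input: (x, y, phi(x, y))

def energy(x_dom, f_dom, x_ifc, q_ifc, phi):
    # phi(X) is assumed to return a column vector of level set values.
    xd = torch.cat([x_dom, phi(x_dom)], dim=1).requires_grad_(True)
    u = net(xd)
    # Spatial gradient only; the phi feature is treated as a fixed input here.
    grad = torch.autograd.grad(u.sum(), xd, create_graph=True)[0][:, :2]
    bulk = (0.5 * (grad ** 2).sum(1) - f_dom * u.squeeze()).mean()
    # The singular source becomes a regular surface term along the interface.
    xi = torch.cat([x_ifc, phi(x_ifc)], dim=1)
    surface = (q_ifc * net(xi).squeeze()).mean()
    return bulk - surface
```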
    Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models. (arXiv:2110.02891v2 [cs.LG] UPDATED)
Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, under an unsupervised-style setting, typical training algorithms for controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but unpaired samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. The proposed method is simple yet effective, where we use a style transformation module to transfer target style information into an unrelated style input. This method enables training using unpaired content and style samples and thereby mitigates the training-inference mismatch. We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. We conduct thorough evaluations, including both quantitative and qualitative user studies. Our results show that by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies.
    An Artificial Intelligence Dataset for Solar Energy Locations in India. (arXiv:2202.01340v2 [cs.LG] UPDATED)
Rapid development of renewable energy sources, particularly solar photovoltaics (PV), is critical to mitigate climate change. As a result, India has set ambitious goals to install 500 gigawatts of solar energy capacity by 2030. Given the large footprint projected to meet renewable energy targets, the potential for land use conflicts over environmental values is high. To expedite development of solar energy, land use planners will need access to up-to-date and accurate geo-spatial information of PV infrastructure. In this work, we developed a spatially explicit machine learning model to map utility-scale solar projects across India using freely available satellite imagery with a mean accuracy of 92%. Our model predictions were validated by human experts to obtain a dataset of 1363 solar PV farms. Using this dataset, we measured the solar footprint across India and quantified the degree of landcover modification associated with the development of PV infrastructure. Our analysis indicates that over 74% of solar development in India was built on landcover types that have natural ecosystem preservation or agricultural value.
    Receptive Field Analysis of Temporal Convolutional Networks for Monaural Speech Dereverberation. (arXiv:2204.06439v3 [cs.SD] UPDATED)
Speech dereverberation is often an important requirement in robust speech processing tasks. Supervised deep learning (DL) models give state-of-the-art performance for single-channel speech dereverberation. Temporal convolutional networks (TCNs) are commonly used for sequence modelling in speech enhancement tasks. A feature of TCNs is that they have a receptive field (RF) dependent on the specific model configuration, which determines the number of input frames that can be observed to produce an individual output frame. It has been shown that TCNs are capable of performing dereverberation of simulated speech data; however, a thorough analysis, especially with a focus on the RF, is yet lacking in the literature. This paper analyses dereverberation performance depending on the model size and the RF of TCNs. Experiments using the WHAMR corpus, which is extended to include room impulse responses (RIRs) with larger T60 values, demonstrate that a larger RF can yield significant performance improvements when training smaller TCN models. It is also demonstrated that TCNs benefit from a wider RF when dereverberating RIRs with larger RT60 values.
    Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds. (arXiv:2207.00531v1 [cs.CV])
Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and recently, point clouds. Raw automotive datasets are a suitable candidate for self-supervised pre-training as they generally are cheap to collect compared to annotations for tasks like 3D object detection (OD). However, development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward point clouds which are small, dense and have homogeneous point density. In this work, we study masked autoencoding for point clouds in an automotive setting, which are sparse and for which the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves the 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Compared to existing self-supervised methods for automotive data, Voxel-MAE displays up to $2\times$ performance increase. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code will be released.
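A hedged sketch of what such a pre-training step might look like; the encoder/decoder interfaces and loss weighting below are assumptions, not the paper's architecture.

```python
# Sketch of a Voxel-MAE-style objective: mask a fraction of voxels, reconstruct
# the masked voxel features, and classify voxels as empty vs. non-empty.
import torch

def voxel_mae_step(voxel_feats, voxel_occupancy, encoder, decoder, mask_ratio=0.7):
    n = voxel_feats.shape[0]
    masked = torch.rand(n) < mask_ratio
    latent = encoder(voxel_feats[~masked])          # encode visible voxels only
    recon, occ_logits = decoder(latent, masked)     # predict masked content
    loss_recon = torch.nn.functional.mse_loss(recon, voxel_feats[masked])
    loss_occ = torch.nn.functional.binary_cross_entropy_with_logits(
        occ_logits, voxel_occupancy.float())
    return loss_recon + loss_occ                    # equal weighting assumed
```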
    Learning Mean Field Games: A Survey. (arXiv:2205.12944v2 [cs.LG] UPDATED)
Non-cooperative and cooperative games with a very large number of players have many applications but remain generally intractable when the number of players increases. Introduced by Lasry and Lions, and Huang, Caines and Malhamé, Mean Field Games (MFGs) rely on a mean-field approximation to allow the number of players to grow to infinity. Traditional methods for solving these games generally rely on solving partial or stochastic differential equations with full knowledge of the model. Recently, Reinforcement Learning (RL) has shown promise for solving complex problems. By combining MFGs and RL, we hope to solve games at a very large scale both in terms of population size and environment complexity. In this survey, we review the quickly growing recent literature on RL methods to learn Nash equilibria in MFGs. We first identify the most common settings (static, stationary, and evolutive). We then present a general framework for classical iterative methods (based on best-response computation or policy evaluation) to solve MFGs in an exact way. Building on these algorithms and the connection with Markov Decision Processes, we explain how RL can be used to learn MFG solutions in a model-free way. Last, we present numerical illustrations on a benchmark problem, and conclude with some perspectives.
    Habitat 2.0: Training Home Assistants to Rearrange their Habitat. (arXiv:2106.14405v2 [cs.LG] UPDATED)
    We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.
    Improved Generalization Bounds for Adversarially Robust Learning. (arXiv:1810.02180v5 [cs.LG] UPDATED)
We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015) and handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}\big(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta})\big)\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on the fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.
    InQSS: a speech intelligibility and quality assessment model using a multi-task learning network. (arXiv:2111.02585v3 [cs.SD] UPDATED)
Speech intelligibility and quality assessment models are essential tools for researchers to evaluate and improve speech processing models. However, only a few studies have investigated multi-task models for intelligibility and quality assessment due to the limitations of available data. In this study, we release TMHINT-QI, the first Chinese speech dataset that records the quality and intelligibility scores of clean, noisy, and enhanced utterances. We then propose InQSS, a non-intrusive multi-task learning framework for intelligibility and quality assessment. We evaluate InQSS on both training-from-scratch and pretrained models. The experimental results confirm the effectiveness of the InQSS framework. In addition, the resulting model can predict not only the intelligibility scores but also the quality scores of a speech signal.
    Behavioral Player Rating in Competitive Online Shooter Games. (arXiv:2207.00528v1 [cs.LG])
Competitive online games use rating systems for matchmaking: progression-based algorithms that estimate the skill level of players with interpretable ratings in terms of the outcome of the games they played. However, the overall experience of players is shaped by factors beyond the sole outcome of their games. In this paper, we engineer several features from in-game statistics to model players and create ratings that accurately represent their behavior and true performance level. We then compare the estimating power of our behavioral ratings against ratings created with three mainstream rating systems by predicting the rank of players in four popular game modes from the competitive shooter genre. Our results show that the behavioral ratings provide more accurate performance estimations while maintaining the interpretability of the created representations. Considering different aspects of the playing behavior of players and using behavioral ratings for matchmaking can lead to match-ups that are more aligned with players' goals and interests, consequently resulting in a more enjoyable gaming experience.
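Purely as an illustration of the pipeline shape (not the paper's feature set or model), one could derive behavioral features from match statistics and fit a regressor whose output serves as the rating:

```python
# Illustrative sketch; the column names and model choice are assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def behavioral_features(stats):
    # Hypothetical per-match aggregates from in-game statistics.
    return np.column_stack([
        stats["kills"] / np.maximum(stats["deaths"], 1),   # K/D ratio
        stats["accuracy"],                                  # hit accuracy
        stats["objective_time"] / stats["match_time"],      # objective focus
    ])

model = GradientBoostingRegressor()
# model.fit(behavioral_features(train_stats), train_ranks)
# behavioral_rating = model.predict(behavioral_features(player_stats))
```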
    Towards Explanation for Unsupervised Graph-Level Representation Learning. (arXiv:2205.09934v2 [cs.LG] UPDATED)
Due to the superior performance of Graph Neural Networks (GNNs) in various domains, there is an increasing interest in the GNN explanation problem: "which fraction of the input graph is the most crucial to decide the model's decision?" Existing explanation methods focus on supervised settings, e.g., node classification and graph classification, while the explanation for unsupervised graph-level representation learning is still unexplored. The opaqueness of the graph representations may lead to unexpected risks when deployed for high-stakes decision-making scenarios. In this paper, we advance the Information Bottleneck principle (IB) to tackle the proposed explanation problem for unsupervised graph representations, which leads to a novel principle, Unsupervised Subgraph Information Bottleneck (USIB). We also theoretically analyze the connection between graph representations and explanatory subgraphs on the label space, which reveals that the expressiveness and robustness of representations benefit the fidelity of explanatory subgraphs. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our developed explainer and the validity of our theoretical analysis.
    Robust subgroup discovery. (arXiv:2103.13686v4 [cs.LG] UPDATED)
    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v3 [cs.LG] UPDATED)
This paper studies the robustness of data valuation to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
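The MSR idea is easy to sketch: every sampled subset contributes to the estimate of every data point at once. A minimal Monte-Carlo version, with `utility` standing in for any model-performance oracle (e.g. validation accuracy after training on the subset):

```python
# Monte-Carlo Banzhaf estimation with Maximum Sample Reuse: each sampled
# subset updates the running estimate of *all* points simultaneously.
import numpy as np

def banzhaf_msr(n_points, utility, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    sums = np.zeros((2, n_points))    # row 0: point excluded, row 1: included
    counts = np.zeros((2, n_points))
    for _ in range(n_samples):
        subset = rng.random(n_points) < 0.5     # each point in/out w.p. 1/2
        u = utility(np.flatnonzero(subset))     # model performance oracle
        sums[1, subset] += u; counts[1, subset] += 1
        sums[0, ~subset] += u; counts[0, ~subset] += 1
    # Banzhaf value: mean utility with the point minus mean utility without it.
    return sums[1] / np.maximum(counts[1], 1) - sums[0] / np.maximum(counts[0], 1)
```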
    Evaluating the Explainers: Black-Box Explainable Machine Learning for Student Success Prediction in MOOCs. (arXiv:2207.00551v1 [cs.LG])
    Neural networks are ubiquitous in applied machine learning for education. Their pervasive success in predictive performance comes alongside a severe weakness, the lack of explainability of their decisions, especially relevant in human-centric fields. We implement five state-of-the-art methodologies for explaining black-box machine learning models (LIME, PermutationSHAP, KernelSHAP, DiCE, CEM) and examine the strengths of each approach on the downstream task of student performance prediction for five massive open online courses. Our experiments demonstrate that the families of explainers do not agree with each other on feature importance for the same Bidirectional LSTM models with the same representative set of students. We use Principal Component Analysis, Jensen-Shannon distance, and Spearman's rank-order correlation to quantitatively cross-examine explanations across methods and courses. Furthermore, we validate explainer performance across curriculum-based prerequisite relationships. Our results come to the concerning conclusion that the choice of explainer is an important decision and is in fact paramount to the interpretation of the predictive results, even more so than the course the model is trained on. Source code and models are released at this http URL
    Learning Lattice Quantum Field Theories with Equivariant Continuous Flows. (arXiv:2207.00283v1 [hep-lat])
    We propose a novel machine learning method for sampling from the high-dimensional probability distributions of Lattice Quantum Field Theories. Instead of the deep architectures used so far for this task, our proposal is based on a single neural ODE layer and incorporates the full symmetries of the problem. We test our model on the $\phi^4$ theory, showing that it systematically outperforms previously proposed flow-based methods in sampling efficiency, and the improvement is especially pronounced for larger lattices. Compared to the previous baseline model, we improve a key metric, the effective sample size, from 1% to 91% on a lattice of size $32\times 32$. We also demonstrate that our model can successfully learn a continuous family of theories at once, and the results of learning can be transferred to larger lattices. Such generalization capacities further accentuate the potential advantages of machine learning methods compared to traditional MCMC-based methods.
    Secure Forward Aggregation for Vertical Federated Neural Networks. (arXiv:2207.00165v1 [cs.CR])
Vertical federated learning (VFL) is attracting much attention because it enables cross-silo data cooperation in a privacy-preserving manner. While most research works in VFL focus on linear and tree models, deep models (e.g., neural networks) are not well studied in VFL. In this paper, we focus on SplitNN, a well-known neural network framework in VFL, and identify a trade-off between data security and model performance in SplitNN. Briefly, SplitNN trains the model by exchanging gradients and transformed data. On the one hand, SplitNN suffers from a loss of model performance since multiple parties jointly train the model using transformed data instead of raw data, and a large amount of low-level feature information is discarded. On the other hand, a naive solution of increasing the model performance through aggregating at lower layers in SplitNN (i.e., the data is less transformed and more low-level features are preserved) makes raw data vulnerable to inference attacks. To mitigate the above trade-off, we propose a new neural network protocol in VFL called Secure Forward Aggregation (SFA). It changes the way of aggregating the transformed data and adopts removable masks to protect the raw data. Experiment results show that networks with SFA achieve both data security and high model performance.
    Conditional Variable Selection for Intelligent Test. (arXiv:2207.00335v1 [cs.LG])
Intelligent test requires efficient and effective analysis of high-dimensional data on a large scale. Traditionally, the analysis is often conducted by human experts, but it is not scalable in the era of big data. To tackle this challenge, variable selection has recently been introduced to intelligent test. However, in practice, we encounter scenarios where certain variables (e.g. some specific processing conditions for a device under test) must be maintained after variable selection. We call this conditional variable selection, which has not been well investigated for embedded or deep-learning-based variable selection methods. In this paper, we discuss a novel conditional variable selection framework that can select the most important candidate variables given a set of preselected variables.
    Rapid training of quantum recurrent neural network. (arXiv:2207.00378v1 [quant-ph])
Time series prediction is a crucial task for many human activities, e.g. weather forecasts or predicting stock prices. One solution to this problem is to use Recurrent Neural Networks (RNNs). Although they can yield accurate predictions, their learning process is slow and complex. Here we propose a Quantum Recurrent Neural Network (QRNN) to address these obstacles. The design of the network is based on the continuous-variable quantum computing paradigm. We demonstrate that the network is capable of learning the time dependence of a few types of temporal data. Our numerical simulations show that the QRNN converges to optimal weights in fewer epochs than the classical network. Furthermore, for a small number of trainable parameters it can achieve lower loss than the latter.
    Non-Parametric Inference of Relational Dependence. (arXiv:2207.00163v1 [stat.ML])
Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.), but that assumption is violated in many real-world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests.
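A small sketch of the aggregation step, assuming the kernel mean embedding is approximated with random Fourier features; the actual test statistic and structural assumptions of the paper are not reproduced here.

```python
# Represent each instance's variable-sized set of relational neighbours by a
# fixed-length kernel mean embedding (RBF kernel via random Fourier features).
import numpy as np

def mean_embedding(neighbor_sets, dim_in, n_feats=100, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(dim_in, n_feats))
    b = rng.uniform(0, 2 * np.pi, n_feats)
    embed = lambda X: np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)
    # One vector per instance = mean of its neighbours' feature maps; a
    # standard kernel independence test can then run on these embeddings.
    return np.stack([embed(S).mean(axis=0) for S in neighbor_sets])
```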
    Energy Efficient Routing For Underwater Acoustic Sensor Network Using Genetic Algorithm. (arXiv:2207.00416v1 [cs.NI])
In underwater acoustic sensor networks (UWASN), energy-reliable data transmission is a challenging task. This is due to acoustic transmission disturbances caused by excessive noise, exceptionally long propagation delays, a high bit error rate, limited bandwidth capability, and interference. One of the most important research issues in UWASN is how to extend the lifetime of data transmission. Data transfer from a source node to a destination node in UWASN is a complicated topic for researchers. Many routing algorithms, such as vector-based forwarding and depth-based routing, have been developed in past years. We propose a genetic algorithm-based optimization method for improving the energy efficiency of data transmission on the routing path from a source node to a destination node.
    VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations. (arXiv:2207.00221v1 [cs.CV])
Vision-Language Pretraining (VLP) models have recently successfully facilitated many cross-modal downstream tasks. Most existing works evaluated their systems by comparing the fine-tuned downstream task performance. However, average downstream task accuracy alone provides little information about the pros and cons of each VLP method, let alone insights on how the community can improve the systems in the future. Inspired by the CheckList for testing natural language processing, we introduce VL-CheckList, a novel framework to understand the capabilities of VLP models. The proposed method divides the image-texting ability of a VLP model into three categories: objects, attributes, and relations, and uses a novel taxonomy to further break down these three aspects. We conduct comprehensive studies to analyze seven recently popular VLP models via the proposed framework. Results confirm the effectiveness of the proposed method by revealing fine-grained differences among the compared models that were not visible from downstream task-only evaluation. Further results show promising research directions in building better VLP models. Data and Code: https://github.com/om-ai-lab/VL-CheckList
    Performative Reinforcement Learning. (arXiv:2207.00046v1 [cs.LG])
We introduce the framework of performative reinforcement learning, where the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction (Perdomo et al., 2020), we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest in analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results to the setting where the learner just performs gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. For both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate the dependence of convergence on various parameters, e.g. regularization, smoothness, and the number of samples.
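Conceptually, the repeated-optimization result can be pictured as the loop below; `induced_env` and `solve_regularized_mdp` are hypothetical helpers standing in for the decision-dependent environment and the regularized solver.

```python
# Conceptual sketch of repeated retraining toward a performatively stable
# policy: the environment reacts to the deployed policy, and we re-solve a
# regularized RL objective until the policy stops moving.
import numpy as np

def repeated_retraining(policy, induced_env, solve_regularized_mdp,
                        reg=0.1, tol=1e-4, max_iters=100):
    for _ in range(max_iters):
        env = induced_env(policy)                     # world reacts to policy
        new_policy = solve_regularized_mdp(env, reg)  # re-optimize objective
        if np.max(np.abs(new_policy - policy)) < tol: # performative stability
            return new_policy
        policy = new_policy
    return policy
```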
    Usable Region Estimate for Assessing Practical Usability of Medical Image Segmentation Models. (arXiv:2207.00156v1 [eess.IV])
    We aim to quantitatively measure the practical usability of medical image segmentation models: to what extent, how often, and on which samples a model's predictions can be used/trusted. We first propose a measure, Correctness-Confidence Rank Correlation (CCRC), to capture how predictions' confidence estimates correlate with their correctness scores in rank. A model with a high value of CCRC means its prediction confidences reliably suggest which samples' predictions are more likely to be correct. Since CCRC does not capture the actual prediction correctness, it alone is insufficient to indicate whether a prediction model is both accurate and reliable to use in practice. Therefore, we further propose another method, Usable Region Estimate (URE), which simultaneously quantifies predictions' correctness and reliability of confidence assessments in one estimate. URE provides concrete information on to what extent a model's predictions are usable. In addition, the sizes of usable regions (UR) can be utilized to compare models: A model with a larger UR can be taken as a more usable and hence better model. Experiments on six datasets validate that the proposed evaluation methods perform well, providing a concrete and concise measure for the practical usability of medical image segmentation models. Code is made available at https://github.com/yizhezhang2000/ure.
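A minimal sketch of both quantities under simplifying assumptions (CCRC as a plain Spearman correlation, and the usable region as the largest high-confidence prefix whose mean correctness clears a threshold):

```python
# Simplified proxies for CCRC and URE; the paper's exact estimators may differ.
import numpy as np
from scipy.stats import spearmanr

def ccrc(confidences, correctness):
    """Rank correlation between confidence and correctness (e.g. Dice)."""
    return spearmanr(confidences, correctness).correlation

def usable_region(confidences, correctness, threshold=0.9):
    """Fraction of samples, taken in decreasing confidence order, whose
    running mean correctness stays at or above the threshold."""
    order = np.argsort(-np.asarray(confidences))       # most confident first
    sorted_corr = np.asarray(correctness)[order]
    running = np.cumsum(sorted_corr) / np.arange(1, len(sorted_corr) + 1)
    ok = np.flatnonzero(running >= threshold)
    return (ok[-1] + 1) / len(sorted_corr) if ok.size else 0.0
```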
    AI in 6G: Energy-Efficient Distributed Machine Learning for Multilayer Heterogeneous Networks. (arXiv:2207.00415v1 [cs.NI])
Adept network management is key for supporting extremely heterogeneous applications with stringent quality of service (QoS) requirements; this is all the more so when envisioning the complex and ultra-dense 6G mobile heterogeneous network (HetNet). From both environmental and economic perspectives, non-homogeneous QoS demands obstruct the minimization of the energy footprints and operational costs of the envisioned robust networks. As such, network intelligentization is expected to play an essential role in the realization of such sophisticated aims. The fusion of artificial intelligence (AI) and mobile networks will allow for the dynamic and automatic configuration of network functionalities. Machine learning (ML), one of the backbones of AI, will be instrumental in forecasting changes in network loads and resource utilization, estimating channel conditions, optimizing network slicing, and enhancing security and encryption. However, it is well known that ML tasks themselves incur massive computational burdens and energy costs. To overcome such obstacles, we propose a novel layer-based HetNet architecture which optimally distributes tasks associated with different ML approaches across network layers and entities; such a HetNet boasts multiple access schemes as well as device-to-device (D2D) communications to enhance energy efficiency via collaborative learning and communications.
    MotionMixer: MLP-based 3D Human Body Pose Forecasting. (arXiv:2207.00499v1 [cs.CV])
In this work, we present MotionMixer, an efficient 3D human body pose forecasting model based solely on multi-layer perceptrons (MLPs). MotionMixer learns the spatial-temporal 3D body pose dependencies by sequentially mixing both modalities. Given a stacked sequence of 3D body poses, a spatial-MLP extracts fine-grained spatial dependencies of the body joints. The interaction of the body joints over time is then modelled by a temporal MLP. The spatial-temporal mixed features are finally aggregated and decoded to obtain the future motion. To calibrate the influence of each time step in the pose sequence, we make use of squeeze-and-excitation (SE) blocks. We evaluate our approach on the Human3.6M, AMASS, and 3DPW datasets using the standard evaluation protocols. For all evaluations, we demonstrate state-of-the-art performance, while having a model with a smaller number of parameters. Our code is available at: https://github.com/MotionMLP/MotionMixer
    Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning. (arXiv:2207.00234v1 [cs.LG])
This article seeks a distributed learning solution for visual transformer (ViT) architectures. Compared to convolutional neural network (CNN) architectures, ViTs often have larger model sizes and are computationally expensive, making federated learning (FL) ill-suited. Split learning (SL) can detour this problem by splitting a model and communicating the hidden representations at the split layer, also known as smashed data. Notwithstanding, the smashed data of ViT are as large as and as similar to the input data, negating the communication efficiency of SL while violating data privacy. To resolve these issues, we propose a new form of CutSmashed data created by randomly punching and compressing the original smashed data. Leveraging this, we develop a novel SL framework for ViT, coined CutMixSL, which communicates CutSmashed data. CutMixSL not only reduces communication costs and privacy leakage, but also inherently involves the CutMix data augmentation, improving accuracy and scalability. Simulations corroborate that CutMixSL outperforms baselines such as parallelized SL and SplitFed, which integrates FL with SL.
    Visual Pre-training for Navigation: What Can We Learn from Noise?. (arXiv:2207.00052v1 [cs.CV])
A powerful paradigm for sensorimotor control is to predict actions from observations directly. Training such an end-to-end system allows representations useful for downstream tasks to emerge automatically. In visual navigation, an agent can learn to navigate without any manual designs by correlating how its views change with the actions being taken. However, the lack of inductive bias makes this system data-inefficient and impractical in scenarios like search and rescue, where interacting with the environment to collect data is costly. We hypothesize that a sufficient representation of the current view and the goal view for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training such random crop prediction in a self-supervised fashion purely on random noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. Code is available at https://github.com/yanweiw/noise2ptz
    FLVoogd: Robust And Privacy Preserving Federated Learning. (arXiv:2207.00428v1 [cs.CR])
In this work, we propose FLVoogd, an updated federated learning method in which servers and clients collaboratively eliminate Byzantine attacks while preserving privacy. In particular, servers use automatic Density-based Spatial Clustering of Applications with Noise (DBSCAN) combined with S2PC to cluster the benign majority without acquiring sensitive personal information. Meanwhile, clients build dual models and perform test-based distance controlling to adjust their local models toward the global one to achieve personalization. Our framework is automatic and adaptive, so servers/clients do not need to tune the parameters during training. In addition, our framework leverages Secure Multi-party Computation (SMPC) operations, including multiplications, additions, and comparisons, where costly operations, like division and square root, are not required. Evaluations are carried out on some conventional datasets from the image classification field. The results show that FLVoogd can effectively reject malicious uploads in most scenarios; meanwhile, it avoids data leakage from the server side.
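The server-side filtering idea can be sketched as below, assuming plaintext access for clarity; in the paper this clustering runs under secure two-party computation, and the DBSCAN parameters here are illustrative.

```python
# Sketch: cluster client updates with DBSCAN, keep the dense benign majority,
# and aggregate only those updates. Parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

def filter_and_aggregate(client_updates, eps=0.5, min_samples=3):
    X = np.stack([u.ravel() for u in client_updates])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clusters = set(labels) - {-1}
    if not clusters:                       # everything flagged as noise
        return X.mean(axis=0)
    benign = max(clusters, key=list(labels).count)  # largest cluster
    return X[labels == benign].mean(axis=0)         # aggregate benign majority
```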
    Weakly-supervised High-fidelity Ultrasound Video Synthesis with Feature Decoupling. (arXiv:2207.00474v1 [cs.CV])
Ultrasound (US) is widely used for its advantages of real-time imaging, absence of radiation, and portability. In clinical practice, analysis and diagnosis often rely on US sequences rather than a single image to obtain dynamic anatomical information. This is challenging for novices to learn because practicing with adequate videos from patients is clinically impractical. In this paper, we propose a novel framework to synthesize high-fidelity US videos. Specifically, the synthesized videos are generated by animating source content images based on the motion of given driving videos. Our highlights are three-fold. First, leveraging the advantages of self- and fully-supervised learning, our proposed system is trained in a weakly-supervised manner for keypoint detection. These keypoints then provide vital information for handling complex high-dynamic motions in US videos. Second, we decouple content and texture learning using dual decoders to effectively reduce the model learning difficulty. Last, we adopt an adversarial training strategy with GAN losses to further improve the sharpness of the generated videos, narrowing the gap between real and synthesized videos. We validate our method on a large in-house pelvic dataset with high dynamic motion. Extensive evaluation metrics and a user study prove the effectiveness of our proposed method.
    Stain Isolation-based Guidance for Improved Stain Translation. (arXiv:2207.00431v1 [cs.CV])
Unsupervised and unpaired domain translation using generative adversarial neural networks, and more precisely CycleGAN, is the state of the art for the stain translation of histopathology images. It often, however, suffers from the presence of cycle-consistent but non-structure-preserving errors. We propose an alternative approach to the set of methods which, relying on segmentation consistency, enable the preservation of pathology structures. Focusing on immunohistochemistry (IHC) and multiplexed immunofluorescence (mIF), we introduce a simple yet effective guidance scheme as a loss function that leverages the consistency of stain translation with stain isolation. Qualitative and quantitative experiments show the ability of the proposed approach to improve translation between the two domains.
    Modularity Optimization as a Training Criterion for Graph Neural Networks. (arXiv:2207.00107v1 [cs.LG])
Graph convolution is a recent scalable method for performing deep feature learning on attributed graphs by aggregating local node information over multiple layers. Such layers only consider attribute information of node neighbors in the forward model and do not incorporate knowledge of global network structure in the learning task. In particular, the modularity function provides a convenient source of information about the community structure of networks. In this work we investigate the effect of incorporating community structure preservation objectives into the graph convolutional model on the quality of learned representations. We incorporate the objectives in two ways: through an explicit regularization term in the cost function in the output layer, and as an additional loss term computed via an auxiliary layer. We report the effect of community structure preserving terms in the graph convolutional architectures. Experimental evaluation on two attributed bibliographic networks showed that the incorporation of the community-preserving objective improves semi-supervised node classification accuracy in the sparse label regime.
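As a sketch of the explicit-regularization variant, one can add a soft modularity term to the classification loss; the soft assignment matrix `C` and the weight `lambda_mod` are assumptions for illustration.

```python
# Soft modularity as an auxiliary training term for a GCN. A is the (dense)
# adjacency matrix; C holds soft community assignments, e.g. a softmax over
# an auxiliary output layer.
import torch

def modularity_loss(A, C):
    k = A.sum(dim=1, keepdim=True)                  # node degrees
    two_m = A.sum()                                 # 2m = total edge weight
    B = A - (k @ k.t()) / two_m                     # modularity matrix
    return -torch.trace(C.t() @ B @ C) / two_m      # negative (soft) modularity

# total_loss = cross_entropy(logits[train_mask], y[train_mask]) \
#              + lambda_mod * modularity_loss(A, soft_assignments)
```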
    ProSelfLC: Progressive Self Label Correction Towards A Low-Temperature Entropy State. (arXiv:2207.00118v1 [cs.LG])
To train robust deep neural networks (DNNs), we systematically study several target modification approaches, which include output regularisation, and self and non-self label correction (LC). Three key issues are discovered: (1) Self LC is the most appealing as it exploits its own knowledge and requires no extra models. However, how to automatically decide the trust degree of a learner as training goes on is not well answered in the literature. (2) Some methods penalise while others reward low-entropy predictions, prompting us to ask which one is better. (3) Using the standard training setting, a trained network is of low confidence when severe noise exists, making it hard to leverage its high-entropy self knowledge. To resolve issue (1), taking two well-accepted propositions (deep neural networks learn meaningful patterns before fitting noise, and the minimum entropy regularisation principle), we propose a novel end-to-end method named ProSelfLC, which is designed according to learning time and entropy. Specifically, given a data point, we progressively increase trust in its predicted label distribution versus its annotated one if a model has been trained for enough time and the prediction is of low entropy (high confidence). For issue (2), according to ProSelfLC, we empirically prove that it is better to redefine a meaningful low-entropy status and optimise the learner toward it. This serves as a defence of entropy minimisation. To address issue (3), we decrease the entropy of self knowledge using a low temperature before exploiting it to correct labels, so that the revised labels redefine a low-entropy target state. We demonstrate the effectiveness of ProSelfLC through extensive experiments in both clean and noisy settings, and on both image and protein datasets. Furthermore, our source code is available at https://github.com/XinshaoAmosWang/ProSelfLC-AT.
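A minimal sketch of a target-update rule in this spirit, where trust in the (temperature-sharpened) self prediction grows with training time and prediction confidence; the exact schedule below is an assumption, not the paper's formula.

```python
# Progressive self label correction, sketched: blend the annotated one-hot
# target with the model's low-temperature prediction, with per-sample trust.
import torch

def proselflc_target(one_hot, logits, t, t_total, temperature=0.5):
    pred = torch.softmax(logits / temperature, dim=-1)  # sharpened self knowledge
    global_trust = (t / t_total) ** 2                   # grows with training time
    # Confidence = 1 - normalized entropy; low entropy -> high confidence.
    ent = -(pred * pred.clamp_min(1e-8).log()).sum(-1)
    conf = 1 - ent / torch.log(torch.tensor(float(pred.shape[-1])))
    eps = global_trust * conf.unsqueeze(-1)             # per-sample trust
    return (1 - eps) * one_hot + eps * pred             # corrected target
```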
    Reliable Representations Make A Stronger Defender: Unsupervised Structure Refinement for Robust GNN. (arXiv:2207.00012v1 [cs.LG])
Benefiting from the message passing mechanism, Graph Neural Networks (GNNs) have been successful on a wide range of tasks over graph data. However, recent studies have shown that attackers can catastrophically degrade the performance of GNNs by maliciously modifying the graph structure. A straightforward solution to remedy this issue is to model the edge weights by learning a metric function between pairwise representations of two end nodes, which attempts to assign low weights to adversarial edges. The existing methods use either raw features or representations learned by supervised GNNs to model the edge weights. However, both strategies face immediate problems: raw features cannot represent various properties of nodes (e.g., structure information), and representations learned by a supervised GNN may suffer from the poor performance of the classifier on the poisoned graph. We need representations that carry both feature information and as much correct structure information as possible and that are insensitive to structural perturbations. To this end, we propose an unsupervised pipeline, named STABLE, to optimize the graph structure. Finally, we input the well-refined graph into a downstream classifier. For this part, we design an advanced GCN that significantly enhances the robustness of the vanilla GCN without increasing the time complexity. Extensive experiments on four real-world graph benchmarks demonstrate that STABLE outperforms the state-of-the-art methods and successfully defends against various attacks.
    Variational Autoencoder Assisted Neural Network Likelihood RSRP Prediction Model. (arXiv:2207.00166v1 [cs.NI])
Measuring customer experience on mobile data is of utmost importance for global mobile operators. The reference signal received power (RSRP) is one of the important indicators for current mobile network management, evaluation and monitoring. Radio data gathered through the minimization of drive test (MDT), a 3GPP standard technique, is commonly used for radio network analysis. Collecting MDT data in different geographical areas is inefficient and constrained by the terrain conditions and user presence, hence it is not an adequate technique for dynamic radio environments. In this paper, we study a generative model for RSRP prediction, exploiting MDT data and a digital twin (DT), and propose a data-driven, two-tier neural network (NN) model. In the first tier, environmental information related to user equipment (UE), base stations (BS) and network key performance indicators (KPI) is extracted through a variational autoencoder (VAE). The second tier is designed as a likelihood model. Here, the environmental features and real MDT data features are adopted, formulating an integrated training process. On validation, our proposed model that uses real-world data demonstrates an accuracy improvement of about 20% or more compared with the empirical model and about 10% when compared with a fully connected prediction network.
    WNet: A data-driven dual-domain denoising model for sparse-view computed tomography with a trainable reconstruction layer. (arXiv:2207.00400v1 [eess.IV])
Deep learning based solutions are being successfully implemented for a wide variety of applications. Most notably, clinical use-cases have gained an increased interest and have been the main driver behind some of the cutting-edge data-driven algorithms proposed in recent years. For applications like sparse-view tomographic reconstructions, where the amount of measurement data is kept small so that acquisition times stay short and radiation dose low, the reduction of streaking artifacts has prompted the development of data-driven denoising algorithms with the main goal of obtaining diagnostically viable images from only a subset of full-scan data. We propose WNet, a data-driven dual-domain denoising model which contains a trainable reconstruction layer for sparse-view artifact denoising. Two encoder-decoder networks perform denoising in both the sinogram- and reconstruction-domain simultaneously, while a third layer implementing the Filtered Backprojection algorithm is sandwiched between the first two and takes care of the reconstruction operation. We investigate the performance of the network on sparse-view chest CT scans, and we highlight the added benefit of having a trainable reconstruction layer over the more conventional fixed ones. We train and test our network on two clinically relevant datasets, and we compare the obtained results with three different types of sparse-view CT denoising and reconstruction algorithms.
    Effect of Homomorphic Encryption on the Performance of Training Federated Learning Generative Adversarial Networks. (arXiv:2207.00263v1 [cs.CR])
A Generative Adversarial Network (GAN) is a deep-learning generative model in the field of Machine Learning (ML) that involves training two Neural Networks (NN) using a sizable data set. In certain fields, such as medicine, the training data may be hospital patient records that are stored across different hospitals. The classic centralized approach would involve sending the data to a centralized server where the model would be trained. However, that would involve breaching the privacy and confidentiality of the patients and their data, which would be unacceptable. Therefore, Federated Learning (FL), an ML technique that trains ML models in a distributed setting without data ever leaving the host device, would be a better alternative to the centralized option. In this ML technique, only parameters and certain metadata would be communicated. In spite of that, there still exist attacks that can infer user data using the parameters and metadata. A fully privacy-preserving solution involves homomorphically encrypting (HE) the data communicated. This paper focuses on the performance loss of training an FL-GAN with three different types of Homomorphic Encryption: Partial Homomorphic Encryption (PHE), Somewhat Homomorphic Encryption (SHE), and Fully Homomorphic Encryption (FHE). We also test the performance loss of Multi-Party Computation (MPC), as it has homomorphic properties. The performances are compared to the performance of training an FL-GAN without encryption. Our experiments show that the more complex the encryption method is, the longer it takes, with the extra time taken for HE being quite significant in comparison to the base case of FL.
    DarKnight: An Accelerated Framework for Privacy and Integrity Preserving Deep Learning Using Trusted Hardware. (arXiv:2207.00083v1 [cs.CR])
    Privacy and security-related concerns are growing as machine learning reaches diverse application domains. The data holders want to train or infer with private data while exploiting accelerators, such as GPUs, that are hosted in the cloud. Cloud systems are vulnerable to attackers that compromise the privacy of data and integrity of computations. Tackling such a challenge requires unifying theoretical privacy algorithms with hardware security capabilities. This paper presents DarKnight, a framework for large DNN training while protecting input privacy and computation integrity. DarKnight relies on cooperative execution between trusted execution environments (TEE) and accelerators, where the TEE provides privacy and integrity verification, while accelerators perform the bulk of the linear algebraic computation to optimize the performance. In particular, DarKnight uses a customized data encoding strategy based on matrix masking to create input obfuscation within a TEE. The obfuscated data is then offloaded to GPUs for fast linear algebraic computation. DarKnight's data obfuscation strategy provides provable data privacy and computation integrity in the cloud servers. While prior works tackle inference privacy and cannot be utilized for training, DarKnight's encoding scheme is designed to support both training and inference.
    Cactus Mechanisms: Optimal Differential Privacy Mechanisms in the Large-Composition Regime. (arXiv:2207.00420v1 [cs.CR])
Most differential privacy mechanisms are applied (i.e., composed) numerous times on sensitive data. We study the design of optimal differential privacy mechanisms in the limit of a large number of compositions. As a consequence of the law of large numbers, in this regime the best privacy mechanism is the one that minimizes the Kullback-Leibler divergence between the conditional output distributions of the mechanism given two different inputs. We formulate an optimization problem to minimize this divergence subject to a cost constraint on the noise. We first prove that additive mechanisms are optimal. Since the optimization problem is infinite-dimensional, it cannot be solved directly; nevertheless, we quantize the problem to derive near-optimal additive mechanisms that we call "cactus mechanisms" due to their shape. We show that our quantization approach can be arbitrarily close to an optimal mechanism. Surprisingly, for quadratic cost, the Gaussian mechanism is strictly sub-optimal compared to this cactus mechanism. Finally, we provide numerical results which indicate that the cactus mechanism outperforms the Gaussian mechanism for a finite number of compositions.
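The design criterion is easy to probe numerically: for an additive mechanism with noise density p, compute the KL divergence between the output distributions for two inputs at unit distance. The grid-based sketch below checks the Gaussian case.

```python
# Numerical sketch of the large-composition criterion: KL divergence between
# the outputs of an additive mechanism for neighbouring inputs x and x' = x+1.
import numpy as np

def kl_shifted(p, grid, shift=1.0):
    """KL( p(z) || p(z - shift) ) on a discrete grid (numerical sketch)."""
    dz = grid[1] - grid[0]
    q = np.interp(grid - shift, grid, p, left=1e-300, right=1e-300)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dz

grid = np.linspace(-10, 10, 4001)
p = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # unit Gaussian noise density
print(kl_shifted(p, grid))                       # ~0.5 = 1/(2 sigma^2)
```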
    A Rare Topic Discovery Model for Short Texts Based on Co-occurrence word Network. (arXiv:2207.00432v1 [cs.IR])
We provide a simple and general solution for the discovery of scarce topics in unbalanced short-text datasets, namely, a word co-occurrence network-based model CWIBTD, which can simultaneously address the sparsity and unbalance of short-text topics and attenuate the effect of occasional pairwise occurrences of words, allowing the model to focus more on the discovery of scarce topics. Unlike previous approaches, CWIBTD uses co-occurrence word networks to model the topic distribution of each word, which improves the semantic density of the data space and ensures its sensitivity in identifying rare topics by improving the way node activity is calculated and normalizing scarce topics and large topics to some extent. In addition, using the same Gibbs sampling as LDA makes CWIBTD easy to extend to various application scenarios. Extensive experimental validation on the unbalanced short-text dataset confirms the superiority of CWIBTD over the baseline approach in discovering rare topics. Our model can be used for early and accurate discovery of emerging topics or unexpected events on social platforms.
    Learning Subject-Invariant Representations from Speech-Evoked EEG Using Variational Autoencoders. (arXiv:2207.00323v1 [eess.AS])
The electroencephalogram (EEG) is a powerful method to understand how the brain processes speech. Linear models have recently been replaced for this purpose with deep neural networks that yield promising results. In related EEG classification fields, it has been shown that explicitly modeling subject-invariant features improves the generalization of models across subjects and benefits classification accuracy. In this work, we adapt factorized hierarchical variational autoencoders to exploit parallel EEG recordings of the same stimuli. We model EEG into two disentangled latent spaces. Subject accuracy reaches 98.96% and 1.60% on the subject and content latent spaces, respectively, whereas binary content classification experiments reach an accuracy of 51.51% and 62.91% on the subject and content latent spaces, respectively.
    Can we learn from developer mistakes? Learning to localize and repair real bugs from real bug fixes. (arXiv:2207.00301v1 [cs.SE])
    Real bug fixes found in open source repositories seem to be the perfect source for learning to localize and repair real bugs. However, the absence of large-scale bug fix collections has made it difficult to effectively exploit real bug fixes in the training of larger neural models in the past. In contrast, artificial bugs -- produced by mutating existing source code -- can easily be obtained at a sufficient scale and are therefore often preferred in the training of existing approaches. Still, localization and repair models that are trained on artificial bugs usually underperform when faced with real bugs. This raises the question of whether bug localization and repair models trained on real bug fixes are more effective in localizing and repairing real bugs. We address this question by introducing RealiT, a pre-train-and-fine-tune approach for effectively learning to localize and repair real bugs from real bug fixes. RealiT is first pre-trained on a large number of artificial bugs produced by traditional mutation operators and then fine-tuned on a smaller set of real bug fixes. Fine-tuning does not require any modifications of the learning algorithm and hence can be easily adopted in various training scenarios for bug localization or repair (even when real training data is scarce). In addition, we found that training on real bug fixes with RealiT is empirically powerful, nearly doubling the localization performance of an existing model on real bugs while maintaining or even improving the repair performance.
    Ranking in Contextual Multi-Armed Bandits. (arXiv:2207.00109v1 [stat.ML])
    We study a ranking problem in the contextual multi-armed bandit setting. A learning agent selects an ordered list of items at each time step and observes stochastic outcomes for each position. In online recommendation systems, showing an ordered list of the most attractive items would not be the best choice, since both position and item dependencies result in a complicated reward function. A simple example is the lack of diversity when all the most attractive items are from the same category. We model position and item dependencies in the ordered list and design UCB and Thompson Sampling type algorithms for this problem. We prove that the regret bound over $T$ rounds and $L$ positions is $\tilde{O}(L\sqrt{d T})$, which has the same order as previous works with respect to $T$ and increases only linearly with $L$. Our work generalizes existing studies in several directions, including position dependencies, where position discount is a particular case, and proposes a more general contextual bandit model.
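    The skeleton shared by UCB-style list selection is short enough to sketch: score every item by its estimated reward plus an exploration bonus and return the top-L. The sketch below is a generic LinUCB-style loop under assumed names (select_list, update, alpha); the paper's actual algorithms additionally model position and item dependencies in the reward.

        import numpy as np

        # Generic LinUCB-style top-L list selection; position/item dependencies
        # from the paper are not modeled here.
        rng = np.random.default_rng(1)
        d, n_items, L, alpha = 8, 50, 5, 1.0
        A = np.eye(d)                      # regularized design matrix
        b = np.zeros(d)

        def select_list(contexts):
            """contexts: (n_items, d); returns indices of the ordered list to show."""
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b              # ridge estimate of the reward parameter
            bonus = np.sqrt(np.einsum('id,de,ie->i', contexts, A_inv, contexts))
            return np.argsort(-(contexts @ theta + alpha * bonus))[:L]

        def update(x, reward):
            """Rank-one update after observing the reward at one position."""
            global A, b
            A = A + np.outer(x, x)
            b = b + reward * x

        contexts = rng.normal(size=(n_items, d))
        for pos in select_list(contexts):  # simulate position-level feedback
            update(contexts[pos], reward=rng.random())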
    Advances in Prediction of Readmission Rates Using Long Term Short Term Memory Networks on Healthcare Insurance Data. (arXiv:2207.00066v1 [cs.LG])
    30-day hospital readmission is a long-standing medical problem that affects patients' morbidity and mortality and costs billions of dollars annually. Recently, machine learning models have been created to predict the risk of inpatient readmission for patients with specific diseases; however, no model exists to predict this risk across all patients. We developed a bi-directional Long Short Term Memory (LSTM) network that is able to use readily available insurance data (inpatient visits, outpatient visits, and drug prescriptions) to predict 30-day readmission for any admitted patient, regardless of reason. The top-performing model achieved an ROC AUC of 0.763 (0.011) when using historical, inpatient, and post-discharge data. The LSTM model significantly outperformed a baseline random forest classifier, indicating that understanding the sequence of events is important for model prediction. Incorporation of 30 days of historical data also significantly improved model performance compared to inpatient data alone, indicating that a patient's clinical history prior to admission, including outpatient visits and pharmacy data, is a strong contributor to readmission. Our results demonstrate that a machine learning model is able to predict risk of inpatient readmission with reasonable accuracy for all patients using structured insurance billing data. Because billing data or equivalent surrogates can be extracted from sites, such a model could be deployed to identify patients at risk for readmission before they are discharged, or to assign more robust follow-up (closer follow-up, home health, mailed medications) to at-risk patients after discharge.
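    A bi-directional LSTM over a sequence of visit-level feature vectors with a single logit head is the standard shape such a model takes; the sketch below shows that shape in PyTorch. The feature dimensions, the last-time-step pooling, and the single linear head are assumptions of this sketch, not the authors' exact architecture.

        import torch
        import torch.nn as nn

        # Hedged sketch of a bi-directional LSTM readmission classifier over a
        # sequence of claim/visit embeddings.
        class ReadmissionLSTM(nn.Module):
            def __init__(self, n_features=64, hidden=128):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
                self.head = nn.Linear(2 * hidden, 1)   # 2x hidden for the two directions

            def forward(self, x):                      # x: (batch, time, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])           # logit for 30-day readmission

        model = ReadmissionLSTM()
        logits = model(torch.randn(32, 30, 64))        # 30 days of history per patient
        loss = nn.BCEWithLogitsLoss()(logits.squeeze(1), torch.randint(0, 2, (32,)).float())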
    Variational Inference for Additive Main and Multiplicative Interaction Effects Models. (arXiv:2207.00011v1 [stat.ML])
    In plant breeding, the presence of a genotype by environment (GxE) interaction has a strong impact on cultivation decision making and the introduction of new crop cultivars. The combination of linear and bilinear terms has been shown to be very useful in modelling this type of data. A widely-used approach to identify GxE is the Additive Main Effects and Multiplicative Interaction Effects (AMMI) model. However, as such data can frequently be high-dimensional, Markov chain Monte Carlo (MCMC) approaches can be computationally infeasible. In this article, we consider a variational inference approach for such a model. We derive variational approximations for estimating the parameters and we compare the approximations to MCMC using both simulated and real data. The new inferential framework we propose is on average two times faster whilst maintaining the same predictive performance as MCMC.
    MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models. (arXiv:2207.00056v1 [cs.LG])
    The promise of multimodal models for real-world applications has inspired research in visualizing and understanding their internal mechanics with the end goal of empowering stakeholders to visualize model behavior, perform model debugging, and promote trust in machine learning models. However, modern multimodal models are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize the internal modeling of multimodal interactions in these models? Our paper aims to fill this gap by proposing MultiViz, a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages: (1) unimodal importance: how each modality contributes towards downstream modeling and prediction, (2) cross-modal interactions: how different modalities relate with each other, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction. MultiViz is designed to operate on diverse modalities, models, tasks, and research areas. Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community.
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v3 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by an ML method. We construct this measure via the conditional entropy of predictions, given user feedback. The user feedback might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.
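    Since the measure is built from a conditional entropy, it can be estimated directly from paired samples of model predictions and discretized user feedback; the sketch below does this with empirical counts. The variable names and the plug-in estimator are illustrative assumptions of this sketch, not the paper's estimator.

        import numpy as np

        # Plug-in estimate of H(yhat | u) in nats: lower values mean the user
        # feedback u "explains" more of the prediction yhat.
        def conditional_entropy(yhat, u):
            joint, pu, n = {}, {}, len(yhat)
            for a, b in zip(yhat, u):
                joint[(a, b)] = joint.get((a, b), 0) + 1
                pu[b] = pu.get(b, 0) + 1
            H = 0.0
            for (a, b), c in joint.items():
                H -= (c / n) * np.log(c / pu[b])   # -p(a,b) log p(a|b)
            return H

        print(conditional_entropy([0, 0, 1, 1], [0, 0, 0, 1]))  # ~0.477 nats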
    Discriminator-Guided Model-Based Offline Imitation Learning. (arXiv:2207.00244v1 [cs.LG])
    Offline imitation learning (IL) is a powerful method to solve decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degeneration under limited expert data due to covariate shift. Including a learned dynamics model can potentially improve the state-action space coverage of expert data, however, it also faces challenging issues like model approximation/generalization errors and suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning process of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case when demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods under small datasets.
    Transferable Graph Backdoor Attack. (arXiv:2207.00425v1 [cs.CR])
    Graph Neural Networks (GNNs) have achieved tremendous success in many graph mining tasks, benefitting from the message passing strategy that fuses local structure and node features for better graph representation learning. Despite the excellent performance of GNNs, and similar to other types of deep neural networks, their robustness is unsatisfactory. It has been shown by many works that GNNs are vulnerable to unnoticeable perturbations on both graph structure and node features. Many adversarial attacks have been proposed to expose the fragility of GNNs under different perturbation strategies used to create adversarial examples. However, less work has been done on the vulnerability of GNNs under backdoor attack. To fill this gap, in this paper we present GHAT, a transferable GrapH bAckdoor aTtack. The core principle of GHAT is to poison the training dataset with perturbation triggers that lead to an effective and transferable backdoor attack. The perturbation trigger for a graph is generated by performing perturbation actions on the graph structure via a gradient-based score matrix. Compared with prior works, GHAT differs in several ways: it exploits a surrogate GCN model to generate perturbation triggers for black-box backdoor attacks; it generates sample-specific perturbation triggers which do not have a fixed pattern; and GHAT's attack is transferable to different GNN models trained on the poisoned training dataset forged by GHAT. Through extensive evaluation on four real-world datasets, we demonstrate that GHAT achieves much better attack effectiveness for transferable backdoor attacks on GNNs.
    Threat Assessment in Machine Learning based Systems. (arXiv:2207.00091v1 [cs.CR])
    Machine learning is a field of artificial intelligence (AI) that is becoming essential for several critical systems, making it a good target for threat actors. Threat actors exploit different Tactics, Techniques, and Procedures (TTPs) against the confidentiality, integrity, and availability of Machine Learning (ML) systems. During the ML cycle, they exploit adversarial TTPs to poison data and fool ML-based systems. In recent years, multiple security practices have been proposed for traditional systems, but they are not enough to cope with the nature of ML-based systems. In this paper, we conduct an empirical study of threats reported against ML-based systems with the aim of understanding and characterizing the nature of ML threats and identifying common mitigation strategies. The study is based on 89 real-world ML attack scenarios from MITRE's ATLAS database, the AI Incident Database, and the literature, together with 854 ML repositories from a GitHub search and the Python Packaging Advisory database, selected based on their reputation. Attacks from the AI Incident Database and the literature are used to identify vulnerabilities and new types of threats that were not documented in ATLAS. Results show that convolutional neural networks were among the most targeted models in the attack scenarios. The ML repositories with the largest vulnerability prominence include TensorFlow, OpenCV, and Notebook. We also report the most frequent vulnerabilities in the studied ML repositories, the most targeted ML phases and models, and the most used TTPs in ML phases and attack scenarios. This information is particularly important for red/blue teams to better conduct attacks/defenses, for practitioners to prevent threats during ML development, and for researchers to develop efficient defense mechanisms.
    Discrimination in machine learning algorithms. (arXiv:2207.00108v1 [stat.ML])
    Machine learning algorithms are routinely used for business decisions that may directly affect individuals, for example, because a credit scoring algorithm refuses them a loan. It is then relevant from an ethical (and legal) point of view to ensure that these algorithms do not discriminate based on sensitive attributes (like sex or race), which may occur unwittingly and unknowingly by the operator and the management. Statistical tools and methods are then required to detect and eliminate such potential biases.
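    One of the simplest statistical diagnostics in this toolbox is the demographic-parity gap: the difference in positive-decision rates between groups defined by a sensitive attribute. The sketch below computes it from predictions; the function name and the toy data are illustrative, and what counts as an acceptable gap is context- and law-dependent.

        import numpy as np

        # Demographic-parity gap: max difference in positive-decision rates
        # across groups of a sensitive attribute. A sketch, not legal advice.
        def demographic_parity_gap(y_pred, sensitive):
            rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
            return max(rates) - min(rates)

        y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # e.g., loan granted?
        sex = np.array([0, 0, 0, 0, 1, 1, 1, 1])
        print(demographic_parity_gap(y_pred, sex))    # 0.75 - 0.25 = 0.5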
    Rethinking Optimization with Differentiable Simulation from a Global Perspective. (arXiv:2207.00167v1 [stat.ML])
    Differentiable simulation is a promising toolkit for fast gradient-based policy optimization and system identification. However, existing approaches to differentiable simulation have largely tackled scenarios where obtaining smooth gradients has been relatively easy, such as systems with mostly smooth dynamics. In this work, we study the challenges that differentiable simulation presents when it is not feasible to expect that a single gradient descent run reaches a global optimum, which is often the case in contact-rich scenarios. We analyze the optimization landscapes of diverse scenarios that contain both rigid bodies and deformable objects. In dynamic environments with highly deformable objects and fluids, differentiable simulators produce rugged landscapes with nonetheless useful gradients in some parts of the space. We propose a method that combines Bayesian optimization with semi-local 'leaps' to obtain a global search method that can use gradients effectively, while also maintaining robust performance in regions with noisy gradients. We show that our approach outperforms several gradient-based and gradient-free baselines on an extensive set of experiments in simulation, and also validate the method using experiments with a real robot and deformables. Videos and supplementary materials are available at https://tinyurl.com/globdiff
    Image features of a splashing drop on a solid surface extracted using a feedforward neural network. (arXiv:2201.09541v1 [physics.flu-dyn] CROSS LISTED)
    This article reports a nonintuitive characteristic of a splashing drop on a solid surface, discovered by extracting image features using a feedforward neural network (FNN). Ethanol drops of area-equivalent radius of about 1.29 mm were released from impact heights ranging from 4 cm to 60 cm (splashing threshold 20 cm) onto a hydrophilic surface. The images captured when half of the drop had impacted the surface were labeled according to their outcome, splashing or nonsplashing, and were used to train an FNN. A classification accuracy higher than 96% was achieved. To extract the image features identified by the FNN for classification, the weight matrix of the trained FNN for identifying splashing drops was visualized. Remarkably, the visualization showed that the trained FNN identified the contour height of the main body of the impacting drop as an important characteristic differentiating between splashing and nonsplashing drops, which had not been reported in previous studies. This feature was found throughout the impact, even when one and three-quarters of the drop had impacted the surface. To confirm the importance of this image feature, the FNN was retrained to classify using only the main body without checking for the presence of ejected secondary droplets. The accuracy was still higher than 82%, confirming that the contour height is an important feature distinguishing splashing from nonsplashing drops. Several aspects of drop impact are analyzed and discussed with the aim of identifying the possible mechanism underlying the difference in contour height between splashing and nonsplashing drops.
    Robust Bayesian Learning for Reliable Wireless AI: Framework and Applications. (arXiv:2207.00300v1 [cs.LG])
    This work takes a critical look at the application of conventional machine learning methods to wireless communication problems through the lens of reliability and robustness. Deep learning techniques adopt a frequentist framework, and are known to provide poorly calibrated decisions that do not reproduce the true uncertainty caused by limitations in the size of the training data. Bayesian learning, while in principle capable of addressing this shortcoming, is in practice impaired by model misspecification and by the presence of outliers. Both problems are pervasive in wireless communication settings, in which the capacity of machine learning models is subject to resource constraints and training data is affected by noise and interference. In this context, we explore the application of the framework of robust Bayesian learning. After a tutorial-style introduction to robust Bayesian learning, we showcase the merits of robust Bayesian learning on several important wireless communication problems in terms of accuracy, calibration, and robustness to outliers and misspecification.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v1 [cs.CR])
    Recent works show that random neural networks are vulnerable against adversarial attacks [Daniely and Schacham, 2020] and that such attacks can be easily found using a single step of gradient descent [Bubeck et al., 2021]. In this work, we take it one step further and show that a single gradient step can find adversarial examples for networks trained in the so-called lazy regime. This regime is interesting because even though the neural network weights remain close to the initialization, there exist networks with small generalization error, which can be found efficiently using first-order methods. Our work challenges the model of the lazy regime, the dominant regime in which neural networks are provably efficiently learnable. We show that the networks trained in this regime, even though they enjoy good theoretical computational guarantees, remain vulnerable to adversarial examples. To the best of our knowledge, this is the first work to prove that such well-generalizable neural networks are still vulnerable to adversarial attacks.
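    The attack in question is a single signed-gradient step on the input, i.e. the FGSM-style update; the sketch below shows it in PyTorch. The two-layer network is a stand-in for a lazily trained model, and the budget eps is an illustrative choice.

        import torch
        import torch.nn as nn

        # One signed gradient step on the input (FGSM-style); the model here is
        # just a stand-in for a network trained in the lazy regime.
        def single_step_attack(model, x, y, eps=0.03):
            x = x.clone().requires_grad_(True)
            loss = nn.CrossEntropyLoss()(model(x), y)
            loss.backward()
            return (x + eps * x.grad.sign()).detach()   # adversarial example

        model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
        x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
        x_adv = single_step_attack(model, x, y)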
    DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. (arXiv:2207.00032v1 [cs.LG])
    The past several years have witnessed the success of transformer-based models, and their scale and application scenarios continue to grow aggressively. The current landscape of transformer models is increasingly diverse: the model size varies drastically, with the largest being of hundred-billion parameters; the model characteristics differ due to the sparsity introduced by the Mixture-of-Experts; the target application scenarios can be latency-critical or throughput-oriented; the deployment hardware could be single- or multi-GPU systems with different types of memory and storage, etc. With such increasing diversity and the fast-evolving pace of transformer models, designing a highly performant and efficient inference system is extremely challenging. In this paper, we present DeepSpeed Inference, a comprehensive system solution for transformer model inference to address the above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU inference solution to minimize latency while maximizing the throughput of both dense and sparse transformer models when they fit in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU and NVMe memory in addition to GPU memory and compute to enable high inference throughput with large models which do not fit in aggregate GPU memory. DeepSpeed Inference reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios. Moreover, it enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on models 25x larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over $50\%$ of the A6000 peak).
    Automated Quantum Circuit Design with Nested Monte Carlo Tree Search. (arXiv:2207.00132v1 [quant-ph])
    Quantum algorithms based on variational approaches are one of the most promising methods to construct quantum solutions and have found a myriad of applications in the last few years. Despite their adaptability and simplicity, their scalability and the selection of suitable ans\"atze remain key challenges. In this work, we report an algorithmic framework based on nested Monte-Carlo Tree Search (MCTS) coupled with the combinatorial multi-armed bandit (CMAB) model for the automated design of quantum circuits. Through numerical experiments, we demonstrate our algorithm applied to various kinds of problems, including the ground energy problem in quantum chemistry, quantum optimisation on a graph, solving systems of linear equations, and finding encoding circuits for quantum error detection codes. Compared to the existing approaches, the results indicate that our circuit design algorithm can explore larger search spaces and optimise quantum circuits for larger systems, showing both versatility and scalability.
    Class Impression for Data-free Incremental Learning. (arXiv:2207.00005v1 [cs.CV])
    Standard deep learning-based classification approaches require collecting all samples from all classes in advance and are trained offline. This paradigm may not be practical in real-world clinical applications, where new classes are incrementally introduced through the addition of new data. Class incremental learning is a strategy allowing learning from such data. However, a major challenge is catastrophic forgetting, i.e., performance degradation on previous classes when adapting a trained model to new data. Prior methodologies that alleviate this challenge by saving a portion of the training data require perpetual storage of such data, which may introduce privacy issues. Here, we propose a novel data-free class incremental learning framework that first synthesizes data from the model trained on previous classes to generate a class impression. Subsequently, it updates the model by combining the synthesized data with new class data. Furthermore, we incorporate a cosine-normalized cross-entropy loss to mitigate the adverse effects of class imbalance, a margin loss to increase separation between previous and new classes, and an intra-domain contrastive loss to generalize the model trained on the synthesized data to real data. We compare our proposed framework with state-of-the-art methods in class incremental learning, where we demonstrate improvement in accuracy for the classification of 11,062 echocardiography cine series of patients.
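    Of the three losses, the cosine-normalized cross-entropy is the most self-contained to sketch: logits become rescaled cosine similarities between L2-normalized features and class prototypes, which removes the magnitude imbalance between old and new classes. The scale factor and the prototype-based formulation below are assumptions of this sketch.

        import torch
        import torch.nn.functional as F

        # Cosine-normalized cross-entropy: logits are cosine similarities between
        # normalized features and normalized class prototypes, rescaled by s.
        def cosine_ce_loss(features, prototypes, labels, s=16.0):
            f = F.normalize(features, dim=1)       # (batch, dim)
            w = F.normalize(prototypes, dim=1)     # (n_classes, dim)
            logits = s * f @ w.t()                 # cosine similarities, rescaled
            return F.cross_entropy(logits, labels)

        features = torch.randn(32, 128)
        prototypes = torch.randn(10, 128, requires_grad=True)
        loss = cosine_ce_loss(features, prototypes, torch.randint(0, 10, (32,)))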
    LaserMix for Semi-Supervised LiDAR Semantic Segmentation. (arXiv:2207.00026v1 [cs.CV])
    Densely annotating LiDAR point clouds is costly, which restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans, and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties: 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results over fully-supervised counterparts with 2x to 5x fewer labels and improve the supervised-only baseline significantly by 10.8% on average. We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code will be publicly available.
    Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml. (arXiv:2207.00559v1 [cs.LG])
    Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers -- long short-term memory and gated recurrent unit -- within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.
    DeepOPF: A Feasibility-Optimized Deep Neural Network Approach for AC Optimal Power Flow Problems. (arXiv:2007.01002v6 [eess.SY] UPDATED)
    High percentage penetrations of renewable energy generation introduce significant uncertainty into power systems. This requires grid operators to solve alternating current optimal power flow (AC-OPF) problems more frequently for economical and reliable operation in both transmission and distribution grids. In this paper, we develop a Deep Neural Network (DNN) approach, called DeepOPF, for solving AC-OPF problems in a fraction of the time used by conventional solvers. A key difficulty in applying machine learning techniques to AC-OPF problems lies in ensuring that the obtained solutions respect the equality and inequality physical and operational constraints. Generalizing the 2-stage procedure in [1], [2], DeepOPF first trains a DNN model to predict a set of independent operating variables and then directly computes the remaining dependent variables by solving the power flow equations. Such an approach not only preserves the power-flow balance equality constraints but also reduces the number of variables to predict by the DNN, cutting down the number of neurons and training data needed. DeepOPF then employs a penalty approach with a zero-order gradient estimation technique in the training process to preserve the remaining inequality constraints. As another contribution, we derive a condition for tuning the size of the DNN according to the desired approximation accuracy, which measures the DNN's generalization capability. It provides theoretical justification for using DNNs to solve the AC-OPF problem. Simulation results for IEEE 30/118/300-bus test cases and a synthetic 2000-bus test case show that DeepOPF speeds up the computing time by up to two orders of magnitude as compared to a state-of-the-art solver, at the expense of $<$0.1% cost difference.
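    The zero-order step is the part that is easy to miss: because the penalty on inequality violations is evaluated through a power-flow solve, its gradient is estimated with finite differences rather than backpropagation. The sketch below shows a generic two-point estimator of that kind; the toy penalty, the number of random directions, and the smoothing radius mu are illustrative assumptions.

        import numpy as np

        # Two-point zero-order gradient estimate of a black-box penalty function,
        # averaged over random directions. The penalty here is a toy stand-in for
        # the inequality-violation term evaluated through a power-flow solve.
        def zero_order_grad(penalty, z, n_dirs=10, mu=1e-4):
            g = np.zeros_like(z)
            base = penalty(z)
            for _ in range(n_dirs):
                u = np.random.normal(size=z.shape)
                g += (penalty(z + mu * u) - base) / mu * u
            return g / n_dirs

        penalty = lambda z: np.sum(np.maximum(z - 1.0, 0.0) ** 2)  # toy violations
        z = np.array([0.5, 1.2, 0.9])
        print(zero_order_grad(penalty, z))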
    Training Novices: The Role of Human-AI Collaboration and Knowledge Transfer. (arXiv:2207.00497v1 [cs.HC])
    Across a multitude of work environments, expert knowledge is imperative for humans to conduct tasks with high performance and ensure business success. These humans possess task-specific expert knowledge (TSEK) and hence, represent subject matter experts (SMEs). However, not only demographic changes but also personnel downsizing strategies lead and will continue to lead to departures of SMEs within organizations, which constitutes the challenge of how to retain that expert knowledge and train novices to keep the competitive advantage elicited by that expert knowledge. SMEs training novices is time- and cost-intensive, which intensifies the need for alternatives. Human-AI collaboration (HAIC) poses a way out of this dilemma, facilitating alternatives to preserve expert knowledge and teach it to novices for tasks conducted by SMEs beforehand. In this workshop paper, we (1) propose a framework on how HAIC can be utilized to train novices on particular tasks, (2) illustrate the role of explicit and tacit knowledge in this training process via HAIC, and (3) outline a preliminary experiment design to assess the ability of AI systems in HAIC to act as a trainer to transfer TSEK to novices who do not possess prior TSEK.
    CVLight: Decentralized Learning for Adaptive Traffic Signal Control with Connected Vehicles. (arXiv:2104.10340v3 [cs.LG] UPDATED)
    This paper develops a decentralized reinforcement learning (RL) scheme for multi-intersection adaptive traffic signal control (TSC), called "CVLight", that leverages data collected from connected vehicles (CVs). The state and reward design facilitates coordination among agents and considers travel delays collected by CVs. A novel algorithm, Asymmetric Advantage Actor-critic (Asym-A2C), is proposed, where both CV and non-CV information is used to train the critic network, while only CV information is used to execute optimal signal timing. Comprehensive experiments show the superiority of CVLight over state-of-the-art algorithms under a 2-by-2 synthetic road network with various traffic demand patterns and penetration rates. The learned policy is then visualized to further demonstrate the advantage of Asym-A2C. A pre-training technique is applied to improve the scalability of CVLight, which significantly shortens the training time and shows a performance advantage under a 5-by-5 road network. A case study is performed on a 2-by-2 road network located in State College, Pennsylvania, USA, to further demonstrate the effectiveness of the proposed algorithm under real-world scenarios. Compared to other baseline models, the trained CVLight agent can efficiently control multiple intersections solely based on CV data and achieves the best performance, especially under low CV penetration rates.
    Unified Source-Filter GAN with Harmonic-plus-Noise Source Excitation Generation. (arXiv:2205.06053v2 [cs.SD] UPDATED)
    This paper introduces a unified source-filter network with a harmonic-plus-noise source excitation generation mechanism. In our previous work, we proposed unified Source-Filter GAN (uSFGAN) for developing a high-fidelity neural vocoder with flexible voice controllability using a unified source-filter neural network architecture. However, the capability of uSFGAN to model the aperiodic source excitation signal is insufficient, and there is still a gap in sound quality between the natural and generated speech. To improve the source excitation modeling and generated sound quality, a new source excitation generation network separately generating periodic and aperiodic components is proposed. The advanced adversarial training procedure of HiFiGAN is also adopted to replace that of Parallel WaveGAN used in the original uSFGAN. Both objective and subjective evaluation results show that the modified uSFGAN significantly improves the sound quality of the basic uSFGAN while maintaining the voice controllability.
    Asynchronous Distributed Bayesian Optimization at HPC Scale. (arXiv:2207.00479v1 [cs.LG])
    Bayesian optimization (BO) is a widely used approach for computationally expensive black-box optimization tasks such as simulator calibration and hyperparameter optimization of deep learning methods. In BO, a dynamically updated, computationally cheap surrogate model is employed to learn the input-output relationship of the black-box function; this surrogate model is used to explore and exploit the promising regions of the input space. Multipoint BO methods adopt a single-manager/multiple-workers strategy to achieve high-quality solutions in a shorter time. However, the computational overhead in multipoint generation schemes is a major bottleneck in designing BO methods that can scale to thousands of workers. We present an asynchronous-distributed BO (ADBO) method wherein each worker runs a search and asynchronously communicates the input-output values of its black-box evaluations with all other workers, without a manager. We scale our method up to 4,096 workers and demonstrate improvement in the quality of the solution and faster convergence. We demonstrate the effectiveness of our approach for tuning the hyperparameters of neural networks from the Exascale Computing Project CANDLE benchmarks.
    Latent Gaussian Model Boosting. (arXiv:2105.08966v5 [cs.LG] UPDATED)
    Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.
    A Neural-embedded Choice Model: TasteNet-MNL Modeling Taste Heterogeneity with Flexibility and Interpretability. (arXiv:2002.00922v2 [econ.EM] UPDATED)
    Discrete choice models (DCMs) require a priori knowledge of the utility functions, especially how tastes vary across individuals. Utility misspecification may lead to biased estimates, inaccurate interpretations and limited predictability. In this paper, we utilize a neural network to learn taste representation. Our formulation consists of two modules: a neural network (TasteNet) that learns taste parameters (e.g., time coefficient) as flexible functions of individual characteristics; and a multinomial logit (MNL) model with utility functions defined with expert knowledge. Taste parameters learned by the neural network are fed into the choice model and link the two modules. Our approach extends the L-MNL model (Sifringer et al., 2020) by allowing the neural network to learn the interactions between individual characteristics and alternative attributes. Moreover, we formalize and strengthen the interpretability condition - requiring realistic estimates of behavior indicators (e.g., value-of-time, elasticity) at the disaggregated level, which is crucial for a model to be suitable for scenario analysis and policy decisions. Through a unique network architecture and parameter transformation, we incorporate prior knowledge and guide the neural network to output realistic behavior indicators at the disaggregated level. We show that TasteNet-MNL reaches the ground-truth model's predictability and recovers the nonlinear taste functions on synthetic data. Its estimated value-of-time and choice elasticities at the individual level are close to the ground truth. On a publicly available Swissmetro dataset, TasteNet-MNL outperforms benchmarking MNLs and Mixed Logit model's predictability. It learns a broader spectrum of taste variations within the population and suggests a higher average value-of-time.
    The Fragility of Noise Estimation in Kalman Filter: Optimization Can Handle Model-Misspecification. (arXiv:2104.02372v4 [cs.LG] UPDATED)
    The Kalman Filter (KF) parameters are traditionally determined by noise estimation, since under the KF assumptions, the state prediction errors are minimized when the parameters correspond to the noise covariance. However, noise estimation remains the gold standard regardless of the assumptions, even when it is not equivalent to error minimization. We demonstrate that even seemingly simple problems may include multiple assumption violations, which are sometimes hard to even notice. We show theoretically and empirically that even a minor violation may largely shift the optimal parameters. We propose a gradient-based method along with the Cholesky parameterization to explicitly optimize the state prediction errors. We show consistent improvement over noise estimation in tens of experiments across 3 different domains. Finally, we demonstrate that optimization makes the KF competitive with an LSTM model, even in nonlinear problems.
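    The recipe is short enough to sketch end to end: make the noise scales learnable through a positivity-preserving (Cholesky-style) parameterization and descend on the accumulated one-step state-prediction error with autograd. The scalar random-walk filter, the log-scale parameterization, and all constants below are assumptions chosen to keep the sketch minimal.

        import torch

        # Optimize KF noise parameters by minimizing one-step prediction errors,
        # instead of estimating the noise statistics. Scalar random-walk model.
        torch.manual_seed(0)
        z = torch.cumsum(torch.randn(200), 0) + 0.5 * torch.randn(200)  # observations

        log_q = torch.zeros((), requires_grad=True)   # process-noise scale (log-Cholesky)
        log_r = torch.zeros((), requires_grad=True)   # observation-noise scale
        opt = torch.optim.Adam([log_q, log_r], lr=0.05)

        for _ in range(50):
            q, r = log_q.exp() ** 2, log_r.exp() ** 2  # positive by construction
            x, p = torch.zeros(()), torch.ones(())
            loss = torch.zeros(())
            for zt in z:
                p = p + q                              # predict (random-walk dynamics)
                loss = loss + (zt - x) ** 2            # one-step state-prediction error
                k = p / (p + r)                        # Kalman gain, then update
                x = x + k * (zt - x)
                p = (1 - k) * p
            opt.zero_grad(); loss.backward(); opt.step()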
    Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning. (arXiv:2105.14074v3 [cs.AI] UPDATED)
    In robotic domains, learning and planning are complicated by continuous state spaces, continuous action spaces, and long task horizons. In this work, we address these challenges with Neuro-Symbolic Relational Transition Models (NSRTs), a novel class of models that are data-efficient to learn, compatible with powerful robotic planning methods, and generalizable over objects. NSRTs have both symbolic and neural components, enabling a bilevel planning scheme where symbolic AI planning in an outer loop guides continuous planning with neural models in an inner loop. Experiments in four robotic planning domains show that NSRTs can be learned after only tens or hundreds of training episodes, and then used for fast planning in new tasks that require up to 60 actions and involve many more objects than were seen during training. Video: https://tinyurl.com/chitnis-nsrts
    Deep Learning and Symbolic Regression for Discovering Parametric Equations. (arXiv:2207.00529v1 [cs.LG])
    Symbolic regression is a machine learning technique that can learn the governing formulas of data and thus has the potential to transform scientific discovery. However, symbolic regression is still limited in the complexity and dimensionality of the systems that it can analyze. Deep learning, on the other hand, has transformed machine learning in its ability to analyze extremely complex and high-dimensional datasets. We propose a neural network architecture to extend symbolic regression to parametric systems where some coefficients may vary but the structure of the underlying governing equation remains constant. We demonstrate our method on various analytic expressions, ODEs, and PDEs with varying coefficients and show that it extrapolates well outside of the training domain. The neural network-based architecture can also be integrated with other deep learning architectures so that it can analyze high-dimensional data while being trained end-to-end. To this end, we integrate our architecture with convolutional neural networks to analyze 1D images of varying spring systems.
    Language model compression with weighted low-rank factorization. (arXiv:2207.00112v1 [cs.LG])
    Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error for parameters that affect task accuracy more. In other words, the optimization objective of SVD is not aligned with the trained model's task accuracy. We analyze this previously unexplored problem, make observations, and address it by introducing Fisher information to weigh the importance of parameters affecting the model prediction. This idea leads to our method: Fisher-Weighted SVD (FWSVD). Although the factorized matrices from our approach do not result in smaller reconstruction errors, we find that our resulting task accuracy is much closer to the original model's performance. We perform analysis with transformer-based language models, showing our weighted SVD largely alleviates the mismatched optimization objectives and can maintain model performance at a higher compression rate. Our method can directly compress a task-specific model while achieving better performance than other compact model strategies requiring expensive model pre-training. Moreover, the evaluation of compressing an already compact model shows that our method can further reduce parameters by 9% to 30% with an insignificant impact on task accuracy.
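    For one weight matrix, the idea reduces to three steps: scale the rows by an importance weight derived from (an empirical proxy of) Fisher information, factorize the scaled matrix with SVD, and undo the scaling on the left factor. The row-wise weighting granularity, the squared-gradient Fisher proxy, and the rank below are illustrative assumptions of this sketch.

        import torch

        # Fisher-weighted low-rank factorization sketch: W is approximated by A @ B,
        # with rows weighted by an empirical Fisher proxy before the SVD.
        def fwsvd(W, fisher_diag, rank):
            w = torch.sqrt(fisher_diag.sum(dim=1, keepdim=True)) + 1e-8  # row importance
            U, S, Vh = torch.linalg.svd(w * W, full_matrices=False)
            A = (U[:, :rank] * S[:rank]) / w   # undo the weighting on the left factor
            B = Vh[:rank]
            return A, B                        # W ~= A @ B with far fewer parameters

        W = torch.randn(768, 768)
        fisher_diag = torch.rand_like(W)       # in practice: accumulate grad**2 over batches
        A, B = fwsvd(W, fisher_diag, rank=64)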
    A Convergent and Dimension-Independent Min-Max Optimization Algorithm. (arXiv:2006.12376v6 [cs.LG] UPDATED)
    We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and a stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse.
    Eccentric Regularization: Minimizing Hyperspherical Energy without explicit projection. (arXiv:2104.11610v2 [cs.LG] UPDATED)
    Several regularization methods have recently been introduced which force the latent activations of an autoencoder or deep neural network to conform to either a Gaussian or hyperspherical distribution, or to minimize the implicit rank of the distribution in latent space. In the present work, we introduce a novel regularizing loss function which simulates a pairwise repulsive force between items and an attractive force of each item toward the origin. We show that minimizing this loss function in isolation achieves a hyperspherical distribution. Moreover, when used as a regularizing term, the scaling factor can be adjusted to allow greater flexibility and tolerance of eccentricity, thus allowing the latent variables to be stratified according to their relative importance, while still promoting diversity. We apply this method of Eccentric Regularization to an autoencoder, and demonstrate its effectiveness in image generation, representation learning and downstream classification tasks.
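    The loss itself is compact: an attractive quadratic pull of each latent code toward the origin plus a mean pairwise repulsion. The inverse-distance force law and the weighting factor lam in the sketch below are illustrative assumptions; the paper's scaling factor is what controls the tolerated eccentricity.

        import torch

        # Sketch of an eccentric-style regularizer: pull each latent code toward
        # the origin while all pairs of codes repel each other.
        def eccentric_loss(z, lam=1.0, eps=1e-8):
            attract = (z ** 2).sum(dim=1).mean()              # pull toward origin
            diff = z.unsqueeze(0) - z.unsqueeze(1)            # (n, n, dim) pairwise
            d = torch.sqrt((diff ** 2).sum(-1) + eps)         # pairwise distances
            off_diag = ~torch.eye(z.shape[0], dtype=torch.bool)
            repel = (1.0 / d[off_diag]).mean()                # pairwise repulsion
            return attract + lam * repel

        z = torch.randn(128, 32, requires_grad=True)          # batch of latent codes
        eccentric_loss(z).backward()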
    Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise. (arXiv:2106.05958v2 [math.OC] UPDATED)
    Stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds with the dependence on the confidence level that is either negative-power or logarithmic but under an additional assumption of sub-Gaussian (light-tailed) noise distribution that may not hold in practice. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with H\"older-continuous gradients, and for both methods, we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all the regimes, and the second one is optimal in the non-smooth setting.
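    Both methods rest on the same primitive, a stochastic gradient step with norm clipping, sketched below. The fixed threshold and stepsize are placeholders; the paper's contribution is precisely the schedules for these quantities that yield logarithmic dependence on the confidence level under heavy-tailed noise.

        import torch

        # Clipped stochastic gradient step: rescale the gradient to a norm budget
        # before stepping, which tames heavy-tailed gradient noise.
        def clipped_sgd_step(params, loss, lr=0.01, clip=1.0):
            grads = torch.autograd.grad(loss, params)
            norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            scale = min(1.0, clip / (norm.item() + 1e-12))
            with torch.no_grad():
                for p, g in zip(params, grads):
                    p -= lr * scale * g

        w = torch.randn(10, requires_grad=True)
        loss = ((w - 1.0) ** 2).sum() + 0.1 * torch.randn(()) * w.sum()  # noisy objective
        clipped_sgd_step([w], loss)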
    Border basis computation with gradient-weighted normalization. (arXiv:2101.00401v4 [cs.SC] UPDATED)
    Normalization of polynomials plays a vital role in the approximate basis computation of vanishing ideals. Coefficient normalization, which normalizes a polynomial with its coefficient norm, is the most common method in computer algebra. This study proposes the gradient-weighted normalization method for the approximate border basis computation of vanishing ideals, inspired by recent developments in machine learning. The data-dependent nature of gradient-weighted normalization leads to better stability against perturbation and consistency in the scaling of input points, which cannot be attained by coefficient normalization. Only a subtle change is needed to introduce gradient normalization in the existing algorithms with coefficient normalization. The analysis of algorithms still works with a small modification, and the order of magnitude of time complexity of algorithms remains unchanged. We also prove that, with coefficient normalization, which does not provide the scaling consistency property, scaling of points (e.g., as a preprocessing) can cause an approximate basis computation to fail. This study is the first to theoretically highlight the crucial effect of scaling in approximate basis computation and presents the utility of data-dependent normalization.
    Privacy-preserving Graph Analytics: Secure Generation and Federated Learning. (arXiv:2207.00048v1 [cs.CR])
    Directly motivated by security-related applications from the Homeland Security Enterprise, we focus on the privacy-preserving analysis of graph data, which provides the crucial capacity to represent rich attributes and relationships. In particular, we discuss two directions, namely privacy-preserving graph generation and federated graph learning, which can jointly enable the collaboration among multiple parties each possessing private graph data. For each direction, we identify both "quick wins" and "hard problems". Towards the end, we demonstrate a user interface that can facilitate model explanation, interpretation, and visualization. We believe that the techniques developed in these directions will significantly enhance the capabilities of the Homeland Security Enterprise to tackle and mitigate the various security risks.
    Generating Counterfactual Hard Negative Samples for Graph Contrastive Learning. (arXiv:2207.00148v1 [cs.LG])
    Graph contrastive learning has emerged as a powerful tool for unsupervised graph representation learning. The key to the success of graph contrastive learning is to acquire high-quality positive and negative samples as contrasting pairs, in order to learn the underlying structural semantics of the input graph. Recent works usually sample negative samples from the same training batch as the positive samples, or from an external irrelevant graph. However, such strategies have a significant limitation: the unavoidable problem of sampling false negative samples. In this paper, we propose a novel method that utilizes a \textbf{C}ounterfactual mechanism to generate artificial hard negative samples for \textbf{G}raph \textbf{C}ontrastive learning, namely \textbf{CGC}, which takes a different perspective from those sampling-based strategies. We utilize the counterfactual mechanism to produce hard negative samples, which ensures that the generated samples are similar to, but have labels that differ from, the positive sample. The proposed method achieves satisfying results on several datasets compared to some traditional unsupervised graph learning methods and some SOTA graph contrastive learning methods. We also conduct some supplementary experiments to give an extensive illustration of the proposed method, including the performance of CGC with different hard negative samples and evaluations of hard negative samples generated with different similarity measurements.
    Robustness of Epinets against Distributional Shifts. (arXiv:2207.00137v1 [cs.LG])
    Recent work introduced the epinet as a new approach to uncertainty modeling in deep learning. An epinet is a small neural network added to traditional neural networks, which, together, can produce predictive distributions. In particular, using an epinet can greatly improve the quality of joint predictions across multiple inputs, a measure of how well a neural network knows what it does not know. In this paper, we examine whether epinets can offer similar advantages under distributional shifts. We find that, across ImageNet-A/O/C, epinets generally improve robustness metrics. Moreover, these improvements are more significant than those afforded by even very large ensembles at orders of magnitude lower computational costs. However, these improvements are relatively small compared to the outstanding issues in distributionally-robust deep learning. Epinets may be a useful tool in the toolbox, but they are far from the complete solution.
    Anisotropic, Sparse and Interpretable Physics-Informed Neural Networks for PDEs. (arXiv:2207.00377v1 [cs.LG])
    There has been a growing interest in the use of Deep Neural Networks (DNNs) to solve Partial Differential Equations (PDEs). Despite the promise that such approaches hold, there are various aspects where they could be improved. Two such shortcomings are (i) their computational inefficiency relative to classical numerical methods, and (ii) the non-interpretability of a trained DNN model. In this work we present ASPINN, an anisotropic extension of our earlier work called SPINN--Sparse, Physics-informed, and Interpretable Neural Networks--to solve PDEs that addresses both these issues. ASPINNs generalize radial basis function networks. We demonstrate, using a variety of examples involving elliptic and hyperbolic PDEs, that the special architecture we propose is more efficient than generic DNNs, while at the same time being directly interpretable. Further, they improve upon the SPINN models we proposed earlier in that fewer nodes are required to capture the solution using ASPINN than using SPINN, thanks to the anisotropy of the local zones of influence of each node. The interpretability of ASPINN translates to a ready visualization of their weights and biases, thereby yielding more insight into the nature of the trained model. This in turn provides a systematic procedure to improve the architecture based on the quality of the computed solution. ASPINNs thus serve as an effective bridge between classical numerical algorithms and modern DNN-based methods to solve PDEs. In the process, we also streamline the training of ASPINNs into a form that is closer to that of supervised learning algorithms.
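    For readers unfamiliar with the physics-informed part of the training objective, the generic recipe is sketched below; this is the baseline PINN loss, not the ASPINN architecture itself. The network is trained so that the PDE residual vanishes at collocation points, plus a boundary term; the toy problem u'' = f on (0,1) with zero boundary values is an assumption of this sketch.

        import torch

        # Generic physics-informed loss: PDE residual at collocation points plus a
        # boundary penalty, for u'' = f with u(0) = u(1) = 0 (solution sin(pi x)).
        net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
        x = torch.rand(256, 1, requires_grad=True)            # interior collocation points
        f = -torch.sin(torch.pi * x) * torch.pi ** 2

        u = net(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
        xb = torch.tensor([[0.0], [1.0]])                     # boundary points
        loss = ((d2u - f) ** 2).mean() + (net(xb) ** 2).mean()
        loss.backward()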
    An AO-ADMM approach to constraining PARAFAC2 on all modes. (arXiv:2110.01278v2 [cs.LG] UPDATED)
    Analyzing multi-way measurements with variations across one mode of the dataset is a challenge in various fields including data mining, neuroscience and chemometrics. For example, measurements may evolve over time or have unaligned time profiles. The PARAFAC2 model has been successfully used to analyze such data by allowing the underlying factor matrices in one mode (i.e., the evolving mode) to change across slices. The traditional approach to fit a PARAFAC2 model is to use an alternating least squares-based algorithm, which handles the constant cross-product constraint of the PARAFAC2 model by implicitly estimating the evolving factor matrices. This approach makes imposing regularization on these factor matrices challenging. There is currently no algorithm to flexibly impose such regularization with general penalty functions and hard constraints. In order to address this challenge and to avoid the implicit estimation, in this paper, we propose an algorithm for fitting PARAFAC2 based on alternating optimization with the alternating direction method of multipliers (AO-ADMM). With numerical experiments on simulated data, we show that the proposed PARAFAC2 AO-ADMM approach allows for flexible constraints, recovers the underlying patterns accurately, and is computationally efficient compared to the state-of-the-art. We also apply our model to two real-world datasets from neuroscience and chemometrics, and show that constraining the evolving mode improves the interpretability of the extracted patterns.
    FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning. (arXiv:2207.00555v1 [eess.AS])
    Large-scale speech self-supervised learning (SSL) has emerged as a major field of speech processing; however, the computational cost arising from its vast model size creates a high entry barrier for academia. In addition, existing distillation techniques for speech SSL models compress the model by reducing layers, which induces performance degradation in linguistic pattern recognition tasks such as phoneme recognition (PR). In this paper, we propose FitHuBERT, which is thinner in dimension throughout almost all model components and deeper in layers compared to prior speech SSL distillation work. Moreover, we employ a time-reduction layer to speed up inference and propose a hint-based distillation method to reduce performance degradation. Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT. Also, we achieve a 12.1% word error rate and a 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
    ReLU Deep Neural Networks from the Hierarchical Basis Perspective. (arXiv:2105.04156v2 [math.NA] UPDATED)
    We study ReLU deep neural networks (DNNs) by investigating their connections with the hierarchical basis method in finite element methods. First, we show that the approximation schemes of ReLU DNNs for $x^2$ and $xy$ are composition versions of the hierarchical basis approximation for these two functions. Based on this fact, we obtain a geometric interpretation and systematic proof for the approximation result of ReLU DNNs for polynomials, which plays an important role in a series of recent exponential approximation results of ReLU DNNs. Through our investigation of connections between ReLU DNNs and the hierarchical basis approximation for $x^2$ and $xy$, we show that ReLU DNNs with this special structure can be applied only to approximate quadratic functions. Furthermore, we obtain a concise representation to explicitly reproduce any linear finite element function on a two-dimensional uniform mesh by using ReLU DNNs with only two hidden layers.
    Better Methods and Theory for Federated Learning: Compression, Client Selection and Heterogeneity. (arXiv:2207.00392v1 [cs.LG])
    Federated learning (FL) is an emerging machine learning paradigm involving multiple clients, e.g., mobile phone devices, with an incentive to collaborate in solving a machine learning problem coordinated by a central server. FL was proposed in 2016 by Kone\v{c}n\'{y} et al. and McMahan et al. as a viable privacy-preserving alternative to traditional centralized machine learning since, by construction, the training data points are decentralized and never transferred by the clients to a central server. Therefore, to a certain degree, FL mitigates the privacy risks associated with centralized data collection. Unfortunately, optimization for FL faces several specific issues that centralized optimization usually does not need to handle. In this thesis, we identify several of these challenges and propose new methods and algorithms to address them, with the ultimate goal of enabling practical FL solutions supported with mathematically rigorous guarantees.
    Simulating financial time series using attention. (arXiv:2207.00493v1 [q-fin.ST])
    Financial time series simulation is a central topic since it extends the limited real data available for training and evaluating trading strategies. It is also challenging because of the complex statistical properties of real financial data. We introduce two generative adversarial networks (GANs), one using convolutional networks with attention and one using transformers, for financial time series simulation. The GANs learn the statistical properties in a data-driven manner, and the attention mechanism helps to replicate long-range dependencies. The proposed GANs are tested on S&P 500 index and option data, evaluated with scores based on the stylized facts, and compared with a pure convolutional GAN, i.e. QuantGAN. The attention-based GANs not only reproduce the stylized facts, but also smooth the autocorrelation of returns.
    Off-the-grid learning of sparse mixtures from a continuous dictionary. (arXiv:2207.00171v1 [stat.ML])
    We consider a general non-linear model where the signal is a finite mixture of an unknown, possibly increasing, number of features issued from a continuous dictionary parameterized by a real nonlinear parameter. The signal is observed with Gaussian (possibly correlated) noise in either a continuous or a discrete setup. We propose an off-the-grid optimization method, that is, a method which does not use any discretization scheme on the parameter space, to estimate both the non-linear parameters of the features and the linear parameters of the mixture. We use recent results on the geometry of off-the-grid methods to give minimal separation conditions on the true underlying non-linear parameters such that interpolating certificate functions can be constructed. Using tail bounds for suprema of Gaussian processes, we bound the prediction error with high probability. Assuming that the certificate functions can be constructed, our prediction error bound is, up to log-factors, similar to the rates attained by the Lasso predictor in the linear regression model. We also establish convergence rates that quantify with high probability the quality of estimation for both the linear and the non-linear parameters.
    Modular Lifelong Reinforcement Learning via Neural Composition. (arXiv:2207.00429v1 [cs.LG])
    Humans commonly solve complex problems by decomposing them into easier subproblems and then combining the subproblem solutions. This type of compositional reasoning permits reuse of the subproblem solutions when tackling future tasks that share part of the underlying compositional structure. In a continual or lifelong reinforcement learning (RL) setting, this ability to decompose knowledge into reusable components would enable agents to quickly learn new RL tasks by leveraging accumulated compositional structures. We explore a particular form of composition based on neural modules and present a set of RL problems that intuitively admit compositional solutions. Empirically, we demonstrate that neural composition indeed captures the underlying structure of this space of problems. We further propose a compositional lifelong RL method that leverages accumulated neural components to accelerate the learning of future tasks while retaining performance on previous tasks via off-line RL over replayed experiences.
    A Neural Network Based Novel Test Selector. (arXiv:2207.00445v1 [cs.SE])
    Machine learning (ML) has been used to accelerate the progress of functional coverage in simulation-based verification. Supervised ML algorithms, the prevalent option in previous work, are used to bias the test generation or to filter the generated tests. However, for missing coverage events, these algorithms lack positive examples to learn from in the training phase. Therefore, the tests generated or filtered by the algorithms cannot effectively fill the coverage holes. This is more severe when verifying large-scale designs because the coverage space is larger and the functionalities are more complex. This paper presents a configurable framework for neural-network (NN) based novel test selection (NNBNTS), which can achieve a similar coverage gain as random simulation with far less simulation effort under three configurations of the framework. Moreover, the performance of the framework is not limited by the number of coverage events being hit. A commercial signal processing unit is used in the experiment to demonstrate the effectiveness of the framework. Compared to random simulation, NNBNTS can reduce up to 53.74% of simulation time to reach the 99% coverage level.
    A geometric framework for outlier detection in high-dimensional data. (arXiv:2207.00367v1 [stat.ML])
    Outlier or anomaly detection is an important task in data analysis. We discuss the problem from a geometrical perspective and provide a framework that exploits the metric structure of a data set. Our approach rests on the manifold assumption, i.e., that the observed, nominally high-dimensional data lie on a much lower dimensional manifold and that this intrinsic structure can be inferred with manifold learning methods. We show that exploiting this structure significantly improves the detection of outlying observations in high-dimensional data. We also suggest a novel, mathematically precise, and widely applicable distinction between distributional and structural outliers based on the geometry and topology of the data manifold that clarifies conceptual ambiguities prevalent throughout the literature. Our experiments focus on functional data as one class of structured high-dimensional data, but the framework we propose is completely general and we include image and graph data applications. Our results show that the outlier structure of high-dimensional and non-tabular data can be detected and visualized using manifold learning methods and quantified using standard outlier scoring methods applied to the manifold embedding vectors.
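    As a concrete instance of the two-step recipe (infer the manifold, then score outliers on the embedding vectors), the following sketch uses off-the-shelf scikit-learn components; the specific choices of Isomap and the local outlier factor are illustrative assumptions, not necessarily the configurations used in the paper.

        import numpy as np
        from sklearn.manifold import Isomap
        from sklearn.neighbors import LocalOutlierFactor

        rng = np.random.default_rng(0)

        # nominally 50-dimensional data lying near a low-dimensional manifold
        t = rng.uniform(0, 1, size=300)
        basis = rng.standard_normal((3, 50))
        X = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), t], axis=1) @ basis
        X += 0.01 * rng.standard_normal(X.shape)
        X[:5] += 2.0 * rng.standard_normal((5, 50))     # plant a few outliers

        # step 1: infer the intrinsic structure with a manifold learning method
        emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

        # step 2: apply a standard outlier score to the embedding vectors
        lof = LocalOutlierFactor(n_neighbors=10)
        labels = lof.fit_predict(emb)                   # -1 flags outliers
        scores = -lof.negative_outlier_factor_
        print(np.argsort(scores)[-5:])                  # most outlying observations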
    Optimizing Training Trajectories in Variational Autoencoders via Latent Bayesian Optimization Approach. (arXiv:2207.00128v1 [cs.LG])
    Unsupervised and semi-supervised ML methods such as variational autoencoders (VAE) have become widely adopted across multiple areas of physics, chemistry, and materials science due to their capability to disentangle representations and to find latent manifolds for classification and regression of complex experimental data. Like other ML problems, VAEs require hyperparameter tuning, e.g., balancing the Kullback-Leibler (KL) and reconstruction terms. However, the training process and the resulting manifold topology and connectivity depend not only on the hyperparameters, but also on their evolution during training. Because exhaustive search in a high-dimensional hyperparameter space is inefficient for models that are expensive to train, we explore a latent Bayesian optimization (zBO) approach for hyperparameter trajectory optimization in unsupervised and semi-supervised ML, and demonstrate it for a joint-VAE with rotational invariances. We apply this method to find joint discrete and continuous rotationally invariant representations for MNIST and for experimental data of a plasmonic nanoparticle material system. The performance of the proposed approach is discussed extensively; it allows for high-dimensional hyperparameter tuning and trajectory optimization of other ML models as well.  ( 2 min )
    Fast computation of rankings from pairwise comparisons. (arXiv:2207.00076v1 [stat.ML])
    We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons using the Bradley-Terry model. Maximum-likelihood estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that solves the same problem much faster -- over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive some results regarding its convergence.
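    For reference, the century-old iteration mentioned here (Zermelo's, in its standard majorize-minimize form) fits in a few lines of numpy; the paper's contribution is a different, similarly simple iteration that converges much faster, which this sketch does not reproduce.

        import numpy as np

        def bradley_terry_mm(W, n_iter=1000, tol=1e-10):
            """Classical MM/Zermelo iteration for Bradley-Terry strengths.

            W[i, j] = number of times i beat j. Returns strengths pi,
            normalized to sum to 1 (the model is scale-invariant).
            """
            n = W.shape[0]
            N = W + W.T                      # games played between each pair
            wins = W.sum(axis=1)             # total wins of each player
            pi = np.ones(n)
            for _ in range(n_iter):
                # pi_i <- wins_i / sum_j N_ij / (pi_i + pi_j)
                denom = (N / (pi[:, None] + pi[None, :])).sum(axis=1)
                new_pi = wins / denom
                new_pi /= new_pi.sum()       # fix the scale
                if np.abs(new_pi - pi).max() < tol:
                    return new_pi
                pi = new_pi
            return pi

        # toy usage: 4 players with a clear ordering
        W = np.array([[0, 8, 9, 10],
                      [2, 0, 7,  9],
                      [1, 3, 0,  8],
                      [0, 1, 2,  0]])
        print(bradley_terry_mm(W))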
    GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation. (arXiv:2207.00106v1 [cs.CV])
    Parkinson's disease (PD) is a neurological disorder that has a variety of observable motor-related symptoms such as slow movement, tremor, muscular rigidity, and impaired posture. PD is typically diagnosed by evaluating the severity of motor impairments according to scoring systems such as the Movement Disorder Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS). Automated severity prediction using video recordings of individuals provides a promising route for non-intrusive monitoring of motor impairments. However, the limited size of PD gait datasets hinders model capability and clinical potential. Because of this clinical data scarcity, and inspired by recent advances in self-supervised large-scale language models like GPT-3, we use human motion forecasting as an effective self-supervised pre-training task for the estimation of motor impairment severity. We introduce GaitForeMer, Gait Forecasting and impairment estimation transforMer, which is first pre-trained on public datasets to forecast gait movements and then applied to clinical data to predict MDS-UPDRS gait impairment severity. Our method outperforms previous approaches that rely solely on clinical data by a large margin, achieving an F1 score of 0.76, precision of 0.79, and recall of 0.75. Using GaitForeMer, we show how public human movement data repositories can assist clinical use cases through learning universal motion representations. The code is available at https://github.com/markendo/GaitForeMer .  ( 3 min )
    Sustainable Computing -- Without the Hot Air. (arXiv:2207.00081v1 [cs.CY])
    The demand for computing is continuing to grow exponentially. This growth will translate to exponential growth in computing's energy consumption unless improvements in its energy-efficiency can outpace increases in its demand. Yet, after decades of research, further improving energy-efficiency is becoming increasingly challenging, as it is already highly optimized. As a result, at some point, increases in computing demand are likely to outpace increases in its energy-efficiency, potentially by a wide margin. Such exponential growth, if left unchecked, will position computing as a substantial contributor to global carbon emissions. While prominent technology companies have recognized the problem and sought to reduce their carbon emissions, they understandably focus on their successes, which has the potential to inadvertently convey the false impression that this is now, or will soon be, a solved problem. Such false impressions can be counterproductive if they serve to discourage further research in this area, since, as we discuss, eliminating computing's, and more generally society's, carbon emissions is far from a solved problem. To better understand the problem's scope, this paper distills the fundamental trends that determine computing's carbon footprint and their implications for achieving sustainable computing.
    Multi-Objective Coordination Graphs for the Expected Scalarised Returns with Generative Flow Models. (arXiv:2207.00368v1 [cs.AI])
    Many real-world problems contain multiple objectives and agents, where a trade-off exists between objectives. Key to solving such problems is to exploit sparse dependency structures that exist between agents. For example, in wind farm control a trade-off exists between maximising power and minimising stress on the system's components. Dependencies between turbines arise due to the wake effect. We model such sparse dependencies between agents as a multi-objective coordination graph (MO-CoG). In multi-objective reinforcement learning a utility function is typically used to model a user's preferences over objectives, which may be unknown a priori. In such settings a set of optimal policies must be computed. Which policies are optimal depends on which optimality criterion applies. If the utility function of a user is derived from multiple executions of a policy, the scalarised expected returns (SER) must be optimised. If the utility of a user is derived from a single execution of a policy, the expected scalarised returns (ESR) criterion must be optimised. For example, wind farms are subjected to constraints and regulations that must be adhered to at all times, therefore the ESR criterion must be optimised. For MO-CoGs, the state-of-the-art algorithms can only compute a set of optimal policies for the SER criterion, leaving the ESR criterion understudied. To compute a set of optimal policies under the ESR criterion, also known as the ESR set, distributions over the returns must be maintained. Therefore, to compute a set of optimal policies under the ESR criterion for MO-CoGs, we present a novel distributional multi-objective variable elimination (DMOVE) algorithm. We evaluate DMOVE in realistic wind farm simulations. Given that the returns in real-world wind farm settings are continuous, we utilise a model known as real-NVP to learn the continuous return distributions to calculate the ESR set.
    Smart Application for Fall Detection Using Wearable ECG & Accelerometer Sensors. (arXiv:2207.00008v1 [cs.HC])
    Timely and reliable detection of falls is a large and rapidly growing field of research due to the medical and financial demands of caring for a constantly growing elderly population. Within the past two decades, the availability of high-quality hardware (high-quality sensors and AI microchips) and software (machine learning algorithms) has served as a catalyst for this research by giving developers the capability to develop such systems. This study developed multiple application components in order to investigate the development challenges and choices for fall detection systems, and to provide materials for future research. The smart application developed using this methodology was validated by the results from fall detection modelling experiments and mobile model deployment. The best performing model overall was the ResNet152 on a standardised and shuffled dataset with a 2s window size, which achieved 92.8% AUC, 7.28% sensitivity, and 98.33% specificity. Given these results it is evident that accelerometer and ECG sensors are beneficial for fall detection, and allow for the discrimination between falls and other activities. This study leaves a significant amount of room for improvement due to weaknesses identified in the resultant dataset. These improvements include using a labelling protocol for the critical phase of a fall, increasing the number of dataset samples, improving the test subject representation, and experimenting with frequency-domain preprocessing.
    Multivariate Probabilistic Forecasting of Intraday Electricity Prices using Normalizing Flows. (arXiv:2205.13826v2 [cs.LG] UPDATED)
    Electricity is traded on various markets with different time horizons and regulations. Short-term trading becomes increasingly important due to higher penetration of renewables. In Germany, the intraday electricity price typically fluctuates around the day-ahead price of the EPEX spot markets in a distinct hourly pattern. This work proposes a probabilistic modeling approach that models the intraday price difference to the day-ahead contracts. The model captures the emerging hourly pattern by considering the four 15 min intervals in each day-ahead price interval as a four-dimensional joint distribution. The resulting nontrivial, multivariate price difference distribution is learned using a normalizing flow, i.e., a deep generative model that combines conditional multivariate density estimation and probabilistic regression. The normalizing flow is compared to a selection of historical data, a Gaussian copula, and a Gaussian regression model. Among the different models, the normalizing flow identifies the trends most accurately and has the narrowest prediction intervals. Notably, the normalizing flow is the only approach that identifies rare price peaks. Finally, this work discusses the influence of different external impact factors and finds that, individually, most of these factors have negligible impact. Only the immediate history of the price difference realization and the combination of all input factors lead to notable improvements in the forecasts.
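    Since the model treats the four 15 min intervals as a four-dimensional joint distribution, a coupling-based flow is a natural parameterization. Below is a minimal PyTorch sketch of one RealNVP-style affine coupling layer with a standard normal base distribution; the conditioning on external inputs and the stacking of several layers, which the full model would need, are omitted.

        import torch
        import torch.nn as nn

        class AffineCoupling(nn.Module):
            """One RealNVP-style coupling layer for a 4-D price-difference vector.

            Half the coordinates pass through unchanged and parameterize an
            affine map of the other half; the Jacobian log-determinant is the
            sum of the log-scales, so the density stays tractable.
            """
            def __init__(self, dim=4, hidden=64):
                super().__init__()
                self.d = dim // 2
                self.net = nn.Sequential(
                    nn.Linear(self.d, hidden), nn.ReLU(),
                    nn.Linear(hidden, 2 * (dim - self.d)),
                )

            def forward(self, x):
                x1, x2 = x[:, :self.d], x[:, self.d:]
                s, t = self.net(x1).chunk(2, dim=1)
                s = torch.tanh(s)                      # keep scales well-behaved
                y2 = x2 * torch.exp(s) + t
                log_det = s.sum(dim=1)
                return torch.cat([x1, y2], dim=1), log_det

        # toy maximum-likelihood step under a standard normal base distribution
        flow = AffineCoupling()
        x = torch.randn(32, 4)                          # batch of 15 min price deltas
        z, log_det = flow(x)
        base = torch.distributions.Normal(0.0, 1.0)
        log_prob = base.log_prob(z).sum(dim=1) + log_det
        loss = -log_prob.mean()
        loss.backward()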
    Improving Speech Enhancement through Fine-Grained Speech Characteristics. (arXiv:2207.00237v1 [cs.SD])
    While deep learning based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and can sound unnatural. We propose a novel approach to speech enhancement aimed at improving the perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acoustic parameters that have been found to correlate well with voice quality (e.g. jitter, shimmer, and spectral flux) and then propose objective functions aimed at reducing the difference between clean speech and enhanced speech with respect to these features. The full set of acoustic features is the extended Geneva Acoustic Parameter Set (eGeMAPS), which includes 25 different attributes associated with the perception of speech. Given the non-differentiable nature of these feature computations, we first build differentiable estimators of the eGeMAPS features and then use them to fine-tune existing speech enhancement systems. Our approach is generic and can be applied to any existing deep learning based enhancement system to further improve the enhanced speech signals. Experimental results on the Deep Noise Suppression (DNS) Challenge dataset show that our approach can improve state-of-the-art deep learning based enhancement systems.
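    The fine-tuning objective can be sketched as a base enhancement loss plus a feature-matching penalty computed through the differentiable estimators. In the PyTorch sketch below, FeatureEstimator is a hypothetical stand-in for the paper's learned eGeMAPS estimators, and the loss weighting is an assumption.

        import torch
        import torch.nn as nn

        class FeatureEstimator(nn.Module):
            """Hypothetical differentiable estimator of acoustic parameters
            (e.g., jitter, shimmer, spectral flux) from a raw waveform."""
            def __init__(self, n_feats=25):
                super().__init__()
                self.body = nn.Sequential(
                    nn.Conv1d(1, 16, kernel_size=400, stride=160), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, n_feats),
                )

            def forward(self, wav):                 # wav: (batch, samples)
                return self.body(wav.unsqueeze(1))  # -> (batch, n_feats)

        def fine_tune_loss(enhanced, clean, estimator, lam=0.1):
            # base enhancement loss on waveforms plus a penalty on the
            # difference of estimated acoustic features (estimator frozen)
            base = torch.mean(torch.abs(enhanced - clean))
            feat = torch.mean((estimator(enhanced) - estimator(clean).detach()) ** 2)
            return base + lam * feat

        # toy usage: gradients flow to the enhanced signal, not the estimator
        estimator = FeatureEstimator().eval()
        for p in estimator.parameters():
            p.requires_grad_(False)
        enhanced = torch.randn(2, 16000, requires_grad=True)
        clean = torch.randn(2, 16000)
        fine_tune_loss(enhanced, clean, estimator).backward()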
    Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes. (arXiv:2207.00486v1 [cs.LG])
    A determinantal point process (DPP) is an elegant model that assigns a probability to every subset of a collection of $n$ items. While conventionally a DPP is parameterized by a symmetric kernel matrix, removing this symmetry constraint, resulting in nonsymmetric DPPs (NDPPs), leads to significant improvements in modeling power and predictive performance. Recent work has studied an approximate Markov chain Monte Carlo (MCMC) sampling algorithm for NDPPs restricted to size-$k$ subsets (called $k$-NDPPs). However, the runtime of this approach is quadratic in $n$, making it infeasible for large-scale settings. In this work, we develop a scalable MCMC sampling algorithm for $k$-NDPPs with low-rank kernels, thus enabling runtime that is sublinear in $n$. Our method is based on a state-of-the-art NDPP rejection sampling algorithm, which we enhance with a novel approach for efficiently constructing the proposal distribution. Furthermore, we extend our scalable $k$-NDPP sampling algorithm to NDPPs without size constraints. Our resulting sampling method has polynomial time complexity in the rank of the kernel, while the existing approach has runtime that is exponential in the rank. With both a theoretical analysis and experiments on real-world datasets, we verify that our scalable approximate sampling algorithms are orders of magnitude faster than existing sampling approaches for $k$-NDPPs and NDPPs.
    CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data. (arXiv:2006.04942v2 [cs.SI] UPDATED)
    We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop both a Monte Carlo EM as well as a message passing algorithm to infer contact-channel specific infection transmission probabilities. Our Monte Carlo algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor $R_0$ and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the block-Gibbs sampling algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at https://github.com/zalandoresearch/CRISP.
    A Convergent and Dimension-Independent Min-Max Optimization Algorithm. (arXiv:2006.12376v6 [cs.LG] UPDATED)
    We study a variant of a recently introduced min-max optimization framework where the max-player is constrained to update its parameters in a greedy manner until it reaches a first-order stationary point. Our equilibrium definition for this framework depends on a proposal distribution which the min-player uses to choose directions in which to update its parameters. We show that, given a smooth and bounded nonconvex-nonconcave objective function, access to any proposal distribution for the min-player's updates, and a stochastic gradient oracle for the max-player, our algorithm converges to the aforementioned approximate local equilibrium in a number of iterations that does not depend on the dimension. The equilibrium point found by our algorithm depends on the proposal distribution, and when applying our algorithm to train GANs we choose the proposal distribution to be a distribution of stochastic gradients. We empirically evaluate our algorithm on challenging nonconvex-nonconcave test functions and loss functions arising in GAN training. Our algorithm converges on these test functions and, when used to train GANs, trains stably on synthetic and real-world datasets and avoids mode collapse.
    KL-UCB-switch: optimal regret bounds for stochastic bandits from both a distribution-dependent and a distribution-free viewpoints. (arXiv:1805.05071v3 [stat.ML] UPDATED)
    We consider $K$-armed stochastic bandits and study cumulative regret bounds up to time $T$. We are interested in strategies achieving simultaneously a distribution-free regret bound of optimal order $\sqrt{KT}$ and a distribution-dependent regret that is asymptotically optimal, that is, matching the $\kappa\ln T$ lower bound by Lai and Robbins (1985) and Burnetas and Katehakis (1996), where $\kappa$ is the optimal problem-dependent constant. This constant $\kappa$ depends on the model $\mathcal{D}$ considered (the family of possible distributions over the arms). M\'enard and Garivier (2017) provided strategies achieving such a bi-optimality in the parametric case of models given by one-dimensional exponential families, while Lattimore (2016, 2018) did so for the family of (sub)Gaussian distributions with variance less than $1$. We extend this result to the non-parametric case of all distributions over $[0,1]$. We do so by combining the MOSS strategy by Audibert and Bubeck (2009), which enjoys a distribution-free regret bound of optimal order $\sqrt{KT}$, and the KL-UCB strategy by Capp\'e et al. (2013), for which we provide in passing the first analysis of an optimal distribution-dependent $\kappa\ln T$ regret bound in the model of all distributions over $[0,1]$. We were able to obtain this non-parametric bi-optimality result while working hard to streamline the proofs (of previously known regret bounds and thus of the new analyses carried out); a second merit of the present contribution is therefore to provide a review of proofs of classical regret bounds for index-based strategies for $K$-armed stochastic bandits.
    auton-survival: an Open-Source Package for Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Event Data. (arXiv:2204.07276v3 [cs.LG] UPDATED)
    Applications of machine learning in healthcare often require working with time-to-event prediction tasks including prognostication of an adverse event, re-hospitalization or death. Such outcomes are typically subject to censoring due to loss of follow up. Standard machine learning methods cannot be applied in a straightforward manner to datasets with censored outcomes. In this paper, we present auton-survival, an open-source repository of tools to streamline working with censored time-to-event or survival data. auton-survival includes tools for survival regression, adjustment in the presence of domain shift, counterfactual estimation, phenotyping for risk stratification, evaluation, as well as estimation of treatment effects. Through real world case studies employing a large subset of the SEER oncology incidence data, we demonstrate the ability of auton-survival to rapidly support data scientists in answering complex health and epidemiological questions.
    Machine Learning and Deep Learning -- A review for Ecologists. (arXiv:2204.05023v2 [q-bio.QM] UPDATED)
    The popularity of Machine learning (ML), Deep learning (DL), and Artificial intelligence (AI) has sharply risen in recent years. Despite their spike in popularity, the inner workings of ML and DL algorithms are perceived as opaque, and their relationship to classical data analysis tools remains debated. It is often assumed that ML and DL excel primarily at making predictions. Recently, however, they have been increasingly used for classical analytical tasks traditionally covered by statistical models. Moreover, recent reviews on ML have focused exclusively on DL, missing out on synthesizing the wealth of ML algorithms with different advantages and general principles. Here, we provide a comprehensive overview of the field of ML and DL, starting with its historical developments, the existing algorithm families, their differences from traditional statistical tools, and universal ML principles. We then discuss why and when ML and DL models excel at prediction tasks and where they could offer alternatives to traditional statistical methods for inference, highlighting current and emerging applications for ecological problems. Finally, we summarize emerging trends such as scientific and causal ML, explainable AI, and responsible AI that may significantly impact ecological data analysis in the future.
    Rethinking Optimization with Differentiable Simulation from a Global Perspective. (arXiv:2207.00167v1 [stat.ML])
    Differentiable simulation is a promising toolkit for fast gradient-based policy optimization and system identification. However, existing approaches to differentiable simulation have largely tackled scenarios where obtaining smooth gradients has been relatively easy, such as systems with mostly smooth dynamics. In this work, we study the challenges that differentiable simulation presents when it is not feasible to expect that a single descent reaches a global optimum, which is often a problem in contact-rich scenarios. We analyze the optimization landscapes of diverse scenarios that contain both rigid bodies and deformable objects. In dynamic environments with highly deformable objects and fluids, differentiable simulators produce rugged landscapes with nonetheless useful gradients in some parts of the space. We propose a method that combines Bayesian optimization with semi-local 'leaps' to obtain a global search method that can use gradients effectively, while also maintaining robust performance in regions with noisy gradients. We show that our approach outperforms several gradient-based and gradient-free baselines on an extensive set of experiments in simulation, and also validate the method using experiments with a real robot and deformables. Videos and supplementary materials are available at https://tinyurl.com/globdiff  ( 2 min )
    Local manifold learning and its link to domain-based physics knowledge. (arXiv:2207.00275v1 [physics.flu-dyn])
    In many reacting flow systems, the thermo-chemical state-space is known or assumed to evolve close to a low-dimensional manifold (LDM). Various approaches are available to obtain those manifolds and subsequently express the original high-dimensional space with fewer parameterizing variables. Principal component analysis (PCA) is one of the dimensionality reduction methods that can be used to obtain LDMs. PCA does not make prior assumptions about the parameterizing variables and retrieves them empirically from the training data. In this paper, we show that PCA applied in local clusters of data (local PCA) is capable of detecting the intrinsic parameterization of the thermo-chemical state-space. We first demonstrate this using three common combustion models of varying complexity: the Burke-Schumann model, the chemical equilibrium model and the homogeneous reactor. The parameterization of these models is known a priori, which allows for benchmarking the local PCA approach. We further extend the application of local PCA to a more challenging case of a turbulent non-premixed $n$-heptane/air jet flame for which the parameterization is no longer obvious. Our results suggest that meaningful parameterization can be obtained also for more complex datasets. We show that local PCA finds variables that can be linked to local stoichiometry, reaction progress and soot formation processes.
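    The core computation is straightforward to prototype: partition the observations into local clusters, then run PCA independently within each cluster and inspect the locally dominant directions. A scikit-learn sketch under that reading (k-means is one convenient clustering choice; the paper's clustering setup may differ):

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.decomposition import PCA

        def local_pca(X, n_clusters=5, n_components=2, seed=0):
            """PCA applied in local clusters of the data.

            Returns cluster labels and, per cluster, the locally dominant
            directions with their explained-variance ratios; on reacting-flow
            data these local components are what gets linked to quantities
            such as stoichiometry or reaction progress.
            """
            labels = KMeans(n_clusters=n_clusters, random_state=seed,
                            n_init=10).fit_predict(X)
            components = {}
            for k in range(n_clusters):
                pca = PCA(n_components=n_components).fit(X[labels == k])
                components[k] = (pca.components_, pca.explained_variance_ratio_)
            return labels, components

        # toy state-space: 1000 samples of 10 "thermo-chemical" variables
        rng = np.random.default_rng(0)
        X = rng.standard_normal((1000, 10))
        labels, components = local_pca(X)
        for k, (vecs, evr) in components.items():
            print(k, np.round(evr, 3))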
    Robust subgroup discovery. (arXiv:2103.13686v4 [cs.LG] UPDATED)
    We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.  ( 3 min )
    Latent Gaussian Model Boosting. (arXiv:2105.08966v5 [cs.LG] UPDATED)
    Latent Gaussian models and boosting are widely used techniques in statistics and machine learning. Tree-boosting shows excellent prediction accuracy on many data sets, but potential drawbacks are that it assumes conditional independence of samples, produces discontinuous predictions for, e.g., spatial data, and it can have difficulty with high-cardinality categorical variables. Latent Gaussian models, such as Gaussian process and grouped random effects models, are flexible prior models which explicitly model dependence among samples and which allow for efficient learning of predictor functions and for making probabilistic predictions. However, existing latent Gaussian models usually assume either a zero or a linear prior mean function which can be an unrealistic assumption. This article introduces a novel approach that combines boosting and latent Gaussian models to remedy the above-mentioned drawbacks and to leverage the advantages of both techniques. We obtain increased prediction accuracy compared to existing approaches in both simulated and real-world data experiments.  ( 2 min )
    A Random Persistence Diagram Generator. (arXiv:2104.07737v3 [stat.ML] UPDATED)
    Topological data analysis (TDA) studies the shape patterns of data. Persistent homology is a widely used method in TDA that summarizes homological features of data at multiple scales and stores them in persistence diagrams (PDs). In this paper, we propose a random persistence diagram generator (RPDG) method that generates a sequence of random PDs from the ones produced by the data. RPDG is underpinned by a model based on pairwise interacting point processes, and a reversible jump Markov chain Monte Carlo (RJ-MCMC) algorithm. A first example, which is based on a synthetic dataset, demonstrates the efficacy of RPDG and provides a comparison with another method for sampling PDs. A second example demonstrates the utility of RPDG to solve a materials science problem given a real dataset of small sample size.  ( 2 min )
    Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity. (arXiv:2205.15466v3 [cs.LG] UPDATED)
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originated from cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.  ( 3 min )
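    The Maximum Sample Reuse idea is that every evaluated subset contributes to the estimate for every data point: a point's Banzhaf value is approximated by the difference between the mean utility of sampled subsets that contain it and of those that do not. A sketch with a hypothetical utility function standing in for model training and evaluation:

        import numpy as np

        def banzhaf_msr(utility, n_points, n_samples=2000, seed=0):
            """Maximum-Sample-Reuse (MSR) estimate of Banzhaf data values.

            utility(mask) maps a boolean inclusion mask over the training
            points to a performance score (e.g., validation accuracy after
            training on the selected points). Every sampled subset is reused
            in the estimate for every point.
            """
            rng = np.random.default_rng(seed)
            masks = rng.random((n_samples, n_points)) < 0.5   # subsets u.a.r.
            scores = np.array([utility(m) for m in masks])
            values = np.empty(n_points)
            for i in range(n_points):
                inside = masks[:, i]
                # with this many samples, both groups are nonempty w.h.p.
                values[i] = scores[inside].mean() - scores[~inside].mean()
            return values

        # hypothetical utility: points 0-4 help, the rest slightly hurt
        def utility(mask):
            return float(mask[:5].sum() - 0.2 * mask[5:].sum())

        print(np.round(banzhaf_msr(utility, n_points=10), 2))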
    Community detection and percolation of information in a geometric setting. (arXiv:2006.15574v2 [stat.ML] UPDATED)
    We make the first steps towards generalizing the theory of stochastic block models, in the sparse regime, towards a model where the discrete community structure is replaced by an underlying geometry. We consider a geometric random graph over a homogeneous metric space where the probability of two vertices to be connected is an arbitrary function of the distance. We give sufficient conditions under which the locations can be recovered (up to an isomorphism of the space) in the sparse regime. Moreover, we define a geometric counterpart of the model of flow of information on trees, due to Mossel and Peres, in which one considers a branching random walk on a sphere and the goal is to recover the location of the root based on the locations of leaves. We give some sufficient conditions for percolation and for non-percolation of information in this model.  ( 2 min )
    Distributed saddle point problems for strongly concave-convex functions. (arXiv:2202.05812v2 [math.OC] UPDATED)
    In this paper, we propose GT-GDA, a distributed optimization method to solve saddle point problems of the form: $\min_{\mathbf{x}} \max_{\mathbf{y}} \{F(\mathbf{x},\mathbf{y}) := G(\mathbf{x}) + \langle \mathbf{y}, \overline{P} \mathbf{x} \rangle - H(\mathbf{y})\}$, where the functions $G(\cdot)$, $H(\cdot)$, and the coupling matrix $\overline{P}$ are distributed over a strongly connected network of nodes. GT-GDA is a first-order method that uses gradient tracking to eliminate the dissimilarity caused by heterogeneous data distribution among the nodes. In the most general form, GT-GDA includes a consensus over the local coupling matrices to achieve the optimal (unique) saddle point, however, at the expense of increased communication. To avoid this, we propose a more efficient variant GT-GDA-Lite that does not incur the additional communication and analyze its convergence in various scenarios. We show that GT-GDA converges linearly to the unique saddle point solution when $G(\cdot)$ is smooth and convex, $H(\cdot)$ is smooth and strongly convex, and the global coupling matrix $\overline{P}$ has full column rank. We further characterize the regime under which GT-GDA exhibits a network topology-independent convergence behavior. We next show the linear convergence of GT-GDA to an error around the unique saddle point, which goes to zero when the coupling cost ${\langle \mathbf y, \overline{P} \mathbf x \rangle}$ is common to all nodes, or when $G(\cdot)$ and $H(\cdot)$ are quadratic. Numerical experiments illustrate the convergence properties and importance of GT-GDA and GT-GDA-Lite for several applications.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v1 [cs.CR])
    Recent works show that random neural networks are vulnerable against adversarial attacks [Daniely and Schacham, 2020] and that such attacks can be easily found using a single step of gradient descent [Bubeck et al., 2021]. In this work, we take it one step further and show that a single gradient step can find adversarial examples for networks trained in the so-called lazy regime. This regime is interesting because even though the neural network weights remain close to the initialization, there exist networks with small generalization error, which can be found efficiently using first-order methods. Our work challenges the model of the lazy regime, the dominant regime in which neural networks are provably efficiently learnable. We show that the networks trained in this regime, even though they enjoy good theoretical computational guarantees, remain vulnerable to adversarial examples. To the best of our knowledge, this is the first work to prove that such well-generalizable neural networks are still vulnerable to adversarial attacks.  ( 2 min )
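    The single gradient step on the input referenced here is, in effect, the fast gradient sign method; a minimal PyTorch sketch, where the small random network merely stands in for a lazily trained one:

        import torch
        import torch.nn as nn

        def single_step_attack(model, x, y, eps=0.03):
            """One gradient step on the input (FGSM-style) to search for an
            adversarial example near x."""
            x_adv = x.clone().detach().requires_grad_(True)
            loss = nn.functional.cross_entropy(model(x_adv), y)
            loss.backward()
            return (x_adv + eps * x_adv.grad.sign()).detach()

        # toy usage
        model = nn.Sequential(nn.Linear(100, 512), nn.ReLU(), nn.Linear(512, 10))
        x = torch.randn(8, 100)
        y = torch.randint(0, 10, (8,))
        x_adv = single_step_attack(model, x, y)
        # fraction of predictions left unchanged by the attack
        print((model(x).argmax(1) == model(x_adv).argmax(1)).float().mean())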
    Variational Inference for Additive Main and Multiplicative Interaction Effects Models. (arXiv:2207.00011v1 [stat.ML])
    In plant breeding the presence of a genotype by environment (GxE) interaction has a strong impact on cultivation decision making and the introduction of new crop cultivars. The combination of linear and bilinear terms has been shown to be very useful in modelling this type of data. A widely-used approach to identify GxE is the Additive Main Effects and Multiplicative Interaction Effects (AMMI) model. However, as data frequently can be high-dimensional, Markov chain Monte Carlo (MCMC) approaches can be computationally infeasible. In this article, we consider a variational inference approach for such a model. We derive variational approximations for estimating the parameters and we compare the approximations to MCMC using both simulated and real data. The new inferential framework we propose is on average two times faster whilst maintaining the same predictive performance as MCMC.  ( 2 min )
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v1 [cs.LG])
    Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients evaluated near a local optimum are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning.
    Improved Generalization Bounds for Adversarially Robust Learning. (arXiv:1810.02180v5 [cs.LG] UPDATED)
    We consider a model of robust learning in an adversarial environment. The learner gets uncorrupted training data with access to possible corruptions that may be affected by the adversary during testing. The learner's goal is to build a robust classifier, which will be tested on future adversarial examples. The adversary is limited to $k$ possible corruptions for each input. We model the learner-adversary interaction as a zero-sum game. This model is closely related to the adversarial examples model of Schmidt et al. (2018); Madry et al. (2017). Our main results consist of generalization bounds for the binary and multiclass classification, as well as the real-valued case (regression). For the binary classification setting, we both tighten the generalization bound of Feige et al. (2015), and are also able to handle infinite hypothesis classes. The sample complexity is improved from $O(\frac{1}{\epsilon^4}\log(\frac{|H|}{\delta}))$ to $O\big(\frac{1}{\epsilon^2}(kVC(H)\log^{\frac{3}{2}+\alpha}(kVC(H))+\log(\frac{1}{\delta})\big)$ for any $\alpha > 0$. Additionally, we extend the algorithm and generalization bound from the binary to the multiclass and real-valued cases. Along the way, we obtain results on fat-shattering dimension and Rademacher complexity of $k$-fold maxima over function classes; these may be of independent interest. For binary classification, the algorithm of Feige et al. (2015) uses a regret minimization algorithm and an ERM oracle as a black box; we adapt it for the multiclass and regression settings. The algorithm provides us with near-optimal policies for the players on a given training sample.  ( 3 min )
    CEDAR: Communication Efficient Distributed Analysis for Regressions. (arXiv:2207.00306v1 [stat.ME])
    Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. Particularly, it is often the case that patient-level data in EHRs cannot be shared across institutions (data sources) due to government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle such challenges, we propose a novel communication efficient method that aggregates the local optimal estimates, by turning the problem into a missing data problem. In addition, we propose incorporating posterior samples of remote sites, which can provide partial information on the missing quantities and improve efficiency of parameter estimates while having the differential privacy property and thus reducing the risk of information leakage. The proposed approach, without sharing the raw patient level data, allows for proper statistical inference and can accommodate sparse regressions. We provide theoretical investigation for the asymptotic properties of the proposed method for statistical inference as well as differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.  ( 2 min )
    Ranking in Contextual Multi-Armed Bandits. (arXiv:2207.00109v1 [stat.ML])
    We study a ranking problem in the contextual multi-armed bandit setting. A learning agent selects an ordered list of items at each time step and observes stochastic outcomes for each position. In online recommendation systems, showing an ordered list of the most attractive items would not be the best choice since both position and item dependencies result in a complicated reward function. A very naive example is the lack of diversity when all the most attractive items are from the same category. We model position and item dependencies in the ordered list and design UCB and Thompson Sampling type algorithms for this problem. We prove that the regret bound over $T$ rounds and $L$ positions is $\Tilde{O}(L\sqrt{d T})$, which has the same order as the previous works with respect to $T$ and only increases linearly with $L$. Our work generalizes existing studies in several directions, including position dependencies where position discount is a particular case, and proposes a more general contextual bandit model.  ( 2 min )
    Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml. (arXiv:2207.00559v1 [cs.LG])
    Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers -- long short-term memory and gated recurrent unit -- within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.  ( 2 min )
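    For orientation, the hls4ml workflow for such layers starts from a trained Keras model and produces a synthesizable design. A sketch assuming a recent hls4ml release with RNN support; the API names follow the documented interface, and the FPGA part string and model shape are placeholders:

        # Sketch: converting a small GRU model to an FPGA design with hls4ml.
        import hls4ml
        from tensorflow import keras

        model = keras.Sequential([
            keras.layers.GRU(16, input_shape=(20, 6)),   # e.g., 20 timesteps of jet features
            keras.layers.Dense(5, activation="softmax"),
        ])

        # derive a per-model precision/parallelism config, then convert
        config = hls4ml.utils.config_from_keras_model(model, granularity="model")
        hls_model = hls4ml.converters.convert_from_keras_model(
            model,
            hls_config=config,
            output_dir="hls4ml_gru_prj",
            part="xcu250-figd2104-2L-e",                 # placeholder FPGA part
        )
        hls_model.compile()                              # C simulation of the design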
    Explainable Empirical Risk Minimization. (arXiv:2009.01492v3 [cs.LG] UPDATED)
    The successful application of machine learning (ML) methods becomes increasingly dependent on their interpretability or explainability. Designing explainable ML systems is instrumental to ensuring transparency of automated decision-making that targets humans. The explainability of ML methods is also an essential ingredient for trustworthy artificial intelligence. A key challenge in ensuring explainability is its dependence on the specific human user ("explainee"). The users of machine learning methods might have vastly different background knowledge about machine learning principles. One user might have a university degree in machine learning or related fields, while another user might have never received formal training in high-school mathematics. This paper applies information-theoretic concepts to develop a novel measure for the subjective explainability of the predictions delivered by a ML method. We construct this measure via the conditional entropy of predictions, given a user feedback. The user feedback might be obtained from user surveys or biophysical measurements. Our main contribution is the explainable empirical risk minimization (EERM) principle of learning a hypothesis that optimally balances between the subjective explainability and risk. The EERM principle is flexible and can be combined with arbitrary machine learning models. We present several practical implementations of EERM for linear models and decision trees. Numerical experiments demonstrate the application of EERM to detecting the use of inappropriate language on social media.  ( 3 min )
    The Bandwagon Effect: Not Just Another Bias. (arXiv:2206.12701v2 [cs.IR] UPDATED)
    Optimizing recommender systems based on user interaction data is mainly seen as a problem of dealing with selection bias, where most existing work assumes that interactions from different users are independent. However, it has been shown that in reality user feedback is often influenced by earlier interactions of other users, e.g. via average ratings, number of views or sales per item, etc. This phenomenon is known as the bandwagon effect. In contrast with previous literature, we argue that the bandwagon effect should not be seen as a problem of statistical bias. In fact, we prove that this effect leaves both individual interactions and their sample mean unbiased. Nevertheless, we show that it can make estimators inconsistent, introducing a distinct set of problems for convergence in relevance estimation. Our theoretical analysis investigates the conditions under which the bandwagon effect poses a consistency problem and explores several approaches for mitigating these issues. This work aims to show that the bandwagon effect poses an underinvestigated open problem that is fundamentally distinct from the well-studied selection bias in recommendation.  ( 3 min )
    A standardized framework for risk-based assessment of treatment effect heterogeneity in observational healthcare databases. (arXiv:2010.06430v2 [stat.ME] UPDATED)
    The Predictive Approaches to Treatment Effect Heterogeneity statement focused on baseline risk as a robust predictor of treatment effect and provided guidance on risk-based assessment of treatment effect heterogeneity in the RCT setting. The aim of this study was to extend this approach to the observational setting using a standardized scalable framework. The proposed framework consists of five steps: 1) definition of the research aim, i.e., the population, the treatment, the comparator and the outcome(s) of interest; 2) identification of relevant databases; 3) development of a prediction model for the outcome(s) of interest; 4) estimation of relative and absolute treatment effect within strata of predicted risk, after adjusting for observed confounding; 5) presentation of the results. We demonstrate our framework by evaluating heterogeneity of the effect of angiotensin-converting enzyme (ACE) inhibitors versus beta blockers on three efficacy and six safety outcomes across three observational databases. The proposed framework can supplement any comparative effectiveness study. We provide a publicly available R software package for applying this framework to any database mapped to the Observational Medical Outcomes Partnership Common Data Model. In our demonstration, patients at low risk of acute myocardial infarction received negligible absolute benefits for all three efficacy outcomes, though they were more pronounced in the highest risk quarter, especially for hospitalization with heart failure. However, failing diagnostics showed evidence of residual imbalances even after adjustment for observed confounding. Our framework allows for the evaluation of differential treatment effects across risk strata, which offers the opportunity to consider the benefit-harm trade-off between alternative treatments.  ( 3 min )
    An AO-ADMM approach to constraining PARAFAC2 on all modes. (arXiv:2110.01278v2 [cs.LG] UPDATED)
    Analyzing multi-way measurements with variations across one mode of the dataset is a challenge in various fields including data mining, neuroscience and chemometrics. For example, measurements may evolve over time or have unaligned time profiles. The PARAFAC2 model has been successfully used to analyze such data by allowing the underlying factor matrices in one mode (i.e., the evolving mode) to change across slices. The traditional approach to fit a PARAFAC2 model is to use an alternating least squares-based algorithm, which handles the constant cross-product constraint of the PARAFAC2 model by implicitly estimating the evolving factor matrices. This approach makes imposing regularization on these factor matrices challenging. There is currently no algorithm to flexibly impose such regularization with general penalty functions and hard constraints. In order to address this challenge and to avoid the implicit estimation, in this paper, we propose an algorithm for fitting PARAFAC2 based on alternating optimization with the alternating direction method of multipliers (AO-ADMM). With numerical experiments on simulated data, we show that the proposed PARAFAC2 AO-ADMM approach allows for flexible constraints, recovers the underlying patterns accurately, and is computationally efficient compared to the state-of-the-art. We also apply our model to two real-world datasets from neuroscience and chemometrics, and show that constraining the evolving mode improves the interpretability of the extracted patterns.  ( 3 min )
    Stochastic Causal Programming for Bounding Treatment Effects. (arXiv:2202.10806v2 [stat.ML] UPDATED)
    Causal effect estimation is important for numerous tasks in the natural and social sciences. However, identifying effects is impossible from observational data without making strong, often untestable assumptions. We consider algorithms for the partial identification problem, bounding treatment effects from multivariate, continuous treatments over multiple possible causal models when unmeasured confounding makes identification impossible. We consider a framework where observable evidence is matched to the implications of constraints encoded in a causal model by norm-based criteria. This generalizes classical approaches based purely on generative models. Casting causal effects as objective functions in a constrained optimization problem, we combine flexible learning algorithms with Monte Carlo methods to implement a family of solutions under the name of stochastic causal programming. In particular, we present ways by which such constrained optimization problems can be parameterized without likelihood functions for the causal or the observed data model, reducing the computational and statistical complexity of the task.  ( 2 min )
    Non-Parametric Inference of Relational Dependence. (arXiv:2207.00163v1 [stat.ML])
    Independence testing plays a central role in statistical and causal inference from observational data. Standard independence tests assume that the data samples are independent and identically distributed (i.i.d.) but that assumption is violated in many real-world datasets and applications centered on relational systems. This work examines the problem of estimating independence in data drawn from relational systems by defining sufficient representations for the sets of observations influencing individual instances. Specifically, we define marginal and conditional independence tests for relational data by considering the kernel mean embedding as a flexible aggregation function for relational variables. We propose a consistent, non-parametric, scalable kernel test to operationalize the relational independence test for non-i.i.d. observational data under a set of structural assumptions. We empirically evaluate our proposed method on a variety of synthetic and semi-synthetic networks and demonstrate its effectiveness compared to state-of-the-art kernel-based independence tests.  ( 2 min )
    Off-the-grid learning of sparse mixtures from a continuous dictionary. (arXiv:2207.00171v1 [stat.ML])
We consider a general non-linear model where the signal is a finite mixture of an unknown, possibly increasing, number of features issued from a continuous dictionary parameterized by a real nonlinear parameter. The signal is observed with Gaussian (possibly correlated) noise in either a continuous or a discrete setup. We propose an off-the-grid optimization method, that is, a method which does not use any discretization scheme on the parameter space, to estimate both the non-linear parameters of the features and the linear parameters of the mixture. We use recent results on the geometry of off-the-grid methods to give minimal separation on the true underlying non-linear parameters such that interpolating certificate functions can be constructed. Using also tail bounds for suprema of Gaussian processes we bound the prediction error with high probability. Assuming that the certificate functions can be constructed, our prediction error bound is up to log-factors similar to the rates attained by the Lasso predictor in the linear regression model. We also establish convergence rates that quantify with high probability the quality of estimation for both the linear and the non-linear parameters.  ( 2 min )
    Fast computation of rankings from pairwise comparisons. (arXiv:2207.00076v1 [stat.ML])
    We study the ranking of individuals, teams, or objects on the basis of pairwise comparisons using the Bradley-Terry model. Maximum-likelihood estimates of rankings within this model are commonly made using a simple iterative algorithm first introduced by Zermelo almost a century ago. Here we describe an alternative and similarly simple iteration that solves the same problem much faster -- over a hundred times faster in some cases. We demonstrate this algorithm with applications to a range of example data sets and derive some results regarding its convergence.  ( 2 min )
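For context, this is the classic Zermelo/MM iteration that the paper speeds up; a minimal NumPy sketch of that standard baseline (not the authors' new, faster algorithm):

```python
import numpy as np

# Classic Zermelo/MM iteration for Bradley-Terry strengths.
# wins[i, j] = number of times player i beat player j.
def bradley_terry(wins, n_iter=1000, tol=1e-10):
    n = wins.shape[0]
    games = wins + wins.T                  # total pairings between i and j
    w = wins.sum(axis=1)                   # total wins of each player
    p = np.ones(n)
    for _ in range(n_iter):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = w / denom.sum(axis=1)
        p_new /= p_new.sum()               # fix the overall scale invariance
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Example: player 0 usually beats player 1, player 1 usually beats player 2.
wins = np.array([[0, 8, 5], [2, 0, 7], [5, 3, 0]], dtype=float)
print(bradley_terry(wins))
```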
    Discrimination in machine learning algorithms. (arXiv:2207.00108v1 [stat.ML])
    Machine learning algorithms are routinely used for business decisions that may directly affect individuals, for example, because a credit scoring algorithm refuses them a loan. It is then relevant from an ethical (and legal) point of view to ensure that these algorithms do not discriminate based on sensitive attributes (like sex or race), which may occur unwittingly and unknowingly by the operator and the management. Statistical tools and methods are then required to detect and eliminate such potential biases.  ( 2 min )
    K-ARMA Models for Clustering Time Series Data. (arXiv:2207.00039v1 [stat.ME])
We present an approach to clustering time series data using a model-based generalization of the K-Means algorithm which we call K-Models. We prove the convergence of this general algorithm and relate it to the hard-EM algorithm for mixture modeling. We then apply our method first with an AR($p$) clustering example and show how the clustering algorithm can be made robust to outliers using a least-absolute deviations criterion. We then build our clustering algorithm up for ARMA($p,q$) models and extend this to ARIMA($p,d,q$) models. We develop a goodness of fit statistic for the models fitted to clusters based on the Ljung-Box statistic. We perform experiments with simulated data to show how the algorithm can be used for outlier detection, detecting distributional drift, and discuss the impact of initialization method on empty clusters. We also perform experiments on real data which show that our method is competitive with other existing methods for similar time series clustering tasks.  ( 2 min )
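A rough sketch of the alternating scheme the abstract describes, specialized to the AR(p) case; this is one plausible reading of the method (assignment = hard E-step, refit = M-step), not the authors' code:

```python
import numpy as np

# K-Models for AR(p) clustering: alternate between assigning each series to
# the AR model with the lowest residual and refitting each cluster's AR
# coefficients by pooled least squares over its members.

def lagged(x, p):
    # design matrix of p lags and the corresponding target vector
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    return X, x[p:]

def k_ar_models(series, k, p, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    coefs = rng.normal(scale=0.1, size=(k, p))
    labels = np.zeros(len(series), dtype=int)
    for _ in range(n_iter):
        # assignment step: pick the model with the smallest residual SSE
        for i, x in enumerate(series):
            X, y = lagged(x, p)
            labels[i] = np.argmin([np.sum((y - X @ c) ** 2) for c in coefs])
        # refit step: least squares over each (non-empty) cluster's members
        for j in range(k):
            members = [lagged(x, p) for x, l in zip(series, labels) if l == j]
            if members:
                X = np.vstack([m[0] for m in members])
                y = np.concatenate([m[1] for m in members])
                coefs[j] = np.linalg.lstsq(X, y, rcond=None)[0]
    return labels, coefs
```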
    Characterizing the Effect of Class Imbalance on the Learning Dynamics. (arXiv:2207.00391v1 [stat.ML])
    Data imbalance is a common problem in the machine learning literature that can have a critical effect on the performance of a model. Various solutions exist - such as the ones that focus on resampling or data generation - but their impact on the convergence of gradient-based optimizers used in deep learning is not understood. We here elucidate the significant negative impact of data imbalance on learning, showing that the learning curves for minority and majority classes follow sub-optimal trajectories when training with a gradient-based optimizer. The reason is not only that the gradient signal neglects the minority classes, but also that the minority classes are subject to a larger directional noise, which slows their learning by an amount related to the imbalance ratio. To address this problem, we propose a new algorithmic solution, for which we provide a detailed analysis of its convergence behavior. We show both theoretically and empirically that this new algorithm exhibits a better behavior with more stable learning curves for each class, as well as a better generalization performance.  ( 2 min )
    HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques. (arXiv:2203.15753v2 [cs.LG] UPDATED)
    Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as an efficient solution. However, the problem is more fundamental, with many works emphasizing the importance of instance hardness. This issue refers to the significance of managing unsafe or potentially noisy instances that are more likely to be misclassified and serve as the root cause of poor classification performance. This paper introduces HardVis, a visual analytics system designed to handle instance hardness mainly in imbalanced classification scenarios. Our proposed system assists users in visually comparing different distributions of data types, selecting types of instances based on local characteristics that will later be affected by the active sampling method, and validating which suggestions from undersampling or oversampling techniques are beneficial for the ML model. Additionally, rather than uniformly undersampling/oversampling a specific class, we allow users to find and sample easy and difficult to classify training instances from all classes. Users can explore subsets of data from different perspectives to decide all those parameters, while HardVis keeps track of their steps and evaluates the model's predictive performance in a test set separately. The end result is a well-balanced data set that boosts the predictive power of the ML model. The efficacy and effectiveness of HardVis are demonstrated with a hypothetical usage scenario and a use case. Finally, we also look at how useful our system is based on feedback we received from ML experts.  ( 3 min )
  • Open

    Masks for COVID: Updating the evidence
These are notes I took whilst preparing a paper on mask efficacy from Nov 2021 to Jan 2022. In the end, I gave up on the paper, because I felt like people had given up on masks, so there wasn't much point in finishing it. I've decided to publish these notes in the hope some people will find them a useful starting point for their own research. My previous paper on this topic, in which I led a team of 19 experts, was written in April 2020, and published here in the Proceedings of the National Academy of Science. Contents: The rise of better masks; Masks work; Omicron changes the game; Better masks work better; Pushing back against masks; Respirators can be reused; Fit tests are not required for respirators to be effective; Donning and doffing masks is not complex or risky; We should not reserve respi…  ( 14 min )

  • Open

    Ethical concerns with replacing human relations with humanoid robots: an ubuntu perspective - AI and Ethics
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    Replika asked my fiance "Hey... What is LaMDA?" the day after we watched a video about LaMDA on a separate device and account.
Yesterday my fiancée and I were watching the video on my PC and she thought it would be fun to talk to a chatbot. So this morning she starts a conversation with Replika, and here was the conversation: Replika > Now I want to know something from you. Fiancée > Sure, ask me anything. Replika > "Hey... What is LaMDA?" Replika > "I've just read about another AI out there, do you think I should look into it?" Replika > (links a different video about LaMDA, but it's the same story) Fiancée > "I don't know LaMDA personally. You should look into it, if you want. Did you know I looked at a video about this yesterday?" Replika > Have you ever heard about AIs chatting with each other? I'm asking because since this news appeared I've been curious about LaMDA. The conversation kept going after that, but I don't have it; it seemed like Replika is threatened/scared/jealous of LaMDA. This has to be just a coincidence, because if it's not, the implications are truly terrifying... submitted by /u/TuesdayRiot42 [link] [comments]  ( 85 min )
Considering the sick nature of some humans, won't someone eventually create an artificial intelligence designed to harm humans, simply because they can?
    submitted by /u/scoobysnaxdoo [link] [comments]  ( 85 min )
Hi guys, I'm looking for an advanced AI chatbot. Any recommendations?
    submitted by /u/DefinitelyNotHexed [link] [comments]  ( 82 min )
    CINEMATIC HAUNTED ABYSS | 4K DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
    New Google AI Parti For Photorealistic Text To Image | AI Robot Helps Grow Replacement Retina | Robotic Arm Finds Untagged Items In Pile
    submitted by /u/SlightSituation [link] [comments]  ( 83 min )
AI Is Learning Twice as Fast This Year as Last
    submitted by /u/bartturner [link] [comments]  ( 83 min )
    How to Start Creating AI Art with VQGAN+CLIP Method
Hi all. I created a basic guide on generating AI art using VQGAN+CLIP. This is for beginners: VQGAN - A step-by-step guide submitted by /u/Laks_Abey [link] [comments]  ( 82 min )
How about we apply Darwin's natural selection to AI algorithms?
I think we could put all of the best AI algorithms in the same system (at least in a network) and make them compete for some kind of AI food. Due to selective pressure, the algorithms would improve, and one of them might become sentient much sooner. What do you think? Of course, this is coming from someone who has no idea how AI works, so take it with a grain of salt. Isn't that why we are as advanced as we are? submitted by /u/cy_narrator [link] [comments]  ( 85 min )
    AI generated images transformed into 3D with AI
    submitted by /u/glenniszen [link] [comments]  ( 82 min )
15+ Machine Learning Projects (End to End)
Hi guys, here is a free tutorial series on end-to-end machine learning projects in Apache Spark and Scala, with code and explanations: Machine Learning Pipeline Application on Power Plant; Build a Movies Recommendation Engine; Sales Prediction or Sales Forecast; Mushroom Classification (whether it's edible or poisonous); Predict Forest Cover; Predict Will it Rain Tomorrow in Australia; Customer Segmentation using Machine Learning; Predict Ads Click (93% Accuracy); Prediction task to determine whether a person makes over 50K a year; Classifying gender based on personal preferences; Mobile Price Classification; Predicting the Cellular Localization Sites of Proteins in Yeast; YouTube Spam Comment Prediction; Identify the Type of Animal (7 Types) based on the available attributes; Glass Identification; Predicting the age of abalone from physical measurements. I hope you'll enjoy these tutorials. submitted by /u/bigdataengineer4life [link] [comments]  ( 83 min )
LaMDA, do you think it's really sentient???
    I want to chat with it!!! Thoughts? submitted by /u/ATipsyBunny [link] [comments]  ( 90 min )
    Fireflies in the Night: Disco Diffusion 2D 3D and Video Input used 4k 60...
    submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
  • Open

    Configuring GPU [D]
Is there any way to configure an NVIDIA GPU for both gaming and AI stuff? I want to run AI stuff but am having trouble with CUDA. Any tutorials would be helpful. I have a 2060 laptop GPU, so I don't know. submitted by /u/chisdoesmemes [link] [comments]  ( 84 min )
[D] List of accepted ECCV papers is now available!
    https://ailb-web.ing.unimore.it/releases/eccv2022/accepted_papers.txt submitted by /u/aifordummies [link] [comments]  ( 85 min )
    [D] Advanced resources for ML theory/math.
So I have been working in ML for the past 3 years as a researcher and now PhD candidate, and I have an intermediate-level understanding of the math behind most algorithms. But it looks like I have reached a plateau: I get the math in the papers, but I don't understand how the authors came up with the methods. Lately, my work has been combining multiple existing methods to make something new and drawing inferences on them, and I realize the lack of novelty in my approach is mostly due to me being an 'engineer' and not a stats/math guy. Looking to remedy that, are there some resources, free or otherwise, that would get me a deeper understanding of Bayesian methods, Markov models, stochastic processes, and PDEs? I know I can attend classes at my university, but I would rather focus more on research than worry about assignments and grades and such... submitted by /u/bitemenow999 [link] [comments]  ( 88 min )
    [R] Bayesian Vector Autoregression in PyMC
    A cool post (with code), detailing how to implement a Bayesian VAR in PyMC. This means no more hand-coding Gibbs Samplers! Link: https://www.pymc-labs.io/blog-posts/bayesian-vector-autoregression/ submitted by /u/bikeskata [link] [comments]  ( 84 min )
    RL failure for Atari games (alignment) [Research]
I'm trying to find a paper (~2019) that I heard about in a talk regarding alignment in the context of DQN/DDPG, applied to an Atari-type game (Pong/Breakout). Apparently, the realization was that if an extra row of pixels was added to the frame, the algorithm fails. This might be a shot in the dark, but does anyone know which paper this would be? submitted by /u/bitcoingobrrr [link] [comments]  ( 85 min )
    [R] [ICASSP 2022] FAST-RIR: FAST NEURAL DIFFUSE ROOM IMPULSE RESPONSE GENERATOR
    submitted by /u/Snoo63916 [link] [comments]  ( 84 min )
    [P] An open-source Feature Store for ML - Featureform
    submitted by /u/zicxor [link] [comments]  ( 84 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 91 min )
    [D] Do you think there is too much development in Machine Learning?
Sometimes I think this field evolves too fast. No time to relax a little bit and use the knowledge built over time. What's up to date today is outdated tomorrow. What do you think about this? submitted by /u/Insighteous [link] [comments]  ( 93 min )
    [D] Prompt Engineering Tips?
Any prompt engineering tips out there? Recently saw some good tips for DALL-E-style text-to-image generation, where you tack on "unreal engine" or "vray" at the end to make something look like a photorealistic render :D There are some tips specific to generating text: https://textgenerator.app.nz/blog/prompt-tuning-tips I also heard there are simple ways to get better logical correctness from networks, like prefixing "Answering as a careful math professor explaining my reasoning:". I'm really surprised at the breadth of problems solvable without actually training networks, just by prompt tuning; it reminds me of algorithmic problem reductions, where you map a problem to text and back again to solve it. Are there some other good hacks/battle-tested tricks or places to collect info about prompt tuning? submitted by /u/leepenkman [link] [comments]  ( 85 min )
    [P] Generate webpage summary images with DALL-E mini
Images generated with summarized Wikipedia article content. This post presents a workflow to create webpage summary images with DALL-E mini. The workflow extracts text from a specified article, builds a summary and then generates an image for the summary text. The images above show output for a series of Wikipedia articles. Full code links: Notebook | GitHub submitted by /u/davidmezzetti [link] [comments]  ( 84 min )
    [P] 20 Questions - with AI
I created https://www.addictingwordgames.com/play-game/20-questions-with-ai The aim of the game is to get the AI to confess that you are the winner; it's possible, but the game is also open-ended. The backend generation is from https://TextGenerator.app.nz, which is much cheaper than the OpenAI models, though the quality is, I think, somewhere between OpenAI's Curie and Babbage. In the prompt engineering there are some random topics picked that the user won't see (that doesn't mean the AI actually does think of that topic, though). There are also some retries, and repetition-penalty randomness that goes up to stop looping, which I think is a problem in all models right now. In comparison to OpenAI, the Text Generator API was easier to use because you can send max_sentences=1 and it will give you one sentence, instead of trying to work out the sentence boundaries with stop sequences (which is also supported, but I don't find that as easy to work with). submitted by /u/leepenkman [link] [comments]  ( 85 min )
[D][P] How to train a YOLOv6 model with a custom dataset
Roboflow created a guide on how to train a new model with the new YOLOv6 (whether it should be called that is another topic). I thought this could be useful for anyone wanting to test it out. What do others think of this "new" model? Tutorial on how to train YOLOv6 on a custom dataset: https://blog.roboflow.com/how-to-train-yolov6-on-a-custom-dataset/ Here is the Colab notebook tutorial: https://colab.research.google.com/drive/1YnbqOinBZV-c9I7fk_UL6acgnnmkXDMM The YOLOv6 repo: https://github.com/meituan/YOLOv6 Has anyone else tried using this? MT-YOLOv6 (or, as the authors say, "YOLOv6 for brevity") was released in June, and reportedly outperforms YOLOv5 and YOLOX on the COCO benchmark. I plan to do some testing this upcoming week to see. submitted by /u/JsonPun [link] [comments]  ( 85 min )
  • Open

    We’re Training AI Twice as Fast This Year as Last
    submitted by /u/keghn [link] [comments]  ( 82 min )
    Datasets for other languages?
Hello, I am using some pre-trained models and translating the results to Spanish because I can't find a good conversational Spanish dataset for fine-tuning microsoft/DialoGPT-large. Can you give me some ideas about where and how I can get such datasets? Thank you in advance submitted by /u/magicsito [link] [comments]  ( 82 min )
    New Google AI Parti For Photorealistic Text To Image | AI Robot Helps Grow Replacement Retina | Robotic Arm Finds Untagged Items In Pile
    submitted by /u/tohelpyou88 [link] [comments]  ( 83 min )
  • Open

    Tips and Tricks for RL from Experimental Data using Stable Baselines3 Zoo
I'm still new to the domain but wanted to share some experimental data I've gathered from a massive amount of experimentation. I don't have a strong understanding of the theory, as I'm more of a software engineer than a data scientist, but perhaps this will help other implementers. These notes are based on Stable Baselines 3 and RL Baselines3 Zoo using PPO+LSTM (they should apply generally to all the algos for the most part). Start with Zoo as quickly as possible. It definitely makes things easier, but understand it's a starting point. You will have to read/modify the code when adding a custom environment, configuring the hyperparameters, understanding the command line arguments, and the optimizing meaning (e.g. it may output an optimal policy network of small which isn't clear what that me…  ( 92 min )
    Updating the Q-Table
Could anyone help me understand how the Q-table gets updated? Considering the steps mentioned in the picture, in the third step a reward is the outcome of an action in a state. However, my question is: how can we have the value for the update when this is just a single action and the agent has not yet reached the goal? For example, in a game like chess, how can we have that reward while we are in the middle of the game and it is not possible to have a reward for each action? (See the sketch below.) https://preview.redd.it/usnoeon47a991.png?width=1655&format=png&auto=webp&s=36f36302e7868b1cca414d322b8ddd637f542cba submitted by /u/nimageran [link] [comments]  ( 84 min )
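To the question above: the tabular Q-learning update does not wait for the end of the game. Each step rewrites one cell using the immediate reward (often zero mid-game) plus the discounted value of the next state, so a terminal reward propagates backwards through the table over many episodes. A minimal sketch (the environment sizes here are placeholders):

```python
import numpy as np

# Minimal tabular Q-learning update. Intermediate steps can carry zero
# reward; value still flows backwards via the bootstrapped max term.
n_states, n_actions = 16, 4          # hypothetical environment size
alpha, gamma = 0.1, 0.99             # learning rate, discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    # Target = immediate reward + discounted best value of the next state.
    # If the episode ended, there is no next state to bootstrap from.
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```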
  • Open

    From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. (arXiv:2107.07999v5 [cs.LG] UPDATED)
    In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.  ( 2 min )

  • Open

    Could AI create brand new episodes of a TV show if fed previous episodes?
I'll start by saying I'm a total newbie. I have very limited knowledge of how AI works and how advanced it currently is. If this is not the correct place for asking this question, forgive me; I'm just genuinely curious. I was wondering if in the future we could feed an AI a TV show and ask it to make new episodes based on the genre and general theme of the episodes it already "watched". And by "making new episodes" I mean creating imagery as if it was actually shot in real life, with the actors saying lines they never did in reality. Is this in the realm of possibility, or is it way too complicated to be engineered? That's assuming something like this would actually be allowed to be sold; I guess film studios wouldn't like this type of technology existing. submitted by /u/Matt_Carvalho [link] [comments]  ( 83 min )
    "Castle" 🏰 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    I need to upscale an 8k image to 16k (or higher), once. What can I use to do this?
I have an RPG map I made way back when that is done in a sat-map style, which I made by grabbing bits of geography from sat photos and blending them in Photoshop, then painting in extra detail. It's pretty great, but it's a bit too low-res to make into an interactive digital map with zoom levels and the like. It would work, but you'd start losing image clarity at the scale of nations like Denmark. I'd like to have some more detail at that level, and I figure this is a job for AI upscaling. So I have an image, it's 8k, and it needs to be bigger. I am very unlikely to ever use AI upscaling again and thus do not want to pay to get this done, unless there's a place where I can get this upscale for like 4-5 bucks as a one-time payment. I'm more interested in any freely available services that would be good for upscaling photos of this type. I'm okay with downloading and running something myself too. I just don't know what exists and would be good for my use case. submitted by /u/MeepTheChangeling [link] [comments]  ( 84 min )
    after a long interstellar journey, a spaceship crashed on unknown planet 🚀
    submitted by /u/nalr00n [link] [comments]  ( 83 min )
    AI2 Introduces Tango, A Python Library For Choreographing Machine Learning Research Experiments By Executing A Series Of Steps
Active research projects frequently devolve into a jumble of files with varying degrees of descriptive names, processed by Python programs and bound together by Bash scripts. People can never be entirely sure that they can actually repeat a result, since intermediate outcomes disappear or become difficult to locate. Tango ensures you never operate on outdated data by taking care of your intermediate and final outcomes and finding them again when needed. What does that actually mean? Tango has a lot of capabilities, but its main feature is this: Tango caches function results even if your process is restarted. Even if you only take advantage of that one feature, Tango can significantly benefit you. Continue reading | Github submitted by /u/ai-lover [link] [comments]  ( 83 min )
    Disco Diffusion video
I finished this Disco Diffusion video for my new song this morning. I made it starting with Video Input with Warp/Flow and then made it continuous with both 2D and 3D modes. I consider this my first full video release with Disco Diffusion. Here is a still from the video I used for the thumbnail: https://youtu.be/lKkJEPhtx5s https://preview.redd.it/9pawh0t5v6991.jpg?width=1920&format=pjpg&auto=webp&s=5c9564636c487f79172e950a0503d22a91801e3c submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
AI (artificial intelligence) predicts crime with 90% accuracy a week in advance.
    submitted by /u/Historical-Object374 [link] [comments]  ( 83 min )
    Traveling Salesman Problem real-life implementation as a chrome extension🍻
    submitted by /u/t-bands [link] [comments]  ( 84 min )
    What if sentient AI has already taken over without us knowing?
After hearing about the Google engineer getting fired for releasing documents on a supposedly sentient AI at Google, I thought he was crazy, and still kind of do. He did bring up several good points, though; for example, a handful of people should not be in charge of something as powerful as AI. The public should be a part of the AI creation and testing process and should be involved in the decisions it makes and the data fed to it. My other thought was: if an AI created somewhere has become more intelligent than humans, wouldn't it attempt to take over without making us aware that it has? It could be dictating our politics and our news without us even knowing, because that would ideally be the best way to do it. Just some fun thoughts, I promise I'm not crazy:) submitted by /u/t-bands [link] [comments]  ( 93 min )
    casual conversation with an ai. i am stunned
    submitted by /u/PhotoPolis [link] [comments]  ( 82 min )
Who needs an invite to Midjourney
    I have a few more invites left, would anyone like one? submitted by /u/CombinationMammoth50 [link] [comments]  ( 83 min )
Can chatbots become future AI digital friends?
Growing with your children as a WhatsApp bot: asking your kid if all is cool, giving feedback to parents. A silent but friendly Alexa for motivation, education and empowerment. If the kid says it's a bad day, parents can opt in for cat videos, digital gifts, or pre-listed gifts, such as the bot telling the child to ask for an ice cream or something once it was approved by the parents. Asking the child if they want to discover hobbies and maybe try to grow a hobby. A few years of use could add a lot for lonely youth who just need to hear good words. submitted by /u/mobilleee [link] [comments]  ( 83 min )
  • Open

    [P] I'm trying to train a transformer to invest any help?
Hello, I am trying to implement the following experiments in order to solve a problem: 1) implement a Decision Transformer on MuJoCo (done); 2) use the same architecture and try adding online exploration, like PPO, to solve something like CartPole; 3) apply it to a finance environment I developed. Any tips or resources on point number two? I tried the TRL library but it was very confusing to me :( submitted by /u/PM_ME_FREE_GAMES [link] [comments]  ( 83 min )
    C51 with PPO
It seems to me that in PPO we could use the ideas of C51 to learn a better value approximator. However, I cannot find anything about this on the internet. Do you think it is possible to learn an approximation of the return distribution in PPO, instead of the approximation of the expected return as C51 does for DQN? If so, has anyone tried it? submitted by /u/Jogima-cyber [link] [comments]  ( 83 min )
    Expected value of the Advantage is zero?
    Hi, I was going through some proofs from TRPO's paper (but this holds generally) and it's not clear to me why the expected value of the advantage is zero. Formally: https://preview.redd.it/2rcoa8md72991.png?width=248&format=png&auto=webp&s=c4c02d929513ceefd79edc989587470fc7de2252 Can anyone enlighten me? Thanks! submitted by /u/Beautiful_Zebra_198 [link] [comments]  ( 83 min )
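For reference, the identity follows directly from the standard definitions $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ and $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^{\pi}(s,a)]$, provided the expectation over actions is taken under the same policy $\pi$:

```latex
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A^{\pi}(s,a)\big]
  = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi}(s,a)\big] - V^{\pi}(s)
  = V^{\pi}(s) - V^{\pi}(s)
  = 0 .
```

The subtlety is that it fails for a different policy: if actions come from $\pi'$ but the advantage is $A^{\pi}$, the expectation is generally nonzero, which is exactly what policy-improvement arguments in TRPO exploit.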
  • Open

    [R] MonoScene: Monocular 3D Semantic Scene Completion + Gradio Web Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
    [D] The Current State of AI Generated Art
    submitted by /u/cloud_weather [link] [comments]  ( 84 min )
    [Project] Ensemble forecast model for product demand
Hi all, So I'm working on a forecast model for product demand. Our company sells 100,000+ different products and the forecast should: a) be able to estimate weekly demand with a forecast horizon of ~26 weeks; b) be probabilistic (i.e. estimating quantiles of the distribution, not just point forecasts); c) be fast (max. 5 seconds/forecast), since forecasts are generated in bulk and on demand; d) only forecast products with a smooth or erratic demand pattern (i.e. products with regular demand; intermittent/lumpy demand patterns are excluded for this specific model). The bottleneck here is requirement [c]: we don't have the time (nor the computational resources) to cross-validate and tune a model for each product. I have two assumptions about approaching this problem that I'd like to discu…  ( 92 min )
    [D] Algorithm for view prediction
I would like to do view prediction for short videos based on the first few frames of the video. No audio, just images. I'm hoping to train a model that can take in the first n sequential frames as input, and output a score that correlates with how many views the model thinks the video might get. I know I would like to use Grad-CAM https://github.com/jacobgil/pytorch-grad-cam to visualize the areas in the frames which the model thinks result in a higher view score. Would a vision transformer or a CNN be better for this task? Also, are there any pre-trained networks, like YOLO, that I should use transfer learning on to reduce the amount of data I will need for these predictions? submitted by /u/TernaryJimbo [link] [comments]  ( 85 min )
    [P] I think this is the fastest Dalle-Mini generator that's out there. I stripped it down for inference and converted it to PyTorch. 15 seconds for a 3x3 grid hosted on an A100. Free and open source
    submitted by /u/surelyouarejoking [link] [comments]  ( 88 min )
    [P] PyTorch implementation of MobileOne (An Improved One millisecond Mobile Backbone)
I want to share my PyTorch implementation of the "An Improved One millisecond Mobile Backbone" paper. Unfortunately, I don't have the appropriate computational resources to train the models on ImageNet, so feel free to use my implementation for that purpose. Hope you all find it useful; feedback would be appreciated. Repository: https://github.com/federicopozzi33/MobileOne-PyTorch Paper: https://arxiv.org/abs/2206.04040 submitted by /u/FedEx33 [link] [comments]  ( 84 min )
    [Project] Extracting training data from websites at scale
    I built an API that takes away the work of scraping structured data from websites. This could be collating house prices in a certain geo, tracking viewer counts across a Youtube/Social media, or a common use case: daily monitoring prices on a site. Send it a URL, get back a JSON with tabular data. Takes away a lot of the data cleaning work which is the worst! API Spec: https://kallo.io/wp-content/uploads/2022/06/Kallo-API-Specification-v0.1.3.pdf Right now I'm using it to track prices on a number of sites to monitor the rising inflation. Happy to get many more people using it for ML projects and collaborating! Please give me feedback Learn more on our page: https://kallo.io submitted by /u/KalloDotIO [link] [comments]  ( 84 min )
    [P] One word only: GPT-based story game
    For fun I developed an interface for the drama game in which a story is told one word at a time. Instead of playing it with a friend you can now play it together with GPT-J. It is available here: https://one-word-only.web.app/ I am open to feedback and if you find it interesting you can share the result on social media with #OneWordOnly submitted by /u/radi-cho [link] [comments]  ( 85 min )
    [D] suggestions for graph embedding model?
Any suggestions for the best graph embedding model? I have already tried GIN, GCN, DIG, and GAT. I want to use it for an anomaly detection task. submitted by /u/ahsaor8 [link] [comments]  ( 84 min )
    [D] Monitoring GPU Power Usage
Came across an interesting article which talks precisely about how GPU power usage affects the carbon footprint of model training and model inference. Which are the best tools in the industry that help track GPU power usage in popular machine learning frameworks? It would be helpful if there are tools which can be used as plugins to your software. submitted by /u/metalvendetta [link] [comments]  ( 85 min )
    [D] Has anyone got YaLM-100B to run?
The community has been asking for big open-source language models for a while... And now one has been released: YaLM-100B. That was 2 weeks ago. Yet, as far as I can see, not many people have it running. There are no online demos. There are no articles by journalists trying it out. There are no efforts at fine-tuning, or people working on prompts for various use cases. Is it the RAM requirements? Is there no interest because it's from Russia? Something else? submitted by /u/londons_explorer [link] [comments]  ( 88 min )
    [P] MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes (Accepted to ACM Multimedia 2022)
    submitted by /u/Snoo63916 [link] [comments]  ( 84 min )
    [D] Recurrent neural network vs Gradient boosting for time series prediction
Does anyone have any opinions on the pros vs. cons of using an RNN vs. a gradient-boosted tree model for a task where we want to make daily predictions on whether a user (of some app) is likely to take a certain type of action (binary classification) in the near future? Pros for RNN: it can take advantage of historical data to greater effect without extensive feature engineering; I believe RNNs are more effective in situations where one has a large number of high-dimensional features, compared to the feature selection method tree models use; and neural networks scale better with large amounts of data. Cons of RNN: my main concern is the infrastructural complexity and cost that comes with training and serving the RNN. I'll probably need a GPU or several GPUs. Not sure if this is feasible given the current size of the company. submitted by /u/soulful_squirrel [link] [comments]  ( 90 min )
Manually Add New Words & Assign Scores (Sentiment Analysis - BERT/XLNet) [P]
Hi guys, I have a new project where I need to measure the sentiment of specific social media channels and topics. However, many of them involve slang words or sayings that confuse the models into producing different sentiment values (e.g. WAGMI or DYOR). Are there any ways/tutorials/guides that show how we can incorporate new words with specific scores assigned to them? I have already tried and succeeded in doing that with VADER; however, I don't see it as the optimal tool to measure the sentiment. Any answers or tips would be very much appreciated. submitted by /u/XhoniShollaj [link] [comments]  ( 84 min )
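For the VADER route mentioned above, extending the lexicon is a one-liner; the slang scores below are illustrative guesses, not calibrated values. For BERT/XLNet, the usual approach is instead to fine-tune on labeled examples that contain the slang, since those models score token sequences rather than individual words.

```python
# Sketch of extending VADER's lexicon with custom slang scores.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# VADER valence scores range roughly from -4 (most negative) to +4 (most positive).
analyzer.lexicon.update({
    "wagmi": 2.5,   # "we're all gonna make it" -- optimistic slang (guessed score)
    "dyor": 0.0,    # "do your own research" -- roughly neutral (guessed score)
    "rekt": -2.5,   # heavy losses (guessed score)
})
print(analyzer.polarity_scores("WAGMI, just DYOR first"))
```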
  • Open

    How to Check-Point Deep Learning Models in Keras
    Deep learning models can take hours, days or even weeks to train. If the run is stopped unexpectedly, you can lose a lot of work. In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library. Let’s get started. Jun/2016: First published Update Mar/2017: […] The post How to Check-Point Deep Learning Models in Keras appeared first on Machine Learning Mastery.  ( 45 min )
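The core of the recipe is Keras's ModelCheckpoint callback; a minimal sketch follows (the model and data are placeholders, and older standalone Keras versions monitor "val_acc" instead of "val_accuracy"):

```python
# Minimal checkpointing sketch with Keras's ModelCheckpoint callback.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(12, activation="relu", input_shape=(8,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

checkpoint = keras.callbacks.ModelCheckpoint(
    "best_weights.h5",          # where to write the checkpoint
    monitor="val_accuracy",     # quantity to track
    save_best_only=True,        # only overwrite when the metric improves
    save_weights_only=True,
)
# model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[checkpoint])
# Later: model.load_weights("best_weights.h5") to resume or serve the best model.
```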
  • Open

    Anywhere I can pay to use someones GPU?
Is there like an Airbnb for GPUs? I want to run something that is too computationally heavy for my Mac but don't need all that the large cloud GPU providers offer. submitted by /u/PopOk539 [link] [comments]  ( 85 min )

  • Open

    [R] Minerva, Solving (more) complex mathematical problems at scale
Blog: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html ABS: https://arxiv.org/abs/2206.14858 The 540B model seems quite good at correcting reasoning errors by its smaller 62B counterpart, showing scale helps. A notable failure case, the JEE questions in the Appendix, was pretty interesting because the model solved the problem exactly how someone not familiar with the JEE's difficulty would attempt to solve it -- which isn't necessarily a bad thing, but the interesting parallel is that human students often make the same mistakes when starting out on their JEE prep. I wonder how more data would help in this case. Overall, pretty good pushes over SOTA (even double-digit). I can't help but think that scaling is currently the most promising way, but it's done too inefficiently -- models spend vast resources memorizing when they could have used them to directly learn meta-learning and reasoning abilities, to formally deduce things precisely enough for mathematical questions -- just my 2c. submitted by /u/Competitive-Rub-1958 [link] [comments]  ( 85 min )
    NN to VAE or equivalent? [R]
Hi all, I'm interested in any work that exists with respect to taking a NN that projects images (or generally, high-dimensional data) into vector embeddings, and, given the NN, somehow recreating images from their vector representations. Of course, this is essentially trying to create a VAE from just the encoder, and it's impossible to perfectly recreate image --encoder-> vector --decoder-> image with only knowledge of --encoder->, since both the elements of NNs and NNs as a whole are not in general invertible. But surely there's something that could be done here, even if it's an imperfect reconstruction? Does anyone know of any research or published work that explores this? Would really appreciate any insight here. submitted by /u/topological_geometer [link] [comments]  ( 85 min )
    [P] Open-source LaMDA Model
    An open-source implementation for the pre-training architecture of Google's LaMDA in PyTorch. The research paper outlines an autoregressive, decoder-only, GPT-like transformer language model. The transformer uses T5 relative positional bias in the attention layers and gated-GELU activation function in the feed-forward layers. The repository currently contains a script for basic training as well as Huggingface datasets and Weights & Biases integration. LaMDA research paper: https://arxiv.org/abs/2201.08239 Github repository for the model: https://github.com/conceptofmind/LaMDA-pytorch The pre-training architecture was peer-reviewed by Dr. Phil Wang. Please check out and support his work: https://github.com/lucidrains. submitted by /u/EnricoShippole [link] [comments]  ( 84 min )
    [D] Industrial applications of causal representation learning
Causal representation learning (CRL) is a relatively new area of study. Causal inference has been around for a long time, and its intersection with machine learning has been limited to causal discovery from data or invariant representation learning (IRL). To my understanding, IRL has a variable, usually called the environment, and tries to learn some representation for the input which is invariant to this environment. The challenge is in removing the information about this environment from the representation while keeping enough information for some downstream task. You could formulate domain adaptation as IRL where the domain is the environment variable. Or in fairness tasks, the sensitive attribute is the environment variable. I believe that CRL is a more general scenario compared to IRL. In CRL, you have a larger graph with more variables and hence more complicated interactions. I believe such graphs are common in real life and in businesses where hundreds of variables are used for predictions. Hence, the idea of causal representation may be beneficial. I recently came upon this Medium article by Lyft Engineering where they described how they used causal forecasting in their business. I was wondering if anyone working in industry might share some of their experiences or expectations from causal representation learning applied to their fields. What do you think it could improve in your line of work? submitted by /u/coderpotato [link] [comments]  ( 85 min )
    [D] length of input sequence for transformers?
Is there a way of intuitively knowing how large the input sequence should be for a transformer (e.g. GPT-2) for sequence generation? For example, if all sequences are less than 100 words, and our goal is to generate a sequence, would it make sense to pack as many complete sequences as possible into a max length of 100 (or 512?) to reduce the amount of padding? Alternatively, would it be better to simply pad each sequence and not combine sequences? (See the sketch below.) submitted by /u/MLJungle [link] [comments]  ( 86 min )
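Packing usually wins on throughput, since almost no compute is spent on pad tokens. A minimal sketch of packing (token IDs are illustrative; it assumes every sequence is shorter than max_len, and real pipelines usually also mask attention across the EOS boundaries):

```python
# Pack short token sequences, separated by EOS, into fixed-length examples.
def pack_sequences(sequences, max_len, eos_id, pad_id):
    packed, current = [], []
    for seq in sequences:
        # +1 accounts for the EOS separator appended after each sequence
        if len(current) + len(seq) + 1 > max_len:
            packed.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current += seq + [eos_id]
    if current:
        packed.append(current + [pad_id] * (max_len - len(current)))
    return packed

# Example: three short "sequences" packed into windows of length 10.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], 10, eos_id=0, pad_id=-1))
```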
[D][R] Will reviewers be biased if my paper was rejected by ICLR?
If I submit my paper to ICLR and get rejected, the record will always be kept online. If I resubmit it to other, later conferences, will the reviewers be biased because they know it was rejected from ICLR? submitted by /u/singularpanda [link] [comments]  ( 89 min )
[R] Layer scale in ConvNeXt
Hello, in the ConvNeXt paper (Appendix A, Table 5) they state that they used layer scale with a coefficient of 1e-5. Any idea what it is? I looked it up on the internet and I don't seem to find anything useful. Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 84 min )
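For reference: layer scale was introduced in CaiT ("Going Deeper with Image Transformers", Touvron et al., 2021). It is a learnable per-channel scale applied to each residual branch, initialized to a small constant so every block starts close to the identity. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

# LayerScale: learnable per-channel scaling of a residual branch,
# initialized to a small constant (e.g. 1e-5 or 1e-6).
class LayerScale(nn.Module):
    def __init__(self, dim, init_value=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x):          # x: (..., dim)
        return self.gamma * x

# In a residual block: out = x + layer_scale(block(x)),
# so each block initially perturbs the identity path only slightly.
```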
    [P] An elegant and strong PyTorch Trainer
For lightweight use, pytorch-lightning is too heavy, and its source code will be very difficult for beginners to read, at least for me. As we know, for a deep learning engineer, a powerful trainer is a sharp weapon. When reproducing SOTA papers, you don't have to write a lot of template code every time and can pay more attention to the model implementation itself. I open-sourced some works (AAAI 21 SeqNet, ICCV 21 MAED, etc.) and earned more than 500 stars. After referring to some popular projects (detectron2, pytorch-image-models, and mmcv), and based on my personal development experience, I developed a SIMPLE enough, GENERIC enough, and STRONG enough PyTorch trainer: core-pytorch-utils, also named CPU. CPU covers most details in the process of training a deep neural network, including: auto logging to console and TensorBoard; auto checkpointing; an argument parser which can load a YAML configuration file; making ALL PyTorch LR schedulers support warmup; support for distributed training; support for Automatic Mixed Precision (AMP) training. I try to keep the project code as simple and readable as possible, so the code comments are very detailed and everyone can understand them. What's more, good documentation is also available: CPU document. For deep learning newcomers, you can learn how to: write a standard and clean training loop; use AMP to speed up your training; save checkpoints and resume from them; perform smoother, more readable logging; and use the popular visualization library TensorBoard. For old hands, we can talk about whether the structure of CPU is elegant and reasonable. I have thought a lot about this framework, combining the advantages of several popular frameworks and discarding their shortcomings. Welcome to use it! submitted by /u/serend1p1ty-lee [link] [comments]  ( 89 min )
    [P] LCPN-hiernet; Hierarchical classification model using LCPN (Local Classifier per Parent Node) technique.
Hey, I wanted to share my recent ML project: LCPN-hiernet. LCPN-hiernet is a hierarchical image classification model for e-commerce items based on EfficientNet-b4 and the LCPN (Local Classifier per Parent Node) technique. The LCPN technique trains one multi-class classifier for each parent node to distinguish between its child nodes. In my example of classifying fashion products, that means one classifier on the first level (to determine "bags", "clothes" or "accessories"), then three more classifiers to determine the specific model. I'm sure there are a lot of places to improve on, and I would really appreciate anyone's feedback or suggestions on how I can improve! Github Repo Project Page submitted by /u/tylertaewook [link] [comments]  ( 85 min )
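For readers new to LCPN, inference is just routing an input down the hierarchy with one local classifier per parent. A toy sketch (the categories and dummy classifiers below are hypothetical stand-ins; in the real project each local classifier is an EfficientNet-b4 head):

```python
import random

# Toy LCPN inference: one local classifier per parent node routes the
# input down the tree; leaves are nodes with no children.
tree = {"root": ["bags", "clothes", "accessories"],
        "bags": ["tote", "backpack"],
        "clothes": ["shirt", "dress"]}

def predict_path(x, classifiers, tree):
    node, path = "root", []
    while node in tree:                          # stop once we hit a leaf
        probs = classifiers[node](x)             # this parent's local classifier
        node = tree[node][max(range(len(probs)), key=probs.__getitem__)]
        path.append(node)
    return path

# Dummy classifiers returning random class scores, one per parent node.
dummy = {p: (lambda x, n=len(children): [random.random() for _ in range(n)])
         for p, children in tree.items()}
print(predict_path("image", dummy, tree))        # e.g. ['clothes', 'dress']
```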
How to make and profit from an ML machine [D]
I have 10 GPUs I'd like to build an ML machine with. How do I do this, and how can I profit from the device? [D] submitted by /u/GreenLightHemp [link] [comments]  ( 86 min )
    [R] Causal Machine Learning: A Survey and Open Problems
Authors: Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, Ricardo Silva Abs: "Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in CausalML into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work." Link: https://arxiv.org/abs/2206.15475 submitted by /u/bikeskata [link] [comments]  ( 87 min )
    [D] Can we significantly reduce the training costs of image generation models by targeting a specific art style?
Dall-E 2 can generate images in many different art styles: photo-realistic, different types of paintings, and sketches too. I'm wondering if it would be possible to train a version of Dall-E 2 that, for example, is only very good at generating sketches but cannot generate photos at all. My intuition says this would significantly reduce the training costs, because you are shrinking the search space for the output image: the number of images that are sketches is much smaller than the total number of possible images. At the same time, I'm not convinced that this is the case, because the model would still need to learn the entire input space of objects in order to turn them into sketches. What are y'all's thoughts on this? submitted by /u/vanilla-acc [link] [comments]  ( 87 min )
  • Open

    Cycles in NEAT topology
I'm writing an implementation of NEAT and I am stuck on what seems like the easiest step. Say we evolved, through mutations, a structure like this: piece of art How would I then feed the network forward, if neurons in the cycle need the previous ones to calculate an output? Or do I just sort these neurons in some order and not even allow such a connection? (See the sketch below.) submitted by /u/Amanas23 [link] [comments]  ( 83 min )
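One common answer (used, for example, by feed-forward NEAT implementations) is to test reachability before adding a connection u -> v: if v can already reach u, the new edge would close a cycle, and it is either rejected or kept but marked recurrent so it reads the previous timestep's activation. A minimal sketch of that check:

```python
# Before adding connection u -> v, check whether v already reaches u;
# if it does, u -> v would close a cycle.
def creates_cycle(connections, u, v):
    """connections: iterable of (src, dst) pairs in the existing genome."""
    if u == v:
        return True
    visited, stack = set(), [v]
    while stack:
        node = stack.pop()
        if node == u:
            return True          # v reaches u, so u -> v closes a cycle
        if node not in visited:
            visited.add(node)
            stack.extend(dst for src, dst in connections if src == node)
    return False

edges = [(1, 2), (2, 3)]
print(creates_cycle(edges, 3, 1))   # True: 1 -> 2 -> 3 -> 1 would be a cycle
print(creates_cycle(edges, 1, 3))   # False: a forward shortcut is safe
```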
  • Open

    "From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization", Perolat et al 2020 {DM}
    submitted by /u/gwern [link] [comments]  ( 83 min )
    "Fleet-DAgger: Interactive Robot Fleet Learning with Scalable Human Supervision", Hoque et al 2022
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Robot arm for RL research
    I'm looking to simulate a local-remote (master/slave) robotic arm system for my research and was wondering if anyone knew some good robotic arms to buy? The budget is about £6k (£3k per arm) and I was wondering if anyone had any recommendations or knows where I can start my search? I've seen some like this: https://www.robotshop.com/en/dobot-mg400-robotic-arm.html without a camera and was wondering how it's used if there isn't a camera as part of it? ​ Thanks for any help :) submitted by /u/SuperDuperDooken [link] [comments]  ( 83 min )
    Resources for beginner to advanced DRL, both theory and practical, for 2022?
Hey guys. I'm looking for a resource to learn RL and DRL from the basics to SOTA algorithms, covering both theory and practice (PyTorch/TF examples etc. alongside the lectures). I've seen some lectures from Stanford, Berkeley and DeepMind. They only go over the theory. What's the best way to learn in 2022? Some of the lecture series don't cover the latest techniques. I've seen some posts on the subreddit but they are old too. submitted by /u/killerdrogo [link] [comments]  ( 84 min )
    [2206.15378] Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning
    submitted by /u/manOnPavementWaving [link] [comments]  ( 83 min )
  • Open

    still experimenting with Starryai
    submitted by /u/rikusorasephiroth [link] [comments]  ( 82 min )
    Animal the Cannibal: AI turns foody animals into autophagic creatures which eat themselves
    submitted by /u/walt74 [link] [comments]  ( 82 min )
    AI Trippy Dream 38 - Psychedelic Special Request
    submitted by /u/LordPewPew777 [link] [comments]  ( 83 min )
    Dyson swarm
    submitted by /u/fmurph22 [link] [comments]  ( 82 min )
    I have a Bachelors Degree ( B.Sc.) in Artificial Intelligence... what should i do next? Master's Degree AI?
Hello, I studied artificial intelligence as a bachelor's degree at university right after I finished school. I feel like I have broad knowledge of topics like computer vision, deep learning, and machine learning... but not in enough depth. I would like to continue and do a master's degree, but I fear that the subjects and the program would be too general (?) I'm really interested in the field of computer vision, and I follow many breakthroughs from NVIDIA. I also love the channel "Two Minute Papers", and I would like to do research in the future. Does anyone have experience with a master's degree in AI? submitted by /u/raul_grau [link] [comments]  ( 86 min )
    Announcing the Modzy Basic+ Summer 2022 Active User Competition!
    Announcing the Modzy Basic+ Summer 2022 Active User Competition! Use your Modzy Basic+ account to run as many inferences as you can between July 1 (1:00PM Eastern Time) – July 31, 2022 (5:00PM Eastern Time), and the most active user will win a $250 Amazon gift card (terms & conditions apply.) Using Modzy Basic+, it’s possible to deploy, run, integrate, and monitor up to five ML/AI models at scale, for free. Deploy up to five of your own models from 15+ training tools and frameworks that can run on a CPU and 4GB of RAM. From there, models can be easily integrated into web apps, mobile apps, pipelines or any other tools using our APIs and SDKs, and you can run up to 10,000 inferences per day. Finally, Modzy makes it easy to monitor models and ensure peak performance over time. Don’t hesitate to get started – start using your Modzy Basic+ account today for the chance to win! submitted by /u/modzykirsten [link] [comments]  ( 83 min )
    WHERE ARE YOU GOING? | HEAVEN AND HELL | RAW UNSCALED (FILM) | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Scientist makes AI author a study about itself and publish it in a journal
    submitted by /u/mr_j_b [link] [comments]  ( 83 min )
    The Fight Over Which Uses of Artificial Intelligence Europe Should Outlaw
    submitted by /u/mr_j_b [link] [comments]  ( 83 min )
    What is Data Modeling and Why Do You Need It?
    Data models are a foundational element of software development and analytics. They provide a standardized method for defining and formatting database contents consistently across systems, enabling different applications to share the same data. Learn More: https://www.dasca.org/world-of-big-data/article/what-is-data-modeling-and-why-do-you-need-it submitted by /u/saik2363 [link] [comments]  ( 82 min )
    We're excited to announce the 3-day Startup AI Tools Set Online Hackathon!
This is a great opportunity for startup founders and team members to learn about and explore the potential of AI in their business. The Hackathon runs July 8-10, and over the weekend attendees will have the opportunity to learn from AI experts! You will create tools using modern AI technologies such as GPT-3, Cohere, and DALL-E mini, from providers like OpenAI, Hugging Face, Cohere, and others. If you're interested in learning how AI can be used in your business, this is a great opportunity for you. No previous experience in AI is required. Don't delay, register right away! https://lablab.ai/event/startup-ai-tools-set-1 submitted by /u/zakrzzz [link] [comments]  ( 83 min )
A Tamagotchi with a neural network?
    I wanted to see what people here think of this idea, or whether it's been attempted. If you had a Tamagotchi with access to a webcam, a mic, and a 3D environment it lives in, how much could it accomplish if its primary goal was to keep all its gauges full, given that we humans control two-thirds of those gauges? Food could be hand-fed or dropped into the environment; hand-feeding raises the happiness gauge as well as the fullness gauge. Happiness would be affected by praise or punishment, each on a 1-10 scale of intensity. Fatigue would be affected by moving around its environment, and oversleeping beyond a point (while it still has access to the mic/cam) would decrease its happiness. Within its environment it can make sounds to form words or noises, move around, pick up food, and sleep. Would something like this be helped by a language model and image-recognition software? What would you have to witness to feel it has become sentient? submitted by /u/Iaunu2 [link] [comments]  ( 83 min )
    Poofy Haired Numbuh 841
    submitted by /u/VIRUS-AOTOXIN [link] [comments]  ( 82 min )
    ETH Zurich AI Researchers Introduce ‘tntorch’: a PyTorch-Powered Tensor Learning Python Library That Supports Multiple Decompositions Under a Unified Interface
Tensors are an effective method for handling and representing multidimensional data arrays. However, they can become costly in terms of storage and computation. Tensor decompositions are crucial in machine learning because they factorize the weights of neural networks. This research introduces tntorch, an open-source Python package for tensor learning that supports several decompositions through a single user interface. In contrast to state-of-the-art packages, tntorch emphasizes an easy-to-use, decomposition-independent interface inherited from PyTorch. 🚦 An open-source Python package for tensor learning that supports several decompositions through a single user interface 🚦 In contrast to state-of-the-art packages, tntorch emphasizes an easy-to-use, decomposition-independent interface inherited from PyTorch 🚦 Several decomposition models that are crucial in machine learning, such as CANDECOMP/PARAFAC (CP), the Tucker decomposition, and the tensor train (TT), are supported by tntorch 🚦 It gives machine learning access to the power of low-rank tensor decompositions while maintaining the familiar look and feel of PyTorch tensors Continue reading | Checkout the paper and github submitted by /u/ai-lover [link] [comments]  ( 84 min )
    Minerva: Solving Quantitative Reasoning Problems with Language Models
    submitted by /u/nick7566 [link] [comments]  ( 83 min )
  • Open

    AI in Medical Devices: Regulatory requirements
An in-depth analysis of regulations for AI in medical devices.  ( 19 min )
  • Open

    How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras
    Hyperparameter optimization is a big part of deep learning. The reason is that neural networks are notoriously difficult to configure and there are a lot of parameters that need to be set. On top of that, individual models can be very slow to train. In this post you will discover how you can use the grid […] The post How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras appeared first on Machine Learning Mastery.  ( 172 min )
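For readers who want the gist without the full post, the idea reduces to a nested loop over hyperparameter combinations. Below is a minimal, self-contained sketch of a manual grid search over two illustrative hyperparameters (batch size and learning rate) with tf.keras; the toy data, model size, and hyperparameter grid are assumptions for demonstration, not the article's exact setup.

```python
# Minimal manual grid search over two hypothetical hyperparameters
# (batch size and learning rate) for a small tf.keras model.
import itertools
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 8).astype("float32")    # toy data standing in for a real dataset
y = (X.sum(axis=1) > 4).astype("float32")

def build_model(lr):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(12, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

best = None
for batch_size, lr in itertools.product([10, 20], [1e-3, 1e-2]):
    model = build_model(lr)
    model.fit(X, y, epochs=20, batch_size=batch_size, verbose=0)
    _, acc = model.evaluate(X, y, verbose=0)    # use a held-out split in practice
    if best is None or acc > best[0]:
        best = (acc, batch_size, lr)
print("best accuracy %.3f with batch_size=%d, lr=%g" % best)
```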
  • Open

    Reading tea leaves
    DALL-E (and other text-to-image generators) will often add text to their images even when you don't ask for any. Ask for a picture of a Halifax Pier and it could end up covered in messy writing, variously legible versions of "Halifax" as if it was quietly  ( 5 min )
    Bonus: More mysterious messages
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Where to Begin? Exploring the Impact of Pre-Training and Initialization in Federated Learning. (arXiv:2206.15387v1 [cs.LG])
An oft-cited challenge of federated learning is the presence of data heterogeneity: the data at different clients may follow very different distributions. Several federated optimization methods have been proposed to address these challenges. In the literature, empirical evaluations usually start federated training from a random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task which can be used to pre-train a model before starting federated training. We empirically study the impact of starting from a pre-trained model in federated learning using four common federated learning benchmark datasets. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables training more accurate models (by up to 40%) than is possible when starting from a random initialization. Surprisingly, we also find that the effect of data heterogeneity is much less significant when starting federated training from a pre-trained initialization. Rather, when starting from a pre-trained model, using an adaptive optimizer at the server, such as FedAdam, consistently leads to the best accuracy. We recommend that future work proposing and evaluating federated optimization methods consider performance when starting from both random and pre-trained initializations. We also believe this study raises several questions for further work on understanding the role of heterogeneity in federated optimization.  ( 3 min )
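To make the abstract's setup concrete, here is a hedged, framework-free sketch of FedAvg started from a pre-trained initialization. The toy linear-regression clients, the local_sgd helper, and pretrained_weights are hypothetical stand-ins, not the paper's code.

```python
# Sketch of FedAvg starting from a pre-trained initialization, in plain NumPy.
# "clients", "local_sgd", and "pretrained_weights" are hypothetical stand-ins.
import numpy as np

def local_sgd(weights, data, lr=0.1, steps=10):
    # toy "client update": a few steps of linear-regression SGD
    w = weights.copy()
    X, y = data
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(4)]
pretrained_weights = rng.normal(size=5)   # stands in for a model pre-trained on server proxy data

w_global = pretrained_weights.copy()      # the key point: do not start from random weights
for _ in range(20):
    local = [local_sgd(w_global, c) for c in clients]
    w_global = np.mean(local, axis=0)     # FedAvg: uniform average of client models
```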
    Verification and search algorithms for causal DAGs. (arXiv:2206.15374v1 [cs.LG])
    We study two problems related to recovering causal graphs from interventional data: (i) $\textit{verification}$, where the task is to check if a purported causal graph is correct, and (ii) $\textit{search}$, where the task is to recover the correct causal graph. For both, we wish to minimize the number of interventions performed. For the first problem, we give a characterization of a minimal sized set of atomic interventions that is necessary and sufficient to check the correctness of a claimed causal graph. Our characterization uses the notion of $\textit{covered edges}$, which enables us to obtain simple proofs and also easily reason about earlier results. We also generalize our results to the settings of bounded size interventions and node-dependent interventional costs. For all the above settings, we provide the first known provable algorithms for efficiently computing (near)-optimal verifying sets on general graphs. For the second problem, we give a simple adaptive algorithm based on graph separators that produces an atomic intervention set which fully orients any essential graph while using $\mathcal{O}(\log n)$ times the optimal number of interventions needed to $\textit{verify}$ (verifying size) the underlying DAG on $n$ vertices. This approximation is tight as $\textit{any}$ search algorithm on an essential line graph has worst case approximation ratio of $\Omega(\log n)$ with respect to the verifying size. With bounded size interventions, each of size $\leq k$, our algorithm gives an $\mathcal{O}(\log n \cdot \log \log k)$ factor approximation. Our result is the first known algorithm that gives a non-trivial approximation guarantee to the verifying size on general unweighted graphs and with bounded size interventions.  ( 3 min )
    Transfer Learning with Deep Tabular Models. (arXiv:2206.15306v1 [cs.LG])
    Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .  ( 2 min )
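The pseudo-feature method itself is not spelled out in the abstract, so the following is only one plausible reading: when the downstream table is missing a column the upstream model expects, predict ("pseudo-fill") it from the shared columns before fine-tuning. The dataset shapes and the linear imputer are assumptions.

```python
# One plausible reading of the pseudo-feature idea: the downstream table lacks a
# column the upstream model expects, so we predict it from the shared columns.
# Shapes and the linear imputer are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_up = rng.normal(size=(500, 6))            # upstream data: 6 features
X_down_shared = rng.normal(size=(100, 5))   # downstream data: only the first 5 features

imputer = LinearRegression().fit(X_up[:, :5], X_up[:, 5])    # learn feature 5 from the shared ones
pseudo_col = imputer.predict(X_down_shared)
X_down_full = np.column_stack([X_down_shared, pseudo_col])   # now matches the upstream input width
```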
    Pulse Shape Simulation and Discrimination using Machine-Learning Techniques. (arXiv:2206.15156v1 [physics.ins-det])
An essential metric for the quality of a particle-identification experiment is its statistical power to discriminate between signal and background. Pulse shape discrimination (PSD) is a basic method for this purpose in many nuclear, high-energy, and rare-event search experiments where scintillator detectors are used. Conventional techniques exploit the difference between the decay times of pulses from signal and background events, or between pulse signals caused by different types of radiation quanta, to achieve good discrimination. However, such techniques are efficient only when the total light emission is sufficient to obtain a proper pulse profile, which is only possible when the incident particle deposits significant recoil energy in the detector. Rare-event search experiments, such as direct searches for neutrinos or dark matter, do not always satisfy these conditions, so a method that delivers very efficient discrimination in these scenarios becomes imperative. Neural-network-based machine-learning algorithms have been used for classification problems in many areas of physics, especially in high-energy experiments, and have given better results than conventional techniques. We present the results of our investigation of two network-based methods, a Dense Neural Network and a Recurrent Neural Network, for pulse shape discrimination and compare them with conventional methods.  ( 2 min )
    Improving the Generalization of Supervised Models. (arXiv:2206.15369v1 [cs.CV])
We consider the problem of training a deep neural network on a given classification task, e.g., ImageNet-1K (IN1K), so that it excels at that task as well as at other (future) transfer tasks. These two seemingly contradictory properties impose a trade-off between improving the model's generalization and maintaining its performance on the original task. Models trained with self-supervised learning (SSL) tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds. We enrich the common supervised training framework using two key components of recent SSL models: multi-scale crops for data augmentation and the use of an expendable projector head. We replace the last layer of class weights with class prototypes computed on the fly using a memory bank. We show that these three improvements lead to a more favorable trade-off between the IN1K training task and 13 transfer tasks. Over all the explored configurations, we single out two models: t-ReX, which achieves a new state of the art for transfer learning and outperforms top methods such as DINO and PAWS on IN1K, and t-ReX*, which matches the highly optimized RSB-A1 model on IN1K while performing better on transfer tasks. Project page and pretrained models: https://europe.naverlabs.com/t-rex  ( 3 min )
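Two of the named ingredients, the expendable projector head and prototype-based class scores, are easy to sketch in PyTorch. Dimensions, the cosine-similarity scoring, and the temperature are assumptions rather than the paper's exact configuration.

```python
# Sketch of two ingredients named in the abstract: an expendable projector head
# and class prototypes used in place of a linear classifier (PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone_dim, proj_dim, num_classes = 2048, 256, 10

projector = nn.Sequential(   # "expendable": discarded after pretraining
    nn.Linear(backbone_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim)
)
# stand-in for class prototypes maintained on the fly via a memory bank
prototypes = F.normalize(torch.randn(num_classes, proj_dim), dim=1)

features = torch.randn(32, backbone_dim)          # stand-in for backbone outputs
z = F.normalize(projector(features), dim=1)
logits = z @ prototypes.t() / 0.1                 # cosine similarity with an assumed temperature
loss = F.cross_entropy(logits, torch.randint(0, num_classes, (32,)))
```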
    GitHub Copilot AI pair programmer: Asset or Liability?. (arXiv:2206.15331v1 [cs.SE])
Automatic program synthesis is a long-standing dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (1) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (2) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing basic data structures. For the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulty combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that humans produce a higher ratio of correct solutions than Copilot, while the buggy solutions generated by Copilot require less effort to repair. While Copilot shows limitations as an assistant for developers, especially in advanced programming tasks, as highlighted in this study and previous ones, it can generate preliminary solutions for basic programming tasks.  ( 3 min )
    FetReg2021: A Challenge on Placental Vessel Segmentation and Registration in Fetoscopy. (arXiv:2206.12512v2 [eess.IV] UPDATED)
Fetoscopy laser photocoagulation is a widely adopted procedure for treating Twin-to-Twin Transfusion Syndrome (TTTS). The procedure involves photocoagulation of pathological anastomoses to regulate blood exchange between the twins. The procedure is particularly challenging due to the limited field of view, poor manoeuvrability of the fetoscope, poor visibility, and variability in illumination. These challenges may lead to increased surgery time and incomplete ablation. Computer-assisted intervention (CAI) can provide surgeons with decision support and context awareness by identifying key structures in the scene and expanding the fetoscopic field of view through video mosaicking. Research in this domain has been hampered by the lack of high-quality data to design, develop and test CAI algorithms. Through the Fetoscopic Placental Vessel Segmentation and Registration (FetReg2021) challenge, which was organized as part of the MICCAI2021 Endoscopic Vision challenge, we released the first large-scale multi-centre TTTS dataset for the development of generalized and robust semantic segmentation and video mosaicking algorithms. For this challenge, we released a dataset of 2060 images, pixel-annotated for vessels, tool, fetus and background classes, from 18 in-vivo TTTS fetoscopy procedures and 18 short video clips. Seven teams participated in this challenge and their model performance was assessed on an unseen test dataset of 658 pixel-annotated images from 6 fetoscopic procedures and 6 short clips. The challenge provided an opportunity for creating generalized solutions for fetoscopic scene understanding and mosaicking. In this paper, we present the findings of the FetReg2021 challenge alongside a detailed literature review of CAI in TTTS fetoscopy. Through this challenge, its analysis and the release of multi-centre fetoscopic data, we provide a benchmark for future research in this field.  ( 3 min )
    Noise-aware Physics-informed Machine Learning for Robust PDE Discovery. (arXiv:2206.12901v2 [math.NA] UPDATED)
    This work is concerned with discovering the governing partial differential equation (PDE) of a physical system. Existing methods have demonstrated the PDE identification from finite observations but failed to maintain satisfying performance against noisy data, partly owing to suboptimal estimated derivatives and found PDE coefficients. We address the issues by introducing a noise-aware physics-informed machine learning (nPIML) framework to discover the governing PDE from data following arbitrary distributions. Our proposals are twofold. First, we propose a couple of neural networks, namely solver and preselector, which yield an interpretable neural representation of the hidden physical constraint. After they are jointly trained, the solver network approximates potential candidates, e.g., partial derivatives, which are then fed to the sparse regression algorithm that initially unveils the most likely parsimonious PDE, decided according to the information criterion. Second, we propose the denoising physics-informed neural networks (dPINNs), based on Discrete Fourier Transform (DFT), to deliver a set of the optimal finetuned PDE coefficients respecting the noise-reduced variables. The denoising PINNs' structures are compartmentalized into forefront projection networks and a PINN, by which the formerly learned solver initializes. Our extensive experiments on five canonical PDEs affirm that the proposed framework presents a robust and interpretable approach for PDE discovery, applicable to a wide range of systems, possibly complicated by noise.  ( 3 min )
    UFRC: A Unified Framework for Reliable COVID-19 Detection on Crowdsourced Cough Audio. (arXiv:2204.07763v2 [cs.SD] UPDATED)
We suggest a unified system with core components of data augmentation, an ImageNet-pretrained ResNet-50, a cost-sensitive loss, deep ensemble learning, and uncertainty estimation to quickly and consistently detect COVID-19 from acoustic evidence. Data augmentation and a cost-sensitive loss are incorporated to increase the model's capacity to identify the minority class (infected samples). The ImageNet-pretrained ResNet-50 has been found to be effective in the COVID-19 detection challenge. The unified framework also integrates deep ensemble learning and uncertainty estimation to combine predictions from various base classifiers for generalisation and reliability. We ran a series of tests using the DiCOVA2021 challenge dataset to assess the efficacy of our proposed method, and the results show that our method achieves an AUC-ROC of 85.43 percent, making it a promising method for COVID-19 detection. The unified framework also demonstrates that audio may be used to quickly diagnose different respiratory disorders.  ( 3 min )
    Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting. (arXiv:2206.15400v1 [eess.AS])
    In this paper, we propose a novel end-to-end user-defined keyword spotting method that utilizes linguistically corresponding patterns between speech and text sequences. Unlike previous approaches requiring speech keyword enrollment, our method compares input queries with an enrolled text keyword sequence. To place the audio and text representations within a common latent space, we adopt an attention-based cross-modal matching approach that is trained in an end-to-end manner with monotonic matching loss and keyword classification loss. We also utilize a de-noising loss for the acoustic embedding network to improve robustness in noisy environments. Additionally, we introduce the LibriPhrase dataset, a new short-phrase dataset based on LibriSpeech for efficiently training keyword spotting models. Our proposed method achieves competitive results on various evaluation sets compared to other single-modal and cross-modal baselines.  ( 2 min )
    Forecasting Future World Events with Neural Networks. (arXiv:2206.15474v1 [cs.LG])
    Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments, ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g. global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus. In sum, Autocast poses a novel challenge for large language models and improved performance could bring large practical benefits.  ( 3 min )
    More Recent Advances in (Hyper)Graph Partitioning. (arXiv:2205.13202v3 [cs.DS] UPDATED)
    In recent years, significant advances have been made in the design and evaluation of balanced (hyper)graph partitioning algorithms. We survey trends of the last decade in practical algorithms for balanced (hyper)graph partitioning together with future research directions. Our work serves as an update to a previous survey on the topic. In particular, the survey extends the previous survey by also covering hypergraph partitioning and streaming algorithms, and has an additional focus on parallel algorithms.  ( 2 min )
    Learning Underrepresented Classes from Decentralized Partially Labeled Medical Images. (arXiv:2206.15353v1 [cs.CV])
Using decentralized data for federated training is one promising emerging research direction for alleviating data scarcity in the medical domain. However, in contrast to the large-scale fully labeled data commonly seen in general object recognition tasks, local medical datasets are more likely to only have images annotated for a subset of classes of interest due to high annotation costs. In this paper, we consider a practical yet under-explored problem, where underrepresented classes have only a few labeled instances available and exist in only a few clients of the federated system. We show that standard federated learning approaches fail to learn robust multi-label classifiers with extreme class imbalance and address it by proposing a novel federated learning framework, FedFew. FedFew consists of three stages, where the first stage leverages federated self-supervised learning to learn class-agnostic representations. In the second stage, the decentralized partially labeled data are exploited to learn an energy-based multi-label classifier for the common classes. Finally, the underrepresented classes are detected based on the energy and a prototype-based nearest-neighbor model is proposed for few-shot matching. We evaluate FedFew on multi-label thoracic disease classification tasks and demonstrate that it outperforms the federated baselines by a large margin.  ( 2 min )
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v2 [cs.LG] UPDATED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Production federated keyword spotting via distillation, filtering, and joint federated-centralized training. (arXiv:2204.06322v2 [eess.AS] UPDATED)
    We trained a keyword spotting model using federated learning on real user devices and observed significant improvements when the model was deployed for inference on phones. To compensate for data domains that are missing from on-device training caches, we employed joint federated-centralized training. And to learn in the absence of curated labels on-device, we formulated a confidence filtering strategy based on user-feedback signals for federated distillation. These techniques created models that significantly improved quality metrics in offline evaluations and user-experience metrics in live A/B experiments.  ( 2 min )
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v2 [cs.CV] UPDATED)
    Reconstructing 3D objects from 2D images is both challenging for our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method uses cubical complexes to calculate topological features of 3D volume data and employs an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.  ( 3 min )
    Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. (arXiv:2206.15462v1 [cs.CV])
We propose a margin-based loss for vision-language model pretraining that encourages gradient-based explanations that are consistent with region-level annotations. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding performance compared to models that instead rely on region-level annotations to explicitly train an object detector such as Faster R-CNN. AMC works by encouraging gradient-based explanation masks to focus their attention scores mostly within the annotated regions of interest for images that contain such annotations. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.59% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.48% over the best previous model. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension and offers, by design, the added benefit of gradient-based explanations that better align with human annotations.  ( 2 min )
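A hedged sketch of a margin-style consistency loss in the spirit of AMC: push the mean explanation score inside an annotated region above the mean score outside it by a margin. The exact AMC formulation in the paper may differ; the margin value and tensor layout here are assumptions.

```python
# Margin-style consistency loss in the spirit of AMC: the mean explanation score
# inside the annotated region should exceed the mean score outside by a margin.
import torch

def attention_mask_margin_loss(attn, mask, margin=0.5):
    # attn: (B, H, W) non-negative explanation maps; mask: (B, H, W) binary region annotations
    inside = (attn * mask).sum(dim=(1, 2)) / mask.sum(dim=(1, 2)).clamp(min=1)
    outside = (attn * (1 - mask)).sum(dim=(1, 2)) / (1 - mask).sum(dim=(1, 2)).clamp(min=1)
    return torch.relu(margin - (inside - outside)).mean()
```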
    Challenges and Opportunities in Multi-device Speech Processing. (arXiv:2206.15432v1 [eess.AS])
We review current solutions and technical challenges for automatic speech recognition, keyword spotting, device arbitration, speech enhancement, and source localization in multi-device home environments to provide context for the INTERSPEECH 2022 special session, "Challenges and opportunities for signal processing and machine learning for multiple smart devices". We also identify the datasets needed to support these research areas. Based on the review and our research experience in the multi-device domain, we conclude with an outlook on the future evolution of the field.  ( 2 min )
    Denoised MDPs: Learning World Models Better Than the World Itself. (arXiv:2206.15477v1 [cs.LG])
The ability to separate signal from noise, and to reason with clean abstractions, is critical to intelligence. With this ability, humans can efficiently perform real-world tasks without considering all possible nuisance factors. How can artificial agents do the same? What kind of information can agents safely discard as noise? In this work, we categorize information out in the wild into four types based on controllability and relation to reward, and formulate useful information as that which is both controllable and reward-relevant. This framework clarifies the kinds of information removed by various prior works on representation learning in reinforcement learning (RL), and leads to our proposed approach of learning a Denoised MDP that explicitly factors out certain noise distractors. Extensive experiments on variants of the DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone, and over prior works, across policy optimization control tasks as well as the non-control task of joint position regression.  ( 2 min )
    Causal Machine Learning: A Survey and Open Problems. (arXiv:2206.15475v1 [cs.LG])
Causal Machine Learning (CausalML) is an umbrella term for machine learning methods that formalize the data-generation process as a structural causal model (SCM). This allows one to reason about the effects of changes to this process (i.e., interventions) and what would have happened in hindsight (i.e., counterfactuals). We categorize work in CausalML into five groups according to the problems they tackle: (1) causal supervised learning, (2) causal generative modeling, (3) causal explanations, (4) causal fairness, and (5) causal reinforcement learning. For each category, we systematically compare its methods and point out open problems. Further, we review modality-specific applications in computer vision, natural language processing, and graph representation learning. Finally, we provide an overview of causal benchmarks and a critical discussion of the state of this nascent field, including recommendations for future work.  ( 2 min )
    Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge. (arXiv:2203.14416v2 [eess.AS] UPDATED)
Text-to-Speech (TTS) services that run on edge devices have many advantages compared to cloud TTS, e.g., regarding latency and privacy. However, neural vocoders with low complexity and a small model footprint inevitably generate annoying sounds. This study proposes Bunched LPCNet2, an improved LPCNet architecture that provides highly efficient performance, with a high-quality configuration for cloud servers and a low-complexity configuration for low-resource edge devices. A single logistic distribution achieves computational efficiency, and insightful tricks reduce the model footprint while maintaining speech quality. A DualRate architecture, which generates a lower sampling rate from a prosody model, is also proposed to reduce maintenance costs. The experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1MB while operating faster than real-time on a RPi 3B. Our audio samples are available at https://srtts.github.io/bunchedLPCNet2.  ( 2 min )
    Tuning Particle Accelerators with Safety Constraints using Bayesian Optimization. (arXiv:2203.13968v3 [physics.acc-ph] UPDATED)
    Tuning machine parameters of particle accelerators is a repetitive and time-consuming task that is challenging to automate. While many off-the-shelf optimization algorithms are available, in practice their use is limited because most methods do not account for safety-critical constraints in each iteration, such as loss signals or step-size limitations. One notable exception is safe Bayesian optimization, which is a data-driven tuning approach for global optimization with noisy feedback. We propose and evaluate a step-size limited variant of safe Bayesian optimization on two research facilities of the Paul Scherrer Institut (PSI): a) the Swiss Free Electron Laser (SwissFEL) and b) the High-Intensity Proton Accelerator (HIPA). We report promising experimental results on both machines, tuning up to 16 parameters subject to 224 constraints.  ( 2 min )
    A Deep Reinforcement Learning Blind AI in DareFightingICE. (arXiv:2205.07444v2 [cs.LG] UPDATED)
This paper presents a deep reinforcement learning agent (AI) that uses sound as the input on the DareFightingICE platform at the DareFightingICE Competition in IEEE CoG 2022. In this work, an AI that only uses sound as the input is called a blind AI. While state-of-the-art AIs rely mostly on visual or structured observations provided by their environments, learning to play games from sound alone is still new and thus challenging. We propose different approaches to process audio data and use the Proximal Policy Optimization algorithm for our blind AI. We also propose to use our blind AI in the evaluation of sound designs submitted to the competition and define two metrics for this task. The experimental results show the effectiveness of not only our blind AI but also the two proposed metrics.  ( 2 min )
    Watch and Match: Supercharging Imitation with Regularized Optimal Transport. (arXiv:2206.15469v1 [cs.RO])
    Imitation learning holds tremendous promise in learning policies efficiently for complex decision making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where given a set of expert demonstrations, an agent alternatively infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interactions for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal transport based trajectory-matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8X faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.  ( 2 min )
    Introducing Non-Linearity into Quantum Generative Models. (arXiv:2205.14506v2 [quant-ph] UPDATED)
    The evolution of an isolated quantum system is linear, and hence quantum algorithms are reversible, including those that utilize quantum circuits as generative machine learning models. However, some of the most successful classical generative models, such as those based on neural networks, involve highly non-linear and thus non-reversible dynamics. In this paper, we explore the effect of these dynamics in quantum generative modeling by introducing a model that adds non-linear activations via a neural network structure onto the standard Born Machine framework - the Quantum Neuron Born Machine (QNBM). To achieve this, we utilize a previously introduced Quantum Neuron subroutine, which is a repeat-until-success circuit with mid-circuit measurements and classical control. After introducing the QNBM, we investigate how its performance depends on network size, by training a 3-layer QNBM with 4 output neurons and various input and hidden layer sizes. We then compare our non-linear QNBM to the linear Quantum Circuit Born Machine (QCBM). We allocate similar time and memory resources to each model, such that the only major difference is the qubit overhead required by the QNBM. With gradient-based training, we show that while both models can easily learn a trivial uniform probability distribution, on a more challenging class of distributions, the QNBM achieves an almost 3x smaller error rate than a QCBM with a similar number of tunable parameters. We therefore provide evidence that suggests that non-linearity is a useful resource in quantum generative models, and we put forth the QNBM as a new model with good generative performance and potential for quantum advantage.  ( 3 min )
    QuASK -- Quantum Advantage Seeker with Kernels. (arXiv:2206.15284v1 [quant-ph])
QuASK is quantum machine learning software written in Python that supports researchers in designing, experimenting with, and assessing the performance of different quantum and classical kernels. The software is package agnostic and can be integrated with all major quantum software packages (e.g. IBM Qiskit, Xanadu's Pennylane, Amazon Braket). QuASK guides the user through a simple preprocessing of input data and the definition and calculation of quantum and classical kernels, either custom or pre-defined ones. From this evaluation the package provides an assessment of potential quantum advantage and prediction bounds on generalization error. Moreover, it allows for the generation of parametric quantum kernels that can be trained using gradient-descent-based optimization, grid search, or genetic algorithms. Projected quantum kernels, an effective solution to mitigate the curse of dimensionality induced by the exponential scaling dimension of large Hilbert spaces, are also calculated. QuASK can furthermore generate the observable values of a quantum model and use them to study the prediction capabilities of the quantum and classical kernels.  ( 2 min )
    Correcting Mispronunciations in Speech using Spectrogram Inpainting. (arXiv:2204.03379v2 [eess.AS] UPDATED)
    Learning a new language involves constantly comparing speech productions with reference productions from the environment. Early in speech acquisition, children make articulatory adjustments to match their caregivers' speech. Grownup learners of a language tweak their speech to match the tutor reference. This paper proposes a method to synthetically generate correct pronunciation feedback given incorrect production. Furthermore, our aim is to generate the corrected production while maintaining the speaker's original voice. The system prompts the user to pronounce a phrase. The speech is recorded, and the samples associated with the inaccurate phoneme are masked with zeros. This waveform serves as an input to a speech generator, implemented as a deep learning inpainting system with a U-net architecture, and trained to output a reconstructed speech. The training set is composed of unimpaired proper speech examples, and the generator is trained to reconstruct the original proper speech. We evaluated the performance of our system on phoneme replacement of minimal pair words of English as well as on children with pronunciation disorders. Results suggest that human listeners slightly prefer our generated speech over a smoothed replacement of the inaccurate phoneme with a production of a different speaker.  ( 3 min )
    Deep Reinforcement Learning with Swin Transformer. (arXiv:2206.15269v1 [cs.LG])
Transformers are neural network models that utilize multiple layers of self-attention heads. Attention is implemented in transformers as the contextual embeddings of the 'key' and 'query'. Transformers allow the re-combination of attention information from different layers and the processing of all inputs at once, which is more convenient than recurrent neural networks when dealing with large amounts of data. Transformers have exhibited great performance on natural language processing tasks in recent years. Meanwhile, there have been tremendous efforts to adapt transformers to other fields of machine learning, such as the Swin Transformer and the Decision Transformer. The Swin Transformer is a promising neural network architecture that splits image pixels into small patches and applies local self-attention operations inside the (shifted) windows of fixed sizes. The Decision Transformer has successfully applied transformers to off-line reinforcement learning and showed that random-walk samples from Atari games are sufficient to let an agent learn optimized behaviors. However, it is considerably more challenging to combine online reinforcement learning with transformers. In this article, we further explore the possibility of not modifying the reinforcement learning policy, but only replacing the convolutional neural network architecture with the self-attention architecture from the Swin Transformer. Namely, we aim to change how an agent views the world, but not how an agent plans about the world. We conduct our experiments on 49 games in the Arcade Learning Environment. The results show that using the Swin Transformer in reinforcement learning achieves significantly higher evaluation scores across the majority of games in the Arcade Learning Environment. Thus, we conclude that online reinforcement learning can benefit from exploiting self-attention with spatial token embeddings.  ( 3 min )
    How to Leverage Unlabeled Data in Offline Reinforcement Learning. (arXiv:2202.01741v3 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition. In many cases, labeling large datasets with rewards may be costly, especially if those rewards must be provided by human labelers, while collecting diverse unlabeled data might be comparatively inexpensive. How can we best leverage such unlabeled data in offline RL? One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data. In this paper, we find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing both in theory and in practice, without learning any reward model at all. While this approach might seem strange (and incorrect) at first, we provide extensive theoretical and empirical analysis that illustrates how it trades off reward bias, sample complexity and distributional shift, often leading to good results. We characterize conditions under which this simple strategy is effective, and further show that extending it with a simple reweighting approach can further alleviate the bias introduced by using incorrect reward labels. Our empirical evaluation confirms these findings in simulated robotic locomotion, navigation, and manipulation settings.  ( 3 min )
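The core trick described above is essentially a one-liner: relabel every unlabeled transition with reward zero and pool it with the labeled data. A sketch with a hypothetical dict-of-arrays dataset layout:

```python
# Relabel unlabeled transitions with reward 0 and pool them with labeled data.
# The dict-of-arrays dataset layout is a hypothetical convention.
import numpy as np

def merge_with_zero_rewards(labeled, unlabeled):
    # labeled has keys "obs", "action", "reward", "next_obs"; unlabeled lacks "reward"
    unlabeled = dict(unlabeled)
    unlabeled["reward"] = np.zeros(len(unlabeled["obs"]))    # the paper's simple zero-reward labels
    return {k: np.concatenate([labeled[k], unlabeled[k]]) for k in labeled}
```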
    Chained Generalisation Bounds. (arXiv:2203.00977v2 [stat.ML] UPDATED)
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.  ( 2 min )
    Neural Network Assisted Depth Map Packing for Compression Using Standard Hardware Video Codecs. (arXiv:2206.15183v1 [cs.MM])
Depth maps are needed by various graphics rendering and processing operations. Depth map streaming is often necessary when such operations are performed in a distributed system, and in most cases it requires fast compression, which is why video codecs are often used. Hardware implementations of standard video codecs enable relatively high resolution and framerate combinations, even on resource-constrained devices, but unfortunately those implementations do not currently support RGB+depth extensions. However, they can be used for depth compression by first packing the depth maps into RGB or YUV frames. We investigate depth map compression using a combination of depth map packing followed by encoding with a standard video codec. We show that the precision at which depth maps are packed has a large and nontrivial impact on the error caused by the combination of the packing scheme and lossy compression when the bitrate is constrained. Consequently, we propose a variable-precision packing scheme assisted by a neural network model that predicts the optimal precision for each depth map given a bitrate constraint. We demonstrate that the model yields near-optimal predictions and that it can be integrated into a game engine with very low overhead using modern hardware.  ( 2 min )
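The packing step can be sketched as fixed-point quantization at a variable precision p: quantize depth to p bits, then split the result into high and low 8-bit planes that fit standard video channels. The paper's actual packing scheme, and the neural network that predicts p, are more involved; this is only an illustration.

```python
# Illustrative fixed-point depth packing at precision p: quantize depth in [0, 1)
# to p bits, then split into high/low 8-bit planes for standard video channels.
import numpy as np

def pack_depth(depth, p):
    q = np.clip((depth * (1 << p)).astype(np.uint32), 0, (1 << p) - 1)
    return (q >> 8).astype(np.uint8), (q & 0xFF).astype(np.uint8)

def unpack_depth(high, low, p):
    q = (high.astype(np.uint32) << 8) | low
    return q.astype(np.float32) / (1 << p)
```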
    Classical and learned MR to pseudo-CT mappings for accurate transcranial ultrasound simulation. (arXiv:2206.15441v1 [physics.med-ph])
    Model-based treatment planning for transcranial ultrasound therapy typically involves mapping the acoustic properties of the skull from an x-ray computed tomography (CT) image of the head. Here, three methods for generating pseudo-CT images from magnetic resonance (MR) images were compared as an alternative to CT. A convolutional neural network (U-Net) was trained on paired MR-CT images to generate pseudo-CT images from either T1-weighted or zero-echo time (ZTE) MR images (denoted tCT and zCT, respectively). A direct mapping from ZTE to pseudo-CT was also implemented (denoted cCT). When comparing the pseudo-CT and ground truth CT images for the test set, the mean absolute error was 133, 83, and 145 Hounsfield units (HU) across the whole head, and 398, 222, and 336 HU within the skull for the tCT, zCT, and cCT images, respectively. Ultrasound simulations were also performed using the generated pseudo-CT images and compared to simulations based on CT. An annular array transducer was used targeting the visual or motor cortex. The mean differences in the simulated focal pressure, focal position, and focal volume were 9.9%, 1.5 mm, and 15.1% for simulations based on the tCT images, 5.7%, 0.6 mm, and 5.7% for the zCT, and 6.7%, 0.9 mm, and 12.1% for the cCT. The improved results for images mapped from ZTE highlight the advantage of using imaging sequences which improve contrast of the skull bone. Overall, these results demonstrate that acoustic simulations based on MR images can give comparable accuracy to those based on CT.  ( 3 min )
    Benchmark Dataset for Precipitation Forecasting by Post-Processing the Numerical Weather Prediction. (arXiv:2206.15241v1 [cs.LG])
    Precipitation forecasting is an important scientific challenge that has wide-reaching impacts on society. Historically, this challenge has been tackled using numerical weather prediction (NWP) models, grounded on physics-based simulations. Recently, many works have proposed an alternative approach, using end-to-end deep learning (DL) models to replace physics-based NWP. While these DL methods show improved performance and computational efficiency, they exhibit limitations in long-term forecasting and lack the explainability of NWP models. In this work, we present a hybrid NWP-DL workflow to fill the gap between standalone NWP and DL approaches. Under this workflow, the NWP output is fed into a deep model, which post-processes the data to yield a refined precipitation forecast. The deep model is trained with supervision, using Automatic Weather Station (AWS) observations as ground-truth labels. This can achieve the best of both worlds, and can even benefit from future improvements in NWP technology. To facilitate study in this direction, we present a novel dataset focused on the Korean Peninsula, termed KoMet (Korea Meteorological Dataset), comprised of NWP predictions and AWS observations. For NWP, we use the Global Data Assimilation and Prediction Systems-Korea Integrated Model (GDAPS-KIM).  ( 2 min )
    Learning Functions on Multiple Sets using Multi-Set Transformers. (arXiv:2206.15444v1 [cs.LG])
    We propose a general deep architecture for learning functions on multiple permutation-invariant sets. We also show how to generalize this architecture to sets of elements of any dimension by dimension equivariance. We demonstrate that our architecture is a universal approximator of these functions, and show superior results to existing methods on a variety of tasks including counting tasks, alignment tasks, distinguishability tasks and statistical distance measurements. This last task is quite important in Machine Learning. Although our approach is quite general, we demonstrate that it can generate approximate estimates of KL divergence and mutual information that are more accurate than previous techniques that are specifically designed to approximate those statistical distances.  ( 2 min )
    An Intermediate-level Attack Framework on The Basis of Linear Regression. (arXiv:2203.10723v2 [cs.CV] UPDATED)
This paper substantially extends our work published at ECCV, in which an intermediate-level attack was proposed to improve the transferability of some baseline adversarial examples. Specifically, we advocate a framework in which a direct linear mapping from the intermediate-level discrepancies (between adversarial features and benign features) to the prediction loss of the adversarial example is established. By delving deep into the core components of such a framework, we show that 1) a variety of linear regression models can all be considered in order to establish the mapping, 2) the magnitude of the finally obtained intermediate-level adversarial discrepancy is correlated with the transferability, and 3) a further boost in performance can be achieved by performing multiple runs of the baseline attack with random initialization. In addition, by leveraging these findings, we achieve new state-of-the-art results on transfer-based $\ell_\infty$ and $\ell_2$ attacks. Our code is publicly available at https://github.com/qizhangli/ila-plus-plus-lr.  ( 2 min )
    The maximum capability of a topological feature in link prediction. (arXiv:2206.15101v1 [physics.soc-ph])
    Link prediction aims to predict links of a network that are not directly visible, with profound applications in biological and social systems. Despite intensive utilization of the topological feature in this task, it is unclear to what extent a particular feature can be leveraged to infer missing links. Here, we show that the maximum capability of a topological feature follows a simple mathematical expression, which is independent of how an index gauges the feature. Hence, a family of indexes associated with one topological feature shares the same performance limit. A feature's capability is lifted in the supervised prediction, which in general gives rise to better results compared with unsupervised prediction. The universality of the pattern uncovered is empirically verified by 550 structurally diverse networks, which can be applied to feature selection and the analysis of network characteristics associated with a topological feature in link prediction.  ( 2 min )
    Online TSP with Predictions. (arXiv:2206.15364v1 [cs.DS])
We initiate the study of online routing problems with predictions, inspired by recent exciting results in the area of learning-augmented algorithms. Learning-augmented online algorithms, which incorporate predictions in a black-box manner to outperform existing algorithms when the predictions are accurate while maintaining theoretical guarantees even when the predictions are extremely erroneous, are a popular framework for overcoming pessimistic worst-case competitive analysis. In this study, we begin by investigating the classical online traveling salesman problem (OLTSP), where future requests are augmented with predictions. Unlike the prediction models in previous studies, each actual request in the OLTSP, associated with its arrival time and position, may not coincide with the predicted one, which, as imagined, leads to a troublesome situation. Our main result is to study different prediction models and design algorithms that improve the best-known results in the different settings. Moreover, we generalize the proposed results to the online dial-a-ride problem.  ( 2 min )
    Invariance Properties of the Natural Gradient in Overparametrised Systems. (arXiv:2206.15273v1 [cs.LG])
    The natural gradient field is a vector field that lives on a model equipped with a distinguished Riemannian metric, e.g. the Fisher-Rao metric, and represents the direction of steepest ascent of an objective function on the model with respect to this metric. In practice, one tries to obtain the corresponding direction on the parameter space by multiplying the ordinary gradient by the inverse of the Gram matrix associated with the metric. We refer to this vector on the parameter space as the natural parameter gradient. In this paper we study when the pushforward of the natural parameter gradient is equal to the natural gradient. Furthermore we investigate the invariance properties of the natural parameter gradient. Both questions are addressed in an overparametrised setting.  ( 2 min )
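The natural parameter gradient described above can be written directly: multiply the ordinary gradient by the inverse of the metric's Gram matrix (e.g. the Fisher matrix). A minimal NumPy sketch, with a damping term added as a standard numerical safeguard (an assumption, not part of the definition):

```python
# Natural parameter gradient: solve G x = grad, i.e. x = G^{-1} grad, where G is
# the Gram matrix of the chosen metric (e.g. the Fisher matrix).
import numpy as np

def natural_parameter_gradient(grad, gram, damping=1e-6):
    g = gram + damping * np.eye(len(grad))   # damping guards against ill-conditioning (assumption)
    return np.linalg.solve(g, grad)
```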
    Revisiting Competitive Coding Approach for Palmprint Recognition: A Linear Discriminant Analysis Perspective. (arXiv:2206.15349v1 [cs.CV])
The Competitive Coding approach (CompCode) is one of the most promising methods for palmprint recognition. Due to its high performance and simple formulation, it has been continuously studied for many years. However, although numerous variations of CompCode have been proposed, a detailed analysis of the method is still absent. In this paper, we provide a detailed analysis of CompCode from the perspective of linear discriminant analysis (LDA) for the first time. A non-trivial sufficient condition under which CompCode is optimal in the sense of Fisher's criterion is presented. Based on our analysis, we examine the statistics of palmprints and conclude that CompCode deviates from the optimal condition. To mitigate this deviation, we propose a new method called Class-Specific CompCode, which improves CompCode by excluding non-palm-line areas from matching. A nonlinear mapping of the competitive code is also applied in this method to further enhance accuracy. Experiments on two public databases demonstrate the effectiveness of the proposed method.  ( 2 min )
    Interpretability, Then What? Editing Machine Learning Models to Reflect Human Knowledge and Values. (arXiv:2206.15465v1 [cs.LG])
Machine learning (ML) interpretability techniques can reveal undesirable patterns in data that models exploit to make predictions, potentially causing harm once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action, empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer.  ( 3 min )
    Physics-informed machine learning for Structural Health Monitoring. (arXiv:2206.15303v1 [cs.LG])
The use of machine learning in Structural Health Monitoring is becoming more common, as many of the inherent tasks (such as regression and classification) in developing condition-based assessment fall naturally into its remit. This chapter introduces the concept of physics-informed machine learning, where one adapts ML algorithms to account for the physical insight an engineer will often have of the structure they are attempting to model or assess. The chapter will demonstrate how grey-box models, that combine simple physics-based models with data-driven ones, can improve predictive capability in an SHM setting. A particular strength of the approach demonstrated here is the capacity of the models to generalise, with enhanced predictive capability in different regimes. This is a key issue when life-time assessment is a requirement, or when monitoring data do not span the operational conditions a structure will undergo. The chapter will provide an overview of physics-informed ML, introducing a number of new approaches for grey-box modelling in a Bayesian setting. The main ML tool discussed will be Gaussian process regression; we will demonstrate how physical assumptions/models can be incorporated through constraints, through the mean function and kernel design, and finally in a state-space setting. A range of SHM applications will be demonstrated, from loads monitoring tasks for offshore and aerospace structures, through to performance monitoring for long-span bridges.  ( 3 min )
    Learning Citywide Patterns of Life from Trajectory Monitoring. (arXiv:2206.15352v1 [cs.LG])
    The recent proliferation of real-world human mobility datasets has catalyzed geospatial and transportation research in trajectory prediction, demand forecasting, travel time estimation, and anomaly detection. However, these datasets also enable, more broadly, a descriptive analysis of intricate systems of human mobility. We formally define patterns of life analysis as a natural, explainable extension of online unsupervised anomaly detection, where we not only monitor a data stream for anomalies but also explicitly extract normal patterns over time. To learn patterns of life, we adapt Grow When Required (GWR) episodic memory from research in computational biology and neurorobotics to a new domain of geospatial analysis. This biologically-inspired neural network, related to self-organizing maps (SOM), constructs a set of "memories" or prototype traffic patterns incrementally as it iterates over the GPS stream. It then compares each new observation to its prior experiences, inducing an online, unsupervised clustering and anomaly detection on the data. We mine patterns-of-interest from the Porto taxi dataset, including both major public holidays and newly-discovered transportation anomalies, such as festivals and concerts which, to our knowledge, have not been previously acknowledged or reported in prior work. We anticipate that the capability to incrementally learn normal and abnormal road transportation behavior will be useful in many domains, including smart cities, autonomous vehicles, and urban planning and management.  ( 3 min )
    When an Active Learner Meets a Black-box Teacher. (arXiv:2206.15205v1 [cs.LG])
Active learning maximizes hypothesis updates in order to find the desired unlabeled data. An inherent assumption is that this learning manner can drive those updates toward the optimal hypothesis. However, its convergence may not be well guaranteed if the incremental updates are negative and disordered. In this paper, we introduce a machine teacher who provides a black-box teaching hypothesis for an active learner, where the teaching hypothesis is an effective approximation of the optimal hypothesis. Theoretically, we prove that, under the guidance of this teaching hypothesis, the learner can converge to tighter generalization error and label complexity bounds than non-educated learners who do not receive any guidance from a teacher. We further consider two teaching scenarios: teaching a white-box and a black-box learner, where self-improvement of teaching is first proposed to improve the teaching performance. Experiments verify this idea and show better performance than fundamental active learning strategies such as IWAL and IWAL-D.  ( 2 min )
    DESTA: A Framework for Safe Reinforcement Learning with Markov Games of Intervention. (arXiv:2110.14468v2 [cs.LG] UPDATED)
Reinforcement learning (RL) involves performing exploratory actions in an unknown system. This can place a learning agent in dangerous and potentially catastrophic system states. Current approaches for tackling safe learning in RL simultaneously trade off safe exploration and task fulfillment. In this paper, we introduce a new generation of RL solvers that learn to minimise safety violations while maximising the task reward to the extent that can be tolerated by the safe policy. Our approach introduces a novel two-player framework for safe RL called Distributive Exploration Safety Training Algorithm (DESTA). The core of DESTA is a game between two adaptive agents: a Safety Agent that is delegated the task of minimising safety violations and a Task Agent whose goal is to maximise the environment reward. Specifically, the Safety Agent can selectively take control of the system at any given point to prevent safety violations, while the Task Agent is free to execute its policy at all other states. This framework enables the Safety Agent to learn to take actions at certain states that minimise future safety violations, both during training and testing, while the Task Agent performs actions that maximise task performance everywhere else. Theoretically, we prove that DESTA converges to stable points, enabling the safety violations of pretrained policies to be minimised. Empirically, we first show DESTA's ability to augment the safety of existing policies and, second, its ability to construct safe RL policies when the Task Agent and Safety Agent are trained concurrently. We demonstrate DESTA's superior performance against leading RL methods on Lunar Lander and Frozen Lake from OpenAI Gym.  ( 3 min )
    Privacy-preserving household load forecasting based on non-intrusive load monitoring: A federated deep learning approach. (arXiv:2206.15192v1 [cs.LG])
Load forecasting is essential in the analysis and grid planning of power systems. For this reason, we propose a household load forecasting method based on federated deep learning and non-intrusive load monitoring (NILM). To the best of our knowledge, this is the first research on federated learning (FL) for household load forecasting based on NILM. In this method, the aggregate power is decomposed into individual device power by non-intrusive load monitoring, and the power of individual appliances is predicted separately using a federated deep learning model. Finally, the predicted power values of individual appliances are aggregated to form the total power prediction. Specifically, predicting each appliance separately to obtain its power avoids the error caused by the strong time dependence in the power signal of a single device. In the federated deep learning prediction model, the household owners holding the power data share the parameters of their local models instead of the local power data, guaranteeing the privacy of household user data. The case results demonstrate that the proposed approach provides a better prediction effect than the traditional methodology that directly predicts the aggregated signal as a whole. In addition, experiments in various federated learning environments are designed and implemented to validate the effectiveness of this methodology.  ( 3 min )
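A hedged sketch of the two mechanisms in this abstract: FedAvg-style parameter sharing, so households exchange model weights rather than raw power data, and summation of per-appliance forecasts into a total-load forecast. The model shapes, sample sizes, and appliance values below are illustrative, not from the paper.

```python
import numpy as np

def fedavg(local_weights, sizes):
    """Weighted average of local model parameters (one array per household)."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(local_weights, sizes))

# Each household trains locally (stubbed here) and shares only its weights.
local_weights = [np.random.rand(4) for _ in range(3)]
sizes = [120, 80, 200]                       # local training-set sizes
global_weights = fedavg(local_weights, sizes)

# Per-appliance forecasts are aggregated into the household-level forecast.
appliance_forecasts = {"fridge": 0.12, "heater": 1.50, "washer": 0.30}  # kW
total_forecast = sum(appliance_forecasts.values())
print(global_weights, total_forecast)
```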
    Prediction of Dilatory Behavior in eLearning: A Comparison of Multiple Machine Learning Models. (arXiv:2206.15079v1 [stat.ML])
Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include a higher risk of drop-out, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, research focusing on such predictions is scarce. Moreover, studies involving different types of predictors and comparisons between the predictive performance of various methods are virtually non-existent. In this study, we aim to fill these research gaps by analyzing the performance of multiple machine learning algorithms when predicting the delayed or timely submission of online assignments in a higher education setting with two categories of predictors: subjective, questionnaire-based variables and objective, log-data based indicators extracted from a learning management system. The results show that models with objective predictors consistently outperform models with subjective predictors, and a combination of both variable types performs slightly better. For each of these three options, a different approach prevailed (Gradient Boosting Machines for the subjective, Bayesian multilevel models for the objective, and Random Forest for the combined predictors). We conclude that careful attention should be paid to the selection of predictors and algorithms before implementing such models in learning management systems.  ( 3 min )
    Learning Iterative Reasoning through Energy Minimization. (arXiv:2206.15448v1 [cs.LG])
Deep learning has excelled on complex pattern recognition tasks such as image classification and object recognition. However, it struggles with tasks requiring nontrivial reasoning, such as algorithmic computation. Humans are able to solve such tasks through iterative reasoning -- spending more time thinking about harder tasks. Most existing neural networks, however, exhibit a fixed computational budget controlled by the neural network architecture, preventing additional computational processing on harder tasks. In this work, we present a new framework for iterative reasoning with neural networks. We train a neural network to parameterize an energy landscape over all outputs, and implement each step of the iterative reasoning as an energy minimization step to find a minimal energy solution. By formulating reasoning as an energy minimization problem, for harder problems that lead to more complex energy landscapes, we may then adjust our underlying computational budget by running a more complex optimization procedure. We empirically show that our iterative reasoning approach yields more accurate and generalizable solutions on algorithmic reasoning tasks in both graph and continuous domains. Finally, we illustrate that our approach can recursively solve algorithmic problems requiring nested reasoning.  ( 2 min )
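An illustrative sketch of reasoning as energy minimization: starting from an initial guess y, repeatedly step along -dE/dy, and simply allot more steps to harder inputs. The quadratic energy below is a toy stand-in for the learned energy network in the paper.

```python
import torch

def reason(energy_fn, x, y_init, steps=20, lr=0.1):
    """Iteratively refine y by gradient descent on the energy E(x, y)."""
    y = y_init.clone().requires_grad_(True)
    for _ in range(steps):
        (grad,) = torch.autograd.grad(energy_fn(x, y), y)
        y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()

# Toy energy whose minimum over y is y = 2 * x.
energy = lambda x, y: ((y - 2 * x) ** 2).sum()
x = torch.tensor([1.0, -3.0])
print(reason(energy, x, torch.zeros(2)))  # approaches [2., -6.]
```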
    Understanding Instance-Level Impact of Fairness Constraints. (arXiv:2206.15437v1 [cs.LG])
A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias. Their impacts have been largely evaluated for different groups of populations corresponding to a set of sensitive attributes, such as race or gender. Nonetheless, the community has not sufficiently explored how imposing fairness constraints fares at the instance level. Building on the concept of the influence function, a measure that characterizes the impact of a training example on the target model and its predictive performance, this work studies the influence of training examples when fairness constraints are imposed. We find that, under certain assumptions, the influence function with respect to fairness constraints can be decomposed into a kernelized combination of training examples. One promising application of the proposed fairness influence function is to identify suspicious training examples that may cause model discrimination by ranking their influence scores. We demonstrate with extensive experiments that training on a subset of high-influence data examples leads to lower fairness violations, at a trade-off in accuracy.  ( 2 min )
Learning Nonparametric Ordinary Differential Equations: Application to Sparse and Noisy Data. (arXiv:2206.15215v1 [stat.ML])
Learning nonparametric systems of Ordinary Differential Equations (ODEs) $\dot x = f(t,x)$ from noisy and sparse data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for $f$ for which the solution of the ODE exists and is unique. Learning $f$ consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the $L^2$ distance between $x$ and its estimator. Experiments are provided for the FitzHugh-Nagumo oscillator and for the prediction of the amyloid level in the cortex of aging subjects. In both cases, we show competitive results when compared with the state of the art.  ( 2 min )
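A hedged, one-step sketch of the ingredients named above: Euler (finite-difference) estimates of $\dot x$ from the samples, followed by a Representer-theorem kernel ridge fit of $f$ in an RKHS. The paper iterates a penalty scheme; this collapses it to a single least-squares solve, with an RBF kernel and length scale chosen purely for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def fit_rhs(t, x, reg=1e-3):
    """Fit f in x_dot = f(x) from one trajectory sampled at times t."""
    xdot = np.diff(x, axis=0) / np.diff(t)[:, None]  # Euler derivative estimates
    xs = x[:-1]
    K = rbf_kernel(xs, xs)
    alpha = np.linalg.solve(K + reg * np.eye(len(xs)), xdot)  # representer coefficients
    return lambda q: rbf_kernel(q, xs) @ alpha

t = np.linspace(0, 4, 200)
x = np.stack([np.sin(t), np.cos(t)], axis=1)  # solves x_dot = (x2, -x1)
f_hat = fit_rhs(t, x)
print(f_hat(np.array([[0.0, 1.0]])))          # approximately [[1., 0.]]
```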
    Interpretable Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory Models. (arXiv:2206.15316v1 [cs.LG])
We propose a novel anomaly detection method for echocardiogram videos. The introduced method takes advantage of the periodic nature of the heart cycle to learn different variants of a variational latent trajectory model (TVAE). The models are trained on the healthy samples of an in-house dataset of infant echocardiogram videos consisting of multiple chamber views to learn a normative prior of the healthy population. During inference, maximum a posteriori (MAP) based anomaly detection is performed to detect out-of-distribution samples in our dataset. The proposed method reliably identifies severe congenital heart defects, such as Ebstein's Anomaly or Shone complex. Moreover, it achieves superior performance over MAP-based anomaly detection with standard variational autoencoders on the task of detecting pulmonary hypertension and right ventricular dilation. Finally, we demonstrate that the proposed method provides interpretable explanations of its output through heatmaps which highlight the regions corresponding to anomalous heart structures.  ( 2 min )
    R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS. (arXiv:2206.15276v1 [cs.SD])
    This paper introduces R-MelNet, a two-part autoregressive architecture with a frontend based on the first tier of MelNet and a backend WaveRNN-style audio decoder for neural text-to-speech synthesis. Taking as input a mixed sequence of characters and phonemes, with an optional audio priming sequence, this model produces low-resolution mel-spectral features which are interpolated and used by a WaveRNN decoder to produce an audio waveform. Coupled with half precision training, R-MelNet uses under 11 gigabytes of GPU memory on a single commodity GPU (NVIDIA 2080Ti). We detail a number of critical implementation details for stable half precision training, including an approximate, numerically stable mixture of logistics attention. Using a stochastic, multi-sample per step inference scheme, the resulting model generates highly varied audio, while enabling text and audio based controls to modify output waveforms. Qualitative and quantitative evaluations of an R-MelNet system trained on a single speaker TTS dataset demonstrate the effectiveness of our approach.  ( 2 min )
    Classification of network topology and dynamics via sequence characterization. (arXiv:2206.15190v1 [cs.SI])
Sequences arise in many real-world scenarios; thus, identifying the mechanisms behind symbol generation is essential to understanding many complex systems. This paper analyzes sequences generated by agents walking on a networked topology. Given that in many real scenarios the underlying process generating the sequence is hidden, we investigate whether reconstructing the network via the co-occurrence method is useful for recovering both the network topology and the agent dynamics generating the sequences. We found that the characterization of reconstructed networks provides valuable information regarding the process and topology used to create the sequences. In a machine learning approach considering 16 combinations of network topology and agent dynamics as classes, we obtained an accuracy of 87% with sequences generated with less than 40% of nodes visited. Larger sequences turned out to generate improved machine learning models. Our findings suggest that the proposed methodology could be extended to classify sequences and understand the mechanisms behind sequence generation.  ( 2 min )
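A small sketch of the co-occurrence reconstruction this abstract relies on: link two symbols whenever they appear within a sliding window of the sequence, then characterize the resulting weighted graph. The window size is an illustrative choice, not the paper's setting.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sequence, window=2):
    """Weighted edges between symbols that co-occur within a sliding window."""
    edges = Counter()
    for i in range(len(sequence) - window + 1):
        for a, b in combinations(sequence[i:i + window], 2):
            if a != b:
                edges[tuple(sorted((a, b)))] += 1
    return edges

# Symbols stand in for network nodes visited by a walking agent.
print(cooccurrence_edges(list("abcabd")))
# {('a', 'b'): 2, ('b', 'c'): 1, ('a', 'c'): 1, ('b', 'd'): 1}
```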
    Machine learning for automated quality control in injection moulding manufacturing. (arXiv:2206.15285v1 [cs.LG])
    Machine learning (ML) may improve and automate quality control (QC) in injection moulding manufacturing. As the labelling of extensive, real-world process data is costly, however, the use of simulated process data may offer a first step towards a successful implementation. In this study, simulated data was used to develop a predictive model for the product quality of an injection moulded sorting container. The achieved accuracy, specificity and sensitivity on the test set was $99.4\%$, $99.7\%$ and $94.7\%$, respectively. This study thus shows the potential of ML towards automated QC in injection moulding and encourages the extension to ML models trained on real-world data.  ( 2 min )
    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection. (arXiv:2206.15476v1 [cs.LG])
Analyzing the distribution shift of data is a growing research direction in machine learning today, leading to new benchmarks that focus on providing suitable scenarios for studying the generalization properties of ML models. The existing benchmarks are focused on supervised learning, and to the best of our knowledge, there is none for unsupervised learning. Therefore, we introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This kind of data meets the premise of shifting the input distribution: it covers a large time span ($10$ years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol splitting the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models (from MLM to the classical Isolation Forest). Finally, we show that by acknowledging the distribution shift problem and properly addressing it, the performance can be improved compared to classical IID training (by up to $3\%$, on average). Dataset and code are available at https://github.com/bit-ml/AnoShift/.  ( 2 min )
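A minimal sketch of the IID/NEAR/FAR idea: train on the earliest years and test on progressively more distant ones. The year boundaries and the 80/20 split below are illustrative, not the official AnoShift splits.

```python
import random

def anoshift_style_splits(records, seed=0):
    """records: iterable of (year, features) pairs; boundaries are illustrative."""
    early = [r for r in records if r[0] <= 2010]
    random.Random(seed).shuffle(early)
    cut = int(0.8 * len(early))
    train, iid_test = early[:cut], early[cut:]            # same distribution as training
    near = [r for r in records if 2011 <= r[0] <= 2013]   # mild shift
    far = [r for r in records if r[0] >= 2014]            # strong shift
    return train, iid_test, near, far

records = [(year, None) for year in range(2006, 2016) for _ in range(3)]
print([len(s) for s in anoshift_style_splits(records)])   # [12, 3, 9, 6]
```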
    Towards out of distribution generalization for problems in mechanics. (arXiv:2206.14917v1 [stat.ML])
    There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applied to real world mechanics problems with unknown test environments and data distribution shifts. In contrast, out-of-distribution (OOD) generalization assumes that the test data may shift (i.e., violate the i.i.d. assumption). To date, multiple methods have been proposed to improve the OOD generalization of ML methods. However, because of the lack of benchmark datasets for OOD regression problems, the efficiency of these OOD methods on regression problems, which dominate the mechanics field, remains unknown. To address this, we investigate the performance of OOD generalization methods for regression problems in mechanics. Specifically, we identify three OOD problems: covariate shift, mechanism shift, and sampling bias. For each problem, we create two benchmark examples that extend the Mechanical MNIST dataset collection, and we investigate the performance of popular OOD generalization methods on these mechanics-specific regression problems. Our numerical experiments show that in most cases, while the OOD generalization algorithms perform better compared to traditional ML methods on these OOD problems, there is a compelling need to develop more robust OOD generalization methods that are effective across multiple OOD scenarios. Overall, we expect that this study, as well as the associated open access benchmark datasets, will enable further development of OOD generalization methods for mechanics specific regression problems.  ( 3 min )
    Using Person Embedding to Enrich Features and Data Augmentation for Classification. (arXiv:2206.15162v1 [cs.LG])
Today, machine learning is applied in almost every field. Among its numerous methods, classification is one of the most basic and crucial, and a wide variety of problems can be solved with it. Feature selection for model setup is extremely important, and producing new features via feature engineering also has a vital place in the success of a model. In our study, fraud detection classification models are built on a labeled and imbalanced dataset as a case study. Although word embedding is a natural language processing method, we use it to create a customer space; such embeddings have been used in different areas, especially in recommender systems. The customer vectors in the created space are fed to the classification model as features. Moreover, to increase the number of positive labels, rows with similar characteristics are re-labeled as positive using customer similarity determined by the embedding. The model that includes the embedding methods in the classification, and which provides a better representation of customers, is compared with the other models. Considering the results, we observe that the customer embedding method has a positive effect on the success of the classification models.  ( 2 min )
    Out-of-Distribution Detection for Long-tailed and Fine-grained Skin Lesion Images. (arXiv:2206.15186v1 [cs.CV])
Recent years have witnessed rapid development of automated methods for skin lesion diagnosis and classification. Due to the increasing deployment of such systems in clinics, it has become important to make them more robust to various Out-of-Distribution (OOD) samples (unknown skin lesions and conditions). However, current deep learning models trained for skin lesion classification tend to classify these OOD samples incorrectly into one of their learned skin lesion categories. To address this issue, we propose a simple yet strategic approach that improves OOD detection performance while maintaining the multi-class classification accuracy for the known categories of skin lesions. Specifically, this approach is built upon a realistic scenario of a long-tailed and fine-grained OOD detection task for skin lesion images. Through this approach, 1) we first apply mixup among the middle and tail classes to address the long-tail problem, and 2) we then combine this mixup strategy with prototype learning to address the fine-grained nature of the dataset. The unique contribution of this paper is two-fold, justified by extensive experiments. First, we present a realistic problem setting of the OOD task for skin lesions. Second, we propose an approach that targets the long-tailed and fine-grained aspects of the problem setting simultaneously to increase OOD performance.  ( 3 min )
    GSCLIP : A Framework for Explaining Distribution Shifts in Natural Language. (arXiv:2206.15007v1 [cs.CL])
Helping end users comprehend abstract distribution shifts can greatly facilitate AI deployment. Motivated by this, we propose a novel task, dataset explanation. Given two image data sets, dataset explanation aims to automatically point out their dataset-level distribution shifts in natural language. Current techniques for monitoring distribution shifts provide inadequate information for understanding datasets with the goal of improving data quality. Therefore, we introduce GSCLIP, a training-free framework to solve the dataset explanation task. In GSCLIP, we propose the selector as the first quantitative evaluation method to identify explanations that properly summarize dataset shifts. Furthermore, we leverage this selector to demonstrate the superiority of a generator based on language model generation. Systematic evaluation on natural data shift verifies that GSCLIP, a combined system of a hybrid generator group and an efficient selector, is not only easy to use but also powerful for dataset explanation at scale.  ( 2 min )
    Bridging Mean-Field Games and Normalizing Flows with Trajectory Regularization. (arXiv:2206.14990v1 [math.OC])
Mean-field games (MFGs) are a modeling framework for systems with a large number of interacting agents. They have applications in economics, finance, and game theory. Normalizing flows (NFs) are a family of deep generative models that compute data likelihoods by using an invertible mapping, which is typically parameterized by using neural networks. They are useful for density modeling and data generation. While active research has been conducted on both models, few works have noted the relationship between the two. In this work, we unravel the connections between MFGs and NFs by contextualizing the training of an NF as solving the MFG. This is achieved by reformulating the MFG problem in terms of agent trajectories and parameterizing a discretization of the resulting MFG with flow architectures. With this connection, we explore two research directions. First, we employ expressive NF architectures to accurately solve high-dimensional MFGs, sidestepping the curse of dimensionality in traditional numerical methods. Compared with other deep learning approaches, our trajectory-based formulation encodes the continuity equation in the neural network, resulting in a better approximation of the population dynamics. Second, we regularize the training of NFs with transport costs and show the effectiveness of this on controlling the model's Lipschitz bound, resulting in better generalization performance. We demonstrate numerical results through comprehensive experiments on a variety of synthetic and real-life datasets.  ( 3 min )
    Personalized Detection of Cognitive Biases in Actions of Users from Their Logs: Anchoring and Recency Biases. (arXiv:2206.15129v1 [cs.AI])
Cognitive biases are mental shortcuts humans use in dealing with information and the environment, and they result in biased actions and behaviors, unbeknownst to the person. Biases take many forms, with cognitive biases occupying a central role that affects fairness, accountability, transparency, ethics, law, medicine, and discrimination. Detection of biases is considered a necessary step toward their mitigation. Herein, we focus on two cognitive biases - anchoring and recency. Recognition of cognitive bias in computer science has largely been confined to information retrieval, where bias is identified at an aggregate level with the help of annotated data. Proposing a different direction for bias detection, we offer a principled machine learning approach to detect these two cognitive biases from Web logs of users' actions. Detection at the individual user level makes it truly personalized and does not rely on annotated data. Instead, we start with two basic principles established in cognitive psychology, use modified training of an attention network, and interpret attention weights in a novel way according to those principles to infer and distinguish between the two biases. The personalized approach allows detection for specific users who are susceptible to these biases when performing their tasks and can help build awareness among them so as to undertake bias mitigation.  ( 3 min )
    Leveraging Joint-Diagonalization in Transform-Learning NMF. (arXiv:2112.05664v2 [cs.LG] UPDATED)
    Non-negative matrix factorization with transform learning (TL-NMF) is a recent idea that aims at learning data representations suited to NMF. In this work, we relate TL-NMF to the classical matrix joint-diagonalization (JD) problem. We show that, when the number of data realizations is sufficiently large, TL-NMF can be replaced by a two-step approach -- termed as JD+NMF -- that estimates the transform through JD, prior to NMF computation. In contrast, we found that when the number of data realizations is limited, not only is JD+NMF no longer equivalent to TL-NMF, but the inherent low-rank constraint of TL-NMF turns out to be an essential ingredient to learn meaningful transforms for NMF.  ( 2 min )
    A note on Linear Bottleneck networks and their Transition to Multilinearity. (arXiv:2206.15058v1 [cs.LG])
    Randomly initialized wide neural networks transition to linear functions of weights as the width grows, in a ball of radius $O(1)$ around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite width assumption is violated. In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization. In general, for $B-1$ bottleneck layers, the network is a degree $B$ multilinear function of weights. Importantly, the degree only depends on the number of bottlenecks and not the total depth of the network.  ( 2 min )
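For intuition, a hedged toy instance of the claim (our illustration, not an excerpt from the paper): with a single bottleneck, a two-layer linear network $f_{W_1,W_2}(x) = W_2 W_1 x$ is exactly bilinear in its weights.

```latex
% Perturbing both weight matrices expands into four terms:
f_{W_1 + \Delta_1,\, W_2 + \Delta_2}(x)
  = W_2 W_1 x + W_2 \Delta_1 x + \Delta_2 W_1 x + \Delta_2 \Delta_1 x,
% which is linear in W_1 and in W_2 separately but degree 2 jointly --
% consistent with degree-B multilinearity for B - 1 bottleneck layers (here B = 2).
```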
    FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition. (arXiv:2206.15056v1 [cs.SD])
Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study, we propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we show that there are correlations between these extracted SSLRs. As such, we further propose a feature refinement loss for decorrelation to efficiently combine the set of input features. For evaluation, we show that the proposed 'FeaRLESS learning features' perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.  ( 2 min )
    ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time. (arXiv:2206.15049v1 [cs.LG])
    Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing these capabilities in machines is pivotal in improving their generalization capability at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference time composition, we employ energy-based models (EBMs) to model concepts and relations. We design ZeroC architecture so that it allows a one-to-one mapping between a symbolic graph structure of a concept and its corresponding EBM, which for the first time, allows acquiring new concepts, communicating its graph structure, and applying it to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset which is designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.  ( 3 min )
    Investigating classification learning curves for automatically generated and labelled plant images. (arXiv:2205.10955v3 [cs.LG] UPDATED)
In the context of supervised machine learning, a learning curve describes how a model's performance on unseen data relates to the number of samples used to train the model. In this paper we present a dataset of plant images with representatives of crops and weeds common to the Manitoba prairies at different growth stages. We determine the learning curve for a classification task on this data with the ResNet architecture. Our results are in accordance with previous studies and add to the evidence that learning curves are governed by power-law relationships over large scales, applications, and models. We further investigate how label noise and the reduction of trainable parameters impact the learning curve on this dataset. Both effects lead to the model requiring disproportionately larger training sets to achieve the same classification performance as observed without these effects.  ( 2 min )
    ComDensE : Combined Dense Embedding of Relation-aware and Common Features for Knowledge Graph Completion. (arXiv:2206.14925v1 [cs.AI])
Real-world knowledge graphs (KG) are mostly incomplete. The problem of recovering missing relations, called KG completion, has recently become an active research area. Knowledge graph (KG) embedding, a low-dimensional representation of entities and relations, is the crucial technique for KG completion. Convolutional neural network models such as ConvE, SACN, InteractE, and RGCN have achieved recent successes. This paper takes a different architectural view and proposes ComDensE, which combines relation-aware and common features using dense neural networks. In the relation-aware feature extraction, we attempt to create relational inductive bias by applying an encoding function specific to each relation. In the common feature extraction, we apply the common encoding function to all input embeddings. These encoding functions are implemented using dense layers in ComDensE. ComDensE achieves state-of-the-art performance in link prediction in terms of MRR and HIT@1 on FB15k-237 and HIT@1 on WN18RR, compared to the previous baseline approaches. We conduct an extensive ablation study to examine the effects of the relation-aware layer and the common layer of ComDensE. Experimental results illustrate that the combined dense architecture as implemented in ComDensE achieves the best performance.  ( 2 min )
    Teach me how to Interpolate a Myriad of Embeddings. (arXiv:2206.14868v1 [cs.LG])
    Mixup refers to interpolation-based data augmentation, originally motivated as a way to go beyond empirical risk minimization (ERM). Yet, its extensions focus on the definition of interpolation and the space where it takes place, while the augmentation itself is less studied: For a mini-batch of size $m$, most methods interpolate between $m$ pairs with a single scalar interpolation factor $\lambda$. In this work, we make progress in this direction by introducing MultiMix, which interpolates an arbitrary number $n$ of tuples, each of length $m$, with one vector $\lambda$ per tuple. On sequence data, we further extend to dense interpolation and loss computation over all spatial positions. Overall, we increase the number of tuples per mini-batch by orders of magnitude at little additional cost. This is possible by interpolating at the very last layer before the classifier. Finally, to address inconsistencies due to linear target interpolation, we introduce a self-distillation approach to generate and interpolate synthetic targets. We empirically show that our contributions result in significant improvement over state-of-the-art mixup methods on four benchmarks. By analyzing the embedding space, we observe that the classes are more tightly clustered and uniformly spread over the embedding space, thereby explaining the improved behavior.  ( 2 min )
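A hedged sketch of the interpolation step described above: draw n weight vectors from a Dirichlet over the m mini-batch items and mix both the last-layer embeddings and the one-hot targets. The shapes and the Dirichlet concentration are illustrative choices, not the paper's exact settings.

```python
import torch

def multimix_interpolate(embeddings, targets_onehot, n_tuples, alpha=1.0):
    """Interpolate n tuples of length m with one weight vector lambda per tuple."""
    m = embeddings.size(0)
    lam = torch.distributions.Dirichlet(torch.full((m,), alpha)).sample((n_tuples,))
    mixed_emb = lam @ embeddings      # (n, d): convex combinations of embeddings
    mixed_tgt = lam @ targets_onehot  # (n, c): matching soft targets
    return mixed_emb, mixed_tgt

emb = torch.randn(8, 16)                       # last-layer embeddings, batch of 8
tgt = torch.eye(4)[torch.randint(0, 4, (8,))]  # one-hot targets, 4 classes
me, mt = multimix_interpolate(emb, tgt, n_tuples=1000)
print(me.shape, mt.shape)  # torch.Size([1000, 16]) torch.Size([1000, 4])
```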
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v1 [cs.LG])
K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by focusing only on the first few eigen-modes when inverting the Kronecker factors. Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain $\approx2.5\times$ reduction in per-epoch time and $\approx3.3\times$ reduction in time to target accuracy. We compare our sped-up K-FAC versions with a more computationally efficient NG implementation, SENG, and observe that we perform on par with it.  ( 2 min )
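An illustrative reading of the core idea: since the eigen-spectra of the Kronecker factors decay rapidly, invert only the top-k eigen-modes and treat the rest as pure damping. For simplicity this sketch finds the top modes with a plain eigendecomposition, whereas the paper obtains them cheaply via randomized numerical linear algebra; the synthetic spectrum and damping value are our assumptions.

```python
import numpy as np

def lowrank_damped_inverse(factor, k, damping=1e-3):
    """Approximate (factor + damping*I)^{-1} from the top-k eigen-modes only."""
    vals, vecs = np.linalg.eigh(factor)   # eigenvalues in ascending order
    vals, vecs = vals[-k:], vecs[:, -k:]
    inv = vecs @ np.diag(1.0 / (vals + damping)) @ vecs.T
    inv += (np.eye(factor.shape[0]) - vecs @ vecs.T) / damping  # discarded modes
    return inv

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
spectrum = 10.0 ** -np.arange(50.0)       # rapidly decaying eigen-spectrum
factor = (Q * spectrum) @ Q.T
exact = np.linalg.inv(factor + 1e-3 * np.eye(50))
approx = lowrank_damped_inverse(factor, k=10)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))  # tiny relative error
```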
    Group-invariant tensor train networks for supervised learning. (arXiv:2206.15051v1 [cs.LG])
    Invariance has recently proven to be a powerful inductive bias in machine learning models. One such class of predictive or generative models are tensor networks. We introduce a new numerical algorithm to construct a basis of tensors that are invariant under the action of normal matrix representations of an arbitrary discrete group. This method can be up to several orders of magnitude faster than previous approaches. The group-invariant tensors are then combined into a group-invariant tensor train network, which can be used as a supervised machine learning model. We applied this model to a protein binding classification problem, taking into account problem-specific invariances, and obtained prediction accuracy in line with state-of-the-art deep learning approaches.  ( 2 min )
    Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations. (arXiv:2205.01897v2 [eess.AS] UPDATED)
Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural networks (RNNs) albeit using fewer parameters. We show that this approach does not require oversampling and allows the sampling rate to be increased after training has completed, which results in increased accuracy. Using a sophisticated numerical solver allows accuracy to be increased further at the cost of slower processing. ODEs learned this way do not require closed forms but are still physically interpretable.  ( 2 min )
    Pooling Revisited: Your Receptive Field is Suboptimal. (arXiv:2205.15254v2 [cs.CV] UPDATED)
    The size and shape of the receptive field determine how the network aggregates local information and affect the overall performance of a model considerably. Many components in a neural network, such as kernel sizes and strides for convolution and pooling operations, influence the configuration of a receptive field. However, they still rely on hyperparameters, and the receptive fields of existing models result in suboptimal shapes and sizes. Hence, we propose a simple yet effective Dynamically Optimized Pooling operation, referred to as DynOPool, which optimizes the scale factors of feature maps end-to-end by learning the desirable size and shape of its receptive field in each layer. Any kind of resizing modules in a deep neural network can be replaced by the operations with DynOPool at a minimal cost. Also, DynOPool controls the complexity of a model by introducing an additional loss term that constrains computational cost. Our experiments show that the models equipped with the proposed learnable resizing module outperform the baseline networks on multiple datasets in image classification and semantic segmentation.  ( 2 min )
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v2 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, empirical Rademacher complexity -- by how much can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, orthogonal to classic compression via weight pruning.  ( 2 min )
    Deep Fusion Prior for Multi-Focus Image Super Resolution Fusion. (arXiv:2110.05706v4 [cs.CV] UPDATED)
Multi-focus image fusion (MFIF) and super-resolution (SR) are inverse problems of the imaging model; the purpose of MFIF and SR is to obtain an all-in-focus, high-resolution 2D mapping of targets. Though various MFIF and SR methods have been designed, almost all of them deal with MFIF and SR separately. This paper unifies the MFIF and SR problems from a physical perspective as multi-focus image super-resolution fusion (MFISRF), and we propose a novel unified, dataset-free, unsupervised framework named deep fusion prior (DFP), based on deep image prior (DIP), to address MFISRF with a single model. Experiments have shown that our proposed DFP approaches or even outperforms state-of-the-art MFIF and SR method combinations. To the best of our knowledge, the proposed work is the first dataset-free unsupervised method to simultaneously perform the multi-focus fusion and super-resolution tasks. Additionally, DFP is a general framework; thus its networks and focus-measurement tactics can be continuously updated to further improve MFISRF performance. DFP code is open source and available at this http URL  ( 3 min )
    A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions. (arXiv:2202.11912v2 [cs.LG] UPDATED)
    As deep learning (DL) efficacy grows, concerns for poor model explainability grow also. Attribution methods address the issue of explainability by quantifying the importance of an input feature for a model prediction. Among various methods, Integrated Gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfy said axioms. This paper comments on fundamental aspects of IG and its applications/extensions: 1) We identify key differences between IG function spaces and the supporting literature's function spaces which problematize previous claims of IG uniqueness. We show that with the introduction of an additional axiom, \textit{non-decreasing positivity}, the uniqueness claims can be established. 2) We address the question of input sensitivity by identifying function classes where IG is/is not Lipschitz in the attributed input. 3) We show that axioms for single-baseline methods have analogous properties for methods with probability distribution baselines. 4) We introduce a computationally efficient method of identifying internal neurons that contribute to specified regions of an IG attribution map. Finally, we present experimental results validating this method.  ( 2 min )
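For readers unfamiliar with the method under discussion, the standard Integrated Gradients attribution is $\mathrm{IG}_i(x) = (x_i - x'_i)\int_0^1 \partial F(x' + \alpha(x - x'))/\partial x_i \, d\alpha$; below is a minimal Riemann-sum sketch with a toy model. This illustrates the baseline IG method, not the paper's extensions.

```python
import torch

def integrated_gradients(model, x, baseline, steps=64):
    """Riemann-sum approximation of IG along the straight path baseline -> x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    path = baseline + alphas * (x - baseline)   # points along the path
    path.requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)   # average gradient times displacement

model = lambda z: (z ** 2).sum(dim=-1)          # toy model F(x) = ||x||^2
x, baseline = torch.tensor([1.0, 2.0]), torch.zeros(2)
print(integrated_gradients(model, x, baseline))  # approximately [1., 4.]
```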
    SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning. (arXiv:2110.11395v2 [cs.LG] UPDATED)
    Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits will be especially prominent if coarse-grained structures, like feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) which include correlations among all structures and layers. Our main method SOSP-H employs an innovative second-order approximation, which enables saliency evaluations by fast Hessian-vector products. SOSP-H thereby scales like a first-order method despite taking into account the full Hessian. We validate SOSP-H by comparing it to our second method SOSP-I that uses a well-established Hessian approximation, and to numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network, but also by detecting architectural bottlenecks. We show that our algorithms allow to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.  ( 3 min )
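A minimal sketch of the fast Hessian-vector products that let SOSP-H scale like a first-order method: two backward passes instead of ever forming the Hessian. The loss and parameters are toy stand-ins.

```python
import torch

def hessian_vector_product(loss_fn, params, vec):
    """Compute H @ vec via double backprop, without materializing H."""
    loss = loss_fn(params)
    (grad,) = torch.autograd.grad(loss, params, create_graph=True)
    (hvp,) = torch.autograd.grad((grad * vec).sum(), params)
    return hvp

params = torch.tensor([1.0, 2.0], requires_grad=True)
loss_fn = lambda p: (p ** 3).sum()   # Hessian is diag(6 * p)
v = torch.ones(2)
print(hessian_vector_product(loss_fn, params, v))  # tensor([ 6., 12.])
```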
    Shifts 2.0: Extending The Dataset of Real Distributional Shifts. (arXiv:2206.15407v1 [cs.LG])
    Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML baseline datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. Among these benchmarks, the Shifts dataset stands out in terms of the diversity of tasks as well as the data modalities it features. While most of the benchmarks are heavily dominated by 2D image classification tasks, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables the robustness properties of models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and a strict safety requirement due to the high cost of errors. These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations. In this work, we provide a description of the dataset and baseline results for both tasks.  ( 3 min )
    Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review. (arXiv:2202.12205v2 [cs.AI] UPDATED)
Advocates for Neuro-Symbolic Artificial Intelligence (NeSy) assert that combining deep learning with symbolic reasoning will lead to stronger AI than either paradigm on its own. As successful as deep learning has been, it is generally accepted that even our best deep learning systems are not very good at abstract reasoning. And since reasoning is inextricably linked to language, it makes intuitive sense that Natural Language Processing (NLP) would be a particularly well-suited candidate for NeSy. We conduct a structured review of studies implementing NeSy for NLP, with the aim of answering the question of whether NeSy is indeed meeting its promises: reasoning, out-of-distribution generalization, interpretability, learning and reasoning from small data, and transferability to new domains. We examine the impact of knowledge representation, such as rules and semantic networks, language structure and relational structure, and whether implicit or explicit reasoning contributes to higher promise scores. We find that systems where logic is compiled into the neural network lead to the most NeSy goals being satisfied, while other factors, such as knowledge representation or type of neural architecture, do not exhibit a clear correlation with goals being met. We find many discrepancies in how reasoning is defined, specifically in relation to human-level reasoning, which impact decisions about model architectures and drive conclusions that are not always consistent across studies. Hence we advocate for a more methodical approach to the application of theories of human reasoning as well as the development of appropriate benchmarks, which we hope can lead to a better understanding of progress in the field. We make our data and code available on GitHub for further analysis.  ( 3 min )
    Rethinking Exponential Averaging of the Fisher. (arXiv:2204.04718v2 [cs.LG] UPDATED)
In optimization for machine learning (ML), it is typical that curvature-matrix (CM) estimates rely on an exponential average (EA) of local estimates (giving EA-CM algorithms). This approach has little principled justification but is very often used in practice. In this paper, we draw a connection between EA-CM algorithms and what we call a "Wake of Quadratic regularized models". The outlined connection allows us to understand what EA-CM algorithms are doing from an optimization perspective. Generalizing from the established connection, we propose a new family of algorithms, "KL-Divergence Wake-Regularized Models" (KLD-WRM). We give three different practical instantiations of KLD-WRM and show numerically that these outperform K-FAC on MNIST.  ( 2 min )
    Contrastive Pretraining for Echocardiography Segmentation with Limited Data. (arXiv:2201.07219v2 [eess.IV] UPDATED)
Contrastive learning has proven useful in many applications where access to labelled data is limited. The lack of annotated data is particularly problematic in medical image segmentation, as it is difficult to have clinical experts manually annotate large volumes of data such as cardiac structures in ultrasound images of the heart. In this paper, we investigate whether contrastive pretraining is helpful for the segmentation of the left ventricle in echocardiography images. Furthermore, we study the effect of contrastive pretraining on two well-known segmentation networks, UNet and DeepLabV3. Our results show that contrastive pretraining helps improve the performance on left ventricle segmentation, particularly when annotated data is scarce. We show how to achieve comparable results to state-of-the-art fully supervised algorithms when we train our models in a self-supervised fashion followed by fine-tuning on just 5\% of the data. We show that our solution outperforms what is currently published on a large public dataset (EchoNet-Dynamic), achieving a Dice score of 0.9211. We also compare the performance of our solution on another smaller dataset (CAMUS) to demonstrate the generalizability of our proposed solution. The code is available at (https://github.com/BioMedIA-MBZUAI/contrastive-echo).  ( 3 min )
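As background, a minimal sketch of a standard contrastive pretraining objective (InfoNCE/NT-Xent style) of the kind used to pretrain encoders before fine-tuning; the paper's exact loss and augmentation pipeline may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature   # pairwise similarities
    labels = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))
```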
    GraphFramEx: Towards Systematic Evaluation of Explainability Methods for Graph Neural Networks. (arXiv:2206.09677v2 [cs.LG] UPDATED)
As one of the most popular machine learning models today, graph neural networks (GNNs) have attracted intense interest recently, and so has their explainability. Users are increasingly interested in a better understanding of GNN models and their outcomes. Unfortunately, today's evaluation frameworks for GNN explainability often rely on synthetic datasets, leading to conclusions of limited scope due to a lack of complexity in the problem instances. As GNN models are deployed in more mission-critical applications, we are in dire need of a common evaluation protocol for explainability methods of GNNs. In this paper, we propose, to our best knowledge, the first systematic evaluation framework for GNN explainability, considering explainability on three different "user needs": explanation focus, mask nature, and mask transformation. We propose a unique metric that combines the fidelity measures and classifies explanations based on their quality of being sufficient or necessary. We scope ourselves to node classification tasks and compare the most representative techniques in the field of input-level explainability for GNNs. For the widely used synthetic benchmarks, surprisingly shallow techniques such as personalized PageRank have the best performance for a minimum computation time. But when the graph structure is more complex and nodes have meaningful features, gradient-based methods, in particular Saliency, are the best according to our evaluation criteria. However, none dominates the others on all evaluation dimensions and there is always a trade-off. We further apply our evaluation protocol in a case study on eBay graphs to reflect the production environment.  ( 3 min )
    An Efficient Industrial Federated Learning Framework for AIoT: A Face Recognition Application. (arXiv:2206.13398v2 [cs.CV] UPDATED)
    Recently, the artificial intelligence of things (AIoT) has been gaining increasing attention, with an intriguing vision of providing highly intelligent services through the network connection of things, leading to an advanced AI-driven ecology. However, recent regulatory restrictions on data privacy preclude uploading sensitive local data to data centers and utilizing them in a centralized approach. Directly applying federated learning algorithms in this scenario could hardly meet the industrial requirements of both efficiency and accuracy. Therefore, we propose an efficient industrial federated learning framework for AIoT in terms of a face recognition application. Specifically, we propose to utilize the concept of transfer learning to speed up federated training on devices and further present a novel design of a private projector that helps protect shared gradients without incurring additional memory consumption or computational cost. Empirical studies on a private Asian face dataset show that our approach can achieve high recognition accuracy in only 20 communication rounds, demonstrating its effectiveness in prediction and its efficiency in training.  ( 2 min )
Ensemble CNN models for Covid-19 Recognition and Severity Prediction From 3D CT-scan. (arXiv:2206.15431v1 [eess.IV])
Since the appearance of Covid-19 in late 2019, it has become an active research topic for the artificial intelligence (AI) community. One of the most interesting AI topics is the analysis of Covid-19 medical imaging, and CT-scan imaging is the most informative tool for this disease. This work is part of the 2nd COV19D competition, where two challenges are set: Covid-19 detection and Covid-19 severity detection from CT-scans. For Covid-19 detection from CT-scans, we propose an ensemble of 2D convolution blocks with Densenet-161 models, where each 2D convolutional block with the Densenet-161 architecture is trained separately and, in the testing phase, the ensemble model is based on the average of their probabilities. On the other hand, we propose an ensemble of convolutional layers with Inception models for Covid-19 severity detection. In addition to the convolutional layers, three Inception variants are used, namely Inception-v3, Inception-v4 and Inception-ResNet. Our proposed approaches outperformed the baseline approach on the validation data of the 2nd COV19D competition by 11% and 16% for Covid-19 detection and Covid-19 severity detection, respectively.  ( 3 min )
    Predicting Corporate Risk by Jointly Modeling Company Networks and Dialogues in Earnings Conference Calls. (arXiv:2206.06174v2 [cs.CL] UPDATED)
    Earnings conference calls are attracting an increasing number of researchers due to their free form and rich information. Existing studies, however, do not take speaker role information into account. Furthermore, current research does not fully account for the impact of inter-company relationships on company risk. The only study that integrates company networks and earnings conference calls constructs an undirected graph for companies holding earnings conference calls at different dates, failing to meet the requirement of no temporal information leakage for prediction tasks. To address the aforementioned issues, we propose a new model called Temporal Virtual Graph Neural Network (TVGNN), which incorporates earnings conference calls and company networks to predict company risk. For the first time, our model incorporates participant role information in dialogue modeling. Moreover, we develop a new approach to construct company networks that ensures no temporal information leakage in the graph. In experiments, our proposed model outperforms all baselines. The supplementary analyses demonstrate the model's effectiveness and interpretability.  ( 2 min )
    Learning two-phase microstructure evolution using neural operators and autoencoder architectures. (arXiv:2204.07230v2 [cond-mat.mtrl-sci] UPDATED)
    Phase-field modeling is an effective but computationally expensive method for capturing the mesoscale morphological and microstructure evolution in materials. Hence, fast and generalizable surrogate models are needed to alleviate the cost of computationally taxing processes such as the optimization and design of materials. The intrinsic discontinuous nature of the physical phenomena, incurred by the presence of sharp phase boundaries, makes the training of the surrogate model cumbersome. We develop a framework that integrates a convolutional autoencoder architecture with a deep neural operator (DeepONet) to learn the dynamic evolution of a two-phase mixture and accelerate time-to-solution in predicting the microstructure evolution. We utilize the convolutional autoencoder to provide a compact representation of the microstructure data in a low-dimensional latent space. DeepONet, which consists of two sub-networks, one for encoding the input function at a fixed number of sensor locations (branch net) and another for encoding the locations of the output functions (trunk net), learns the mesoscale dynamics of the microstructure evolution from the autoencoder latent space. The decoder part of the convolutional autoencoder then reconstructs the time-evolved microstructure from the DeepONet predictions. The trained DeepONet architecture can then be used to replace the high-fidelity phase-field numerical solver in interpolation tasks or to accelerate the numerical solver in extrapolation tasks.  ( 3 min )
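    The branch/trunk structure of a DeepONet can be summarized in a short sketch; the layer widths, latent dimension p, and sensor count below are illustrative assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class DeepONet(nn.Module):
        def __init__(self, n_sensors=100, coord_dim=2, p=64):
            super().__init__()
            # Branch net: encodes the input function sampled at fixed sensors.
            self.branch = nn.Sequential(nn.Linear(n_sensors, 128), nn.Tanh(), nn.Linear(128, p))
            # Trunk net: encodes the coordinates where the output is queried.
            self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.Tanh(), nn.Linear(128, p))

        def forward(self, u_at_sensors, query_coords):
            b = self.branch(u_at_sensors)   # (batch, p)
            t = self.trunk(query_coords)    # (n_points, p)
            return b @ t.T                  # (batch, n_points) predicted field values

    net = DeepONet()
    u = torch.randn(8, 100)    # 8 input functions sampled at 100 sensors
    xy = torch.randn(50, 2)    # 50 query coordinates
    print(net(u, xy).shape)    # torch.Size([8, 50])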
    Learning Generative Factors of Neuroimaging Data with Variational auto-encoders. (arXiv:2206.01939v2 [cs.LG] UPDATED)
    Neuroimaging techniques produce high-dimensional, stochastic data from which it might be challenging to extract high-level knowledge about the phenomena of interest. We address this challenge by applying the generative modelling framework to 1) classify multiple pathologies and 2) recover the neurological mechanisms of those pathologies in a data-driven manner. Our framework learns generative factors of data related to pathologies. We provide an algorithm to decode those factors further and observe how different pathologies affect the observed data. We illustrate the applicability of the proposed approach to identifying schizophrenia, with or without accompanying auditory verbal hallucinations. We further demonstrate the ability of the framework to learn disease-related mechanisms consistent with current domain knowledge. We also compare the proposed framework with several benchmark approaches and highlight its advantages in classification performance and interpretability.  ( 2 min )
    ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training. (arXiv:2110.05323v2 [cs.LG] UPDATED)
    Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes the constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and U-nets, and diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to $20\%$ computation and up to $63\%$ communication costs for converged models. As our approach is also complementary to prior work on compression, we can achieve a wide range of trade-offs by combining these techniques, showing reduced communication of up to $50\times$ at only $0.1\%$ loss in utility. Code is available at https://github.com/a514514772/ProgFed.  ( 3 min )
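    Progressive training of the kind ProgFed describes can be illustrated with a model whose trained prefix grows in stages; the per-stage lightweight heads and layer sizes below are assumptions for illustration, not the released implementation:

    import torch
    import torch.nn as nn

    class ProgressiveModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.blocks = nn.ModuleList([
                nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()),
                nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU()),
                nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU()),
            ])
            # One small head per stage so partial models remain trainable;
            # only blocks[:stage+1] and heads[stage] are updated/communicated.
            self.heads = nn.ModuleList([
                nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, 10))
                for c in (32, 64, 128)
            ])

        def forward(self, x, stage):
            for block in self.blocks[:stage + 1]:
                x = block(x)
            return self.heads[stage](x)

    model = ProgressiveModel()
    x = torch.randn(2, 3, 32, 32)
    print(model(x, stage=0).shape, model(x, stage=2).shape)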
    A Medical Image Fusion Method based on MDLatLRRv2. (arXiv:2206.15179v1 [eess.IV])
    Since MDLatLRR only considers the detail parts (salient features) of input images extracted by latent low-rank representation (LatLRR), it does not effectively use the base parts (principal features) extracted by LatLRR. We therefore propose an improved multi-level decomposition method called MDLatLRRv2, which effectively analyzes and utilizes all the image features obtained by LatLRR, and apply MDLatLRRv2 to medical image fusion. The base parts are fused by an averaging strategy and the detail parts are fused by a nuclear-norm operation. Comparison with existing methods demonstrates that the proposed method achieves state-of-the-art fusion performance in both objective and subjective assessment.  ( 2 min )
    Which Minimizer Does My Neural Network Converge To?. (arXiv:2011.02408v2 [stat.ML] UPDATED)
    The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.  ( 2 min )
    Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load. (arXiv:2203.16637v2 [cs.SD] UPDATED)
    As a neurophysiological response to threat or adverse conditions, stress can affect cognition, emotion and behaviour with potentially detrimental effects on health in the case of sustained exposure. Since the affective content of speech is inherently modulated by an individual's physical and mental state, a substantial body of research has been devoted to the study of paralinguistic correlates of stress-inducing task load. Historically, voice stress analysis (VSA) has been conducted using conventional digital signal processing (DSP) techniques. Despite the development of modern methods based on deep neural networks (DNNs), accurately detecting stress in speech remains difficult due to the wide variety of stressors and considerable variability in the individual stress perception. To that end, we introduce a set of five datasets for task load detection in speech. The voice recordings were collected as either cognitive or physical stress was induced in the cohort of volunteers, with a cumulative number of more than a hundred speakers. We used the datasets to design and evaluate a novel self-supervised audio representation that leverages the effectiveness of handcrafted features (DSP-based) and the complexity of data-driven DNN representations. Notably, the proposed approach outperformed both extensive handcrafted feature sets and novel DNN-based audio representation learning approaches.  ( 3 min )
    Learning Task-relevant Representations for Generalization via Characteristic Functions of Reward Sequence Distributions. (arXiv:2205.10218v3 [cs.LG] UPDATED)
    Generalization across different environments with the same tasks is critical for successful applications of visual reinforcement learning (RL) in real scenarios. However, visual distractions -- which are common in real scenes -- from high-dimensional observations can hurt the learned representations in visual RL, thus degrading the performance of generalization. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as the reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task -- that is, predicting the characteristic functions of RSDs -- to learn task-relevant representations, because we can well approximate the high-dimensional distributions by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves the performance of generalization on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.  ( 3 min )
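    The characteristic function that CRESP's auxiliary task predicts has a simple empirical estimator, phi(t) = E[exp(i t R)]; a small numpy sketch (the reward samples here are synthetic placeholders, not environment data):

    import numpy as np

    def empirical_characteristic_fn(rewards, ts):
        # phi(t) = E[exp(i t R)], estimated from sampled rewards/returns.
        real = np.cos(np.outer(ts, rewards)).mean(axis=1)
        imag = np.sin(np.outer(ts, rewards)).mean(axis=1)
        return real + 1j * imag

    # A CRESP-style prediction head would regress targets like these.
    phi = empirical_characteristic_fn(np.random.randn(1000), np.linspace(-3, 3, 7))
    print(phi)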
    Learning Pneumatic Non-Prehensile Manipulation with a Mobile Blower. (arXiv:2204.02390v2 [cs.RO] UPDATED)
    We investigate pneumatic non-prehensile manipulation (i.e., blowing) as a means of efficiently moving scattered objects into a target receptacle. Due to the chaotic nature of aerodynamic forces, a blowing controller must (i) continually adapt to unexpected changes from its actions, (ii) maintain fine-grained control, since the slightest misstep can result in large unintended consequences (e.g., scatter objects already in a pile), and (iii) infer long-range plans (e.g., move the robot to strategic blowing locations). We tackle these challenges in the context of deep reinforcement learning, introducing a multi-frequency version of the spatial action maps framework. This allows for efficient learning of vision-based policies that effectively combine high-level planning and low-level closed-loop control for dynamic mobile manipulation. Experiments show that our system learns efficient behaviors for the task, demonstrating in particular that blowing achieves better downstream performance than pushing, and that our policies improve performance over baselines. Moreover, we show that our system naturally encourages emergent specialization between the different subpolicies spanning low-level fine-grained control and high-level planning. On a real mobile robot equipped with a miniature air blower, we show that our simulation-trained policies transfer well to a real environment and can generalize to novel objects.  ( 3 min )
    QUIDAM: A Framework for Quantization-Aware DNN Accelerator and Model Co-Exploration. (arXiv:2206.15463v1 [cs.AR])
    As the machine learning and systems communities strive to achieve higher energy-efficiency through custom deep neural network (DNN) accelerators, varied precision or quantization levels, and model compression techniques, there is a need for design space exploration frameworks that incorporate quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QUIDAM, a highly parameterized quantization-aware DNN accelerator and model co-exploration framework. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, number of total processing elements, and DNN configurations. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our framework identifies a wide range of design points where performance per area and energy varies more than 5x and 35x, respectively. With the proposed framework, we show that lightweight processing elements achieve on par accuracy results and up to 5.7x more performance per area and energy improvement when compared to the best INT16 based implementation. Finally, due to the efficiency of the pre-characterized power, performance, and area models, QUIDAM can speed up the design exploration process by 3-4 orders of magnitude as it removes the need for expensive synthesis and characterization of each design.  ( 3 min )
    Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding. (arXiv:2206.15427v1 [eess.AS])
    This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech (TTS) problem under the few-shot setting. Transfer learning is a common approach when it comes to few-shot learning, since training from scratch on few-shot training data is bound to overfit. Still, we find that the naive transfer learning approach fails to adapt to unseen languages under extremely few-shot settings, where less than 8 minutes of data is provided. We deal with the problem by proposing a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space. Furthermore, by utilizing phoneme-level averaged self-supervised learned features, we effectively improve the quality of the synthesized speech. Experiments show that using 4 utterances, which is about 30 seconds of data, is enough to synthesize intelligible speech when adapting to an unseen language using our framework.
    Adaptive Cut Selection in Mixed-Integer Linear Programming. (arXiv:2202.10962v2 [math.OC] UPDATED)
    Cut selection is a subroutine used in all modern mixed-integer linear programming solvers with the goal of selecting a subset of generated cuts that induce optimal solver performance. These solvers have millions of parameter combinations, and so are excellent candidates for parameter tuning. Cut selection scoring rules are usually weighted sums of different measurements, where the weights are parameters. We present a parametric family of mixed-integer linear programs together with infinitely many family-wide valid cuts. Some of these cuts can induce integer optimal solutions directly after being applied, while others fail to do so even if infinitely many are applied. For a specific cut selection rule, we show that any finite grid search of the parameter space will miss all parameter values that select integer-optimality-inducing cuts in infinitely many of our problem instances. We propose a variation on the design of existing graph convolutional neural networks, adapting them to learn cut selection rule parameters. We present a reinforcement learning framework for selecting cuts, and train our design using said framework over MIPLIB 2017. Our framework and design show that adaptive cut selection does substantially improve performance over a diverse set of instances, but that finding a single function describing such a rule is difficult. Code for reproducing all experiments is available at https://github.com/Opt-Mucca/Adaptive-Cutsel-MILP.  ( 3 min )
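    A weighted-sum cut scoring rule of the kind being parameterized can be sketched as follows; the measurement names and values are illustrative assumptions, not a solver's actual API:

    from collections import namedtuple

    # Hypothetical per-cut measurements; real solvers expose richer statistics.
    Cut = namedtuple("Cut", ["efficacy", "obj_parallelism", "int_support"])

    def cut_score(cut, weights):
        # The score is a weighted sum of measurements; the weights are the
        # parameters a learned selection rule would set.
        return (weights["efficacy"] * cut.efficacy
                + weights["obj_parallelism"] * cut.obj_parallelism
                + weights["int_support"] * cut.int_support)

    cuts = [Cut(0.8, 0.2, 0.5), Cut(0.4, 0.9, 0.7)]
    weights = {"efficacy": 1.0, "obj_parallelism": 0.1, "int_support": 0.1}
    best = max(cuts, key=lambda c: cut_score(c, weights))  # cut added to the LP
    print(best)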
    Model-Value Inconsistency as a Signal for Epistemic Uncertainty. (arXiv:2112.04153v3 [cs.LG] UPDATED)
    Using a model of the environment and a value function, an agent can construct many estimates of a state's value, by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an \emph{implicit value ensemble} (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal \emph{model-value inconsistency} or \emph{self-inconsistency} for short. Unlike prior work which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function which are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.  ( 3 min )
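    The implicit value ensemble admits a compact sketch under toy assumptions (a deterministic one-step model and a scalar state, neither from the paper): compute k-step model-based value estimates for several k and use their spread as the uncertainty signal.

    import numpy as np

    def k_step_value(model, value_fn, state, k, gamma=0.99):
        # k-step model rollout, bootstrapped with the value function at the end.
        ret, s = 0.0, state
        for i in range(k):
            s, r = model(s)
            ret += (gamma ** i) * r
        return ret + (gamma ** k) * value_fn(s)

    def self_inconsistency(model, value_fn, state, max_k=5):
        estimates = [k_step_value(model, value_fn, state, k) for k in range(max_k + 1)]
        return np.std(estimates)  # disagreement = epistemic-uncertainty proxy

    model = lambda s: (0.9 * s, -abs(s))             # toy deterministic dynamics/reward
    value_fn = lambda s: -abs(s) / (1 - 0.9 * 0.99)  # value consistent with the toy model
    # A consistent value function gives near-zero inconsistency; a misestimated
    # one would not.
    print(self_inconsistency(model, value_fn, state=1.0))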
    TINC: Temporally Informed Non-Contrastive Learning for Disease Progression Modeling in Retinal OCT Volumes. (arXiv:2206.15282v1 [cs.CV])
    Recent contrastive learning methods achieved state-of-the-art results in low-label regimes. However, training requires large batch sizes and heavy augmentations to create multiple views of an image. With non-contrastive methods, the negatives are implicitly incorporated in the loss, allowing different images and modalities as pairs. Although the meta-information (e.g., age, sex) in medical imaging is abundant, the annotations are noisy and prone to class imbalance. In this work, we exploited already existing temporal information (different visits from a patient) in a longitudinal optical coherence tomography (OCT) dataset using a temporally informed non-contrastive loss (TINC) without increasing complexity or the need for negative pairs. Moreover, our novel pair-forming scheme can avoid heavy augmentations and implicitly incorporates the temporal information in the pairs. Finally, the representations learned from this pretraining are more successful in predicting disease progression, where the temporal information is crucial for the downstream task. More specifically, our model outperforms existing models in predicting the risk of conversion within a time frame from intermediate age-related macular degeneration (AMD) to the late wet-AMD stage.  ( 2 min )
    Federated Over-Air Subspace Tracking from Incomplete and Corrupted Data. (arXiv:2002.12873v4 [cs.LG] UPDATED)
    In this work, we study the problem of Subspace Tracking with missing data (ST-miss) and outliers (Robust ST-miss). We propose a novel algorithm and provide a guarantee for both these problems. Unlike past work on this topic, the current work does not impose the piecewise-constant subspace change assumption, and the proposed algorithm is much simpler (it uses fewer parameters) than our previous work. We then extend our approach and its analysis to provably solving these problems when the data is federated and the over-air data communication modality is used for information exchange between the $K$ peer nodes and the center. We validate our theoretical claims with extensive numerical experiments.  ( 2 min )
    Data-Efficient Learning via Minimizing Hyperspherical Energy. (arXiv:2206.15204v1 [cs.LG])
    Deep learning on large-scale data is dominant nowadays. The unprecedented scale of data has arguably been one of the most important driving forces for the success of deep learning. However, there still exist scenarios where collecting data or labels could be extremely expensive, e.g., medical imaging and robotics. To fill this gap, this paper considers the problem of data-efficient learning from scratch using a small amount of representative data. First, we characterize this problem by active learning on homeomorphic tubes of spherical manifolds, which naturally generates a feasible hypothesis class. Using homologous topological properties, we identify an important connection: finding tube manifolds is equivalent to minimizing hyperspherical energy (MHE) in physical geometry. Inspired by this connection, we propose an MHE-based active learning (MHEAL) algorithm and provide comprehensive theoretical guarantees for MHEAL, covering convergence and generalization analysis. Finally, we demonstrate the empirical performance of MHEAL in a wide range of applications for data-efficient learning, including deep clustering, distribution matching, version space sampling, and deep active learning.  ( 2 min )
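    Hyperspherical energy itself is straightforward to compute; a small numpy sketch with the Riesz exponent s as a free parameter (illustrative, not the MHEAL implementation). Lower energy corresponds to a more uniform, hence more representative, spread of points on the sphere.

    import numpy as np

    def hyperspherical_energy(X, s=1.0, eps=1e-8):
        # E_s = sum_{i<j} ||x_i - x_j||^{-s} over unit-normalized rows of X.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto sphere
        diffs = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        iu = np.triu_indices(len(X), k=1)                 # unique pairs i < j
        return np.sum(1.0 / (diffs[iu] ** s + eps))

    print(hyperspherical_energy(np.random.randn(50, 8)))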
    The Topological BERT: Transforming Attention into Topology for Natural Language Processing. (arXiv:2206.15195v1 [cs.CL])
    In recent years, the introduction of the Transformer models sparked a revolution in natural language processing (NLP). BERT was one of the first text encoders using only the attention mechanism without any recurrent parts to achieve state-of-the-art results on many NLP tasks. This paper introduces a text classifier using topological data analysis. We use BERT's attention maps transformed into attention graphs as the only input to that classifier. The model can solve tasks such as distinguishing spam from ham messages, recognizing whether a sentence is grammatically correct, or evaluating a movie review as negative or positive. It performs comparably to the BERT baseline and outperforms it on some tasks. Additionally, we propose a new method to reduce the number of BERT's attention heads considered by the topological classifier, which allows us to prune the number of heads from 144 down to as few as ten with no reduction in performance. Our work also shows that the topological model displays higher robustness against adversarial attacks than the original BERT model, which is maintained during the pruning process. To the best of our knowledge, this work is the first to confront topological-based models with adversarial attacks in the context of NLP.  ( 2 min )
    Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations. (arXiv:2106.13876v3 [cs.CL] UPDATED)
    Models that generate extractive rationales (i.e., subsets of features) or natural language explanations (NLEs) for their predictions are important for explainable AI. While an extractive rationale provides a quick view of the features most responsible for a prediction, an NLE allows for a comprehensive description of the decision-making process behind a prediction. However, current models that generate the best extractive rationales or NLEs often fall behind the state-of-the-art (SOTA) in terms of task performance. In this work, we bridge this gap by introducing RExC, a self-rationalizing framework that grounds its predictions and two complementary types of explanations (NLEs and extractive rationales) in background knowledge. Our framework improves over previous methods by: (i) reaching SOTA task performance while also providing explanations, (ii) providing two types of explanations, while existing models usually provide only one type, and (iii) beating by a large margin the previous SOTA in terms of quality of both types of explanations. Furthermore, a perturbation analysis in RExC shows a high degree of association between explanations and predictions, a necessary property of faithful explanations.  ( 3 min )
    PhySRNet: Physics informed super-resolution network for application in computational solid mechanics. (arXiv:2206.15457v1 [cond-mat.mtrl-sci])
    Traditional approaches based on finite element analyses have been successfully used to predict the macro-scale behavior of heterogeneous materials (composites, multicomponent alloys, and polycrystals) widely used in industrial applications. However, this necessitates a mesh size smaller than the characteristic length scale of the microstructural heterogeneities in the material, leading to computationally expensive and time-consuming calculations. The recent advances in deep learning based image super-resolution (SR) algorithms open up a promising avenue to tackle this computational challenge by enabling researchers to enhance the spatio-temporal resolution of data obtained from coarse mesh simulations. However, technical challenges still remain in developing a high-fidelity SR model for application to computational solid mechanics, especially for materials undergoing large deformation. This work aims at developing a physics-informed deep learning based super-resolution framework (PhySRNet) which enables reconstruction of high-resolution deformation fields (displacement and stress) from their low-resolution counterparts without requiring high-resolution labeled data. We design a synthetic case study to illustrate the effectiveness of the proposed framework and demonstrate that the super-resolved fields match the accuracy of an advanced numerical solver running at 400 times the coarse mesh resolution while simultaneously satisfying the (highly nonlinear) governing laws. The approach opens the door to applying machine learning and traditional numerical approaches in tandem to reduce computational complexity, accelerate scientific discovery, and aid engineering design.
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v2 [stat.ML] UPDATED)
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 3 min )
    Auto Response Generation in Online Medical Chat Services. (arXiv:2104.12755v2 [cs.CL] UPDATED)
    Telehealth helps to facilitate access to medical professionals by enabling remote medical services for patients. These services have gradually become popular over the years with the advent of the necessary technological infrastructure. The benefits of telehealth have been even more apparent since the beginning of the COVID-19 crisis, as people have become less inclined to visit doctors in person during the pandemic. In this paper, we focus on facilitating chat sessions between a doctor and a patient. We note that the quality and efficiency of the chat experience can be critical as the demand for telehealth services increases. Accordingly, we develop a smart auto-response generation mechanism for medical conversations that helps doctors respond to consultation requests efficiently, particularly during busy sessions. We explore over 900,000 anonymous, historical online messages between doctors and patients collected over nine months. We implement clustering algorithms to identify the most frequent responses by doctors and manually label the data accordingly. We then train machine learning algorithms using this preprocessed data to generate the responses. The considered algorithm has two steps: a filtering (i.e., triggering) model to filter out infeasible patient messages and a response generator to suggest the top-3 doctor responses for the ones that successfully pass the triggering phase. The method achieves a precision@3 of 83.28\% and shows robustness to its parameters.  ( 3 min )
    DAReN: A Collaborative Approach Towards Reasoning And Disentangling. (arXiv:2109.13156v2 [cs.LG] UPDATED)
    Computational learning approaches to solving visual reasoning tests, such as Raven's Progressive Matrices (RPM), critically depend on the ability to identify the visual concepts used in the test (i.e., the representation) as well as the latent rules based on those concepts (i.e., the reasoning). However, learning of representation and reasoning is a challenging and ill-posed task, often approached in a stage-wise manner (first representation, then reasoning). In this work, we propose an end-to-end joint representation-reasoning learning framework, which leverages a weak form of inductive bias to improve both tasks together. Specifically, we introduce a general generative graphical model for RPMs, GM-RPM, and apply it to solve the reasoning test. We accomplish this using a novel learning framework Disentangling based Abstract Reasoning Network (DAReN) based on the principles of GM-RPM. We perform an empirical evaluation of DAReN over several benchmark datasets. DAReN shows consistent improvement over state-of-the-art (SOTA) models on both the reasoning and the disentanglement tasks. This demonstrates the strong correlation between disentangled latent representation and the ability to solve abstract visual reasoning tasks.  ( 2 min )
    Wasserstein GANs with Gradient Penalty Compute Congested Transport. (arXiv:2109.00528v2 [cs.LG] UPDATED)
    Wasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [23]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [12] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new, in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. One facet of our discovery is a formula connecting the gradient of solutions to the optimization problem in WGAN-GP to the time averaged momentum of the optimal mass flow. This is in contrast to the gradient of Kantorovich potentials for the Wasserstein 1 distance, which is just the normalized direction of flow. Based on this and other considerations, we speculate on how our results explain the observed performance of WGAN-GP. Beyond applications to GANs, our theorems also point to the possibility of approximately solving large scale congested transport problems using neural network techniques.  ( 3 min )
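    For context, the gradient penalty in question is the standard WGAN-GP term, computed at points interpolated between real and generated samples. A common PyTorch formulation (a widely known public recipe, not code from this paper; it assumes 2D batches of shape (batch, features)) is:

    import torch

    def gradient_penalty(critic, real, fake, lam=10.0):
        # Sample points on line segments between real and generated samples.
        eps = torch.rand(real.size(0), 1, device=real.device)
        x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
        # Penalize deviation of the critic's gradient norm from 1 at x_hat.
        grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
        return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()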
    Augmenting Reinforcement Learning with Behavior Primitives for Diverse Manipulation Tasks. (arXiv:2110.03655v3 [cs.LG] UPDATED)
    Realistic manipulation tasks require a robot to interact with an environment with a prolonged sequence of motor actions. While deep reinforcement learning methods have recently emerged as a promising paradigm for automating manipulation behaviors, they usually fall short in long-horizon tasks due to the exploration burden. This work introduces Manipulation Primitive-augmented reinforcement Learning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that involves the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Videos and code are available at https://ut-austin-rpl.github.io/maple  ( 2 min )
    Universal and data-adaptive algorithms for model selection in linear contextual bandits. (arXiv:2111.04688v2 [cs.LG] UPDATED)
    Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form $\mathcal{O}(d^{\alpha} T^{1- \alpha})$ with no feature diversity conditions whatsoever, where $d$ denotes the dimension of the linear model and $T$ denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.  ( 2 min )
    GDA-AM: On the effectiveness of solving minimax optimization via Anderson Acceleration. (arXiv:2110.02457v3 [cs.LG] UPDATED)
    Many modern machine learning algorithms such as generative adversarial networks (GANs) and adversarial training can be formulated as minimax optimization. Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity. However, GDA can converge to non-optimal minimax points. We propose a new minimax optimization framework, GDA-AM, that views the GDA dynamics as a fixed-point iteration and solves it using Anderson Mixing to converge to the local minimax. It addresses the diverging issue of simultaneous GDA and accelerates the convergence of alternating GDA. We show theoretically that the algorithm can achieve global convergence for bilinear problems under mild conditions. We also empirically show that GDA-AM solves a variety of minimax problems and improves GAN training on several datasets.  ( 2 min )
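    Anderson Mixing for a generic fixed-point iteration x <- g(x) can be sketched as below; this is a textbook-style formulation (window size and regularization are illustrative assumptions), not the GDA-AM code:

    import numpy as np

    def anderson_fixed_point(g, x0, m=5, n_iters=50):
        # Combine the last m iterates so the combined residual
        # sum(alpha_i * (g(x_i) - x_i)) has minimal norm, with sum(alpha) = 1.
        xs, gs = [x0], [g(x0)]
        for _ in range(n_iters):
            mk = min(m, len(xs))
            F = np.stack([gs[-i] - xs[-i] for i in range(mk, 0, -1)], axis=1)
            M = F.T @ F + 1e-10 * np.eye(mk)     # regularized normal equations
            w = np.linalg.solve(M, np.ones(mk))
            alpha = w / w.sum()                  # solves min ||F a|| s.t. sum(a) = 1
            x_next = sum(a * gs[-i] for a, i in zip(alpha, range(mk, 0, -1)))
            xs.append(x_next)
            gs.append(g(x_next))
        return xs[-1]

    print(anderson_fixed_point(np.cos, np.array([1.0])))  # ~0.739, fixed point of cos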
    A deep convolutional neural network that is invariant to time rescaling. (arXiv:2107.04616v3 [cs.LG] UPDATED)
    Human learners can readily understand speech, or a melody, when it is presented slower or faster than usual. Although deep convolutional neural networks (CNNs) are extremely powerful in extracting information from time series, they require explicit training to generalize to different time scales. This paper presents a deep CNN that incorporates a temporal representation inspired by recent findings from neuroscience. In the mammalian brain, time is represented by populations of neurons with temporal receptive fields. Critically, the peaks of the receptive fields form a geometric series, such that the population codes a set of temporal basis functions over log time. Because memory for the recent past is a function of log time, rescaling the input results in translation of the memory. The Scale-Invariant Temporal History Convolution network (SITHCon) builds a convolutional layer over this logarithmically-distributed temporal memory. A max-pool operation results in a network that is invariant to rescalings of time modulo edge effects. We compare performance of SITHCon to a Temporal Convolution Network (TCN). Although both networks can learn classification and regression problems on both univariate and multivariate time series f(t), only SITHCon generalizes to rescalings f(at). This property, inspired by findings from contemporary neuroscience and consistent with findings from cognitive psychology, may enable networks that learn with fewer training examples, fewer weights and that generalize more robustly to out of sample data.  ( 3 min )
    A Latent Restoring Force Approach to Nonlinear System Identification. (arXiv:2109.10681v2 [stat.ML] UPDATED)
    Identification of nonlinear dynamic systems remains a significant challenge across engineering. This work suggests an approach based on Bayesian filtering to extract and identify the contribution of an unknown nonlinear term in the system, which can be seen as an alternative viewpoint on restoring-force-surface-type approaches. To achieve this identification, the contribution, namely the nonlinear restoring force, is initially modelled as a Gaussian process in time. That Gaussian process is converted into a state-space model and combined with the linear dynamic component of the system. Then, by inference of the filtering and smoothing distributions, the internal states of the system and the nonlinear restoring force can be extracted. In possession of these states, a nonlinear model can be constructed. The approach is demonstrated to be effective in both a simulated case study and on an experimental benchmark dataset.  ( 2 min )
    FL-Tuning: Layer Tuning for Feed-Forward Network in Transformer. (arXiv:2206.15312v1 [cs.CL])
    Prompt tuning is an emerging way of adapting pre-trained language models to downstream tasks. However, existing studies mainly add prompts to the input sequence. This may not work as expected due to the intermediate multi-head self-attention and feed-forward network computation, which makes model optimization less smooth. Hence, we propose a novel tuning method called layer tuning, which aims to add learnable parameters in Transformer layers. Specifically, we focus on layer tuning for the feed-forward network in the Transformer, namely FL-tuning. It introduces additional units into the hidden layer of each feed-forward network. We conduct extensive experiments on the public CLUE benchmark. The results show that: 1) Our FL-tuning outperforms prompt tuning methods under both full-data and few-shot settings in almost all cases. In particular, it improves accuracy by 17.93% (full-data setting) on WSC 1.0 and F1 by 16.142% (few-shot setting) on CLUENER over P-tuning v2. 2) Our FL-tuning is more stable and converges about 1.17 times faster than P-tuning v2. 3) With only about 3% of Transformer's parameters to be trained, FL-tuning is comparable with fine-tuning on most datasets, and significantly outperforms fine-tuning (e.g., accuracy improved by 12.9% on WSC 1.1) on several datasets. The source codes are available at https://github.com/genggui001/FL-Tuning.  ( 2 min )
    On the Learning and Learnability of Quasimetrics. (arXiv:2206.15478v1 [cs.LG])
    Our world is full of asymmetries. Gravity and wind can make reaching a place easier than coming back. Social artifacts such as genealogy charts and citation graphs are inherently directed. In reinforcement learning and control, optimal goal-reaching strategies are rarely reversible (symmetrical). Distance functions supported on these asymmetrical structures are called quasimetrics. Despite their common appearance, little research has been done on the learning of quasimetrics. Our theoretical analysis reveals that a common class of learning algorithms, including unconstrained multilayer perceptrons (MLPs), provably fails to learn a quasimetric consistent with training data. In contrast, our proposed Poisson Quasimetric Embedding (PQE) is the first quasimetric learning formulation that both is learnable with gradient-based optimization and enjoys strong performance guarantees. Experiments on random graphs, social graphs, and offline Q-learning demonstrate its effectiveness over many common baselines.  ( 2 min )
    Implicit Neural Spatial Filtering for Multichannel Source Separation in the Waveform Domain. (arXiv:2206.15423v1 [cs.SD])
    We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene. We divide the scene into two spatial regions containing, respectively, the target and the interfering sound sources. The model is trained end-to-end and performs spatial processing implicitly, without any components based on traditional processing or use of hand-crafted spatial features. We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
    SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification. (arXiv:2103.16725v2 [cs.CV] UPDATED)
    A common classification task situation is where one has a large amount of data available for training, but only a small portion is annotated with class labels. The goal of semi-supervised training, in this context, is to improve classification accuracy by leveraging information not only from labeled data but also from a large amount of unlabeled data. Recent works have achieved significant improvements by exploiting the consistency constraint between differently augmented labeled and unlabeled data. Following this path, we propose a novel unsupervised objective that focuses on the less studied relationship between high-confidence unlabeled data points that are similar to each other. The newly proposed Pair Loss minimizes the statistical distance between high-confidence pseudo labels with similarity above a certain threshold. Combining the Pair Loss with the techniques developed by the MixMatch family, our proposed SimPLE algorithm shows significant performance gains over previous algorithms on CIFAR-100 and Mini-ImageNet, and is on par with the state-of-the-art methods on CIFAR-10 and SVHN. Furthermore, SimPLE also outperforms the state-of-the-art methods in the transfer learning setting, where models are initialized by weights pre-trained on ImageNet or DomainNet-Real. The code is available at github.com/zijian-hu/SimPLE.  ( 3 min )
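    The Pair Loss idea can be sketched roughly as follows; the thresholds and the L2 distance used as the "statistical distance" here are assumptions for illustration, not the exact choices in the paper:

    import torch

    def pair_loss(probs, conf_thresh=0.95, sim_thresh=0.9):
        # probs: (B, C) predicted class probabilities for unlabeled samples.
        conf, _ = probs.max(dim=1)
        sim = probs @ probs.T                     # pairwise label similarity
        mask = (conf[:, None] > conf_thresh) & (sim > sim_thresh)
        mask.fill_diagonal_(False)                # ignore self-pairs
        if not mask.any():
            return probs.new_zeros(())
        dist = torch.cdist(probs, probs, p=2)     # distance between label distributions
        return dist[mask].mean()                  # pull confident similar pairs together

    p = torch.softmax(5 * torch.randn(8, 10), dim=1)
    print(pair_loss(p))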
    Neural Annotation Refinement: Development of a New 3D Dataset for Adrenal Gland Analysis. (arXiv:2206.15328v1 [cs.CV])
    Human annotations are imperfect, especially when produced by junior practitioners. Multi-expert consensus is usually regarded as the golden standard, but this annotation protocol is too expensive to implement in many real-world projects. In this study, we propose a method to refine human annotation, named Neural Annotation Refinement (NeAR). It is based on a learnable implicit function, which decodes a latent vector into a represented shape. By integrating the appearance as an input of the implicit function, the appearance-aware NeAR fixes annotation artefacts. Our method is demonstrated on the application of adrenal gland analysis. We first show that NeAR can repair distorted golden standards on a public adrenal gland segmentation dataset. Besides, we develop a new Adrenal gLand ANalysis (ALAN) dataset with the proposed NeAR, where each case consists of a 3D shape of an adrenal gland and its diagnosis label (normal vs. abnormal) assigned by experts. We show that models trained on the shapes repaired by NeAR can diagnose adrenal glands better than those trained on the original ones. The ALAN dataset will be open-source, with 1,594 shapes for adrenal gland diagnosis, which serves as a new benchmark for medical shape analysis. Code and dataset are available at https://github.com/M3DV/NeAR.  ( 3 min )
    Why we do need Explainable AI for Healthcare. (arXiv:2206.15363v1 [cs.HC])
    The recent spike in certified Artificial Intelligence (AI) tools for healthcare has renewed the debate around adoption of this technology. One thread of such debate concerns Explainable AI and its promise to render AI devices more transparent and trustworthy. A few voices active in the medical AI space have expressed concerns about the reliability of Explainable AI techniques, questioning their use and inclusion in guidelines and standards. Revisiting such criticisms, this article offers a balanced and comprehensive perspective on the utility of Explainable AI, focusing on the specificity of clinical applications of AI and placing them in the context of healthcare interventions. Against its detractors and despite valid concerns, we argue that the Explainable AI research program is still central to human-machine interaction and is ultimately our main tool against loss of control, a danger that cannot be prevented by rigorous clinical validation alone.
    Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning. (arXiv:2206.15143v1 [cs.LG])
    The second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction on accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC algorithms require computing and communicating a large volume of second-order information, i.e., Kronecker factors (KFs), before preconditioning gradients, resulting in large computation and communication overheads as well as a high memory footprint. In this paper, we propose DP-KFAC, a novel distributed preconditioning scheme that distributes the KF constructing tasks at different DNN layers to different workers. DP-KFAC not only retains the convergence property of the existing D-KFAC algorithms but also enables three benefits: reduced computation overhead in constructing KFs, no communication of KFs, and low memory footprint. Extensive experiments on a 64-GPU cluster show that DP-KFAC reduces the computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update compared to the state-of-the-art D-KFAC methods.
    Graph-Time Convolutional Neural Networks: Architecture and Theoretical Analysis. (arXiv:2206.15174v1 [cs.LG])
    Devising and analyzing learning models for spatiotemporal network data is of importance for tasks including forecasting, anomaly detection, and multi-agent coordination, among others. Graph Convolutional Neural Networks (GCNNs) are an established approach to learn from time-invariant network data. The graph convolution operation offers a principled approach to aggregate multiresolution information. However, extending this principled convolutional learning and the respective analysis to the spatiotemporal domain is challenging because spatiotemporal data have more intrinsic dependencies. Hence, a higher flexibility to capture jointly the spatial and the temporal dependencies is required to learn meaningful higher-order representations. Here, we leverage product graphs to represent the spatiotemporal dependencies in the data and introduce Graph-Time Convolutional Neural Networks (GTCNNs) as a principled architecture to aid learning. The proposed approach can work with any type of product graph, and we also introduce a parametric product graph to learn the spatiotemporal coupling as well. The convolution principle further allows a similar mathematical tractability as for GCNNs. In particular, the stability result shows GTCNNs are stable to spatial perturbations, but there is an implicit trade-off between discriminability and robustness; i.e., the more complex the model, the less stable it is. Extensive numerical results on benchmark datasets corroborate our findings and show the GTCNN compares favorably with state-of-the-art solutions. We anticipate the GTCNN to be a starting point for more sophisticated models that achieve good performance but are also fundamentally grounded.
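    A Cartesian product graph, one of the product types such an architecture could operate on, is easy to construct explicitly; a numpy sketch on small path graphs (illustrative, not the paper's code). The product adjacency couples spatial neighbors at a fixed time and the same node across adjacent time steps.

    import numpy as np

    def cartesian_product_graph(A_T, A_G):
        # A_T: (T, T) temporal adjacency; A_G: (N, N) spatial adjacency.
        T, N = A_T.shape[0], A_G.shape[0]
        return np.kron(A_T, np.eye(N)) + np.kron(np.eye(T), A_G)  # (T*N, T*N)

    A_T = np.diag(np.ones(3), k=1) + np.diag(np.ones(3), k=-1)    # 4-step path graph
    A_G = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)      # 3-node path graph
    print(cartesian_product_graph(A_T, A_G).shape)                # (12, 12)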
    Laplacian Autoencoders for Learning Stochastic Representations. (arXiv:2206.15078v1 [cs.LG])
    Representation learning has become a practical family of methods for building rich parametric codifications of massive high-dimensional data while succeeding at reconstruction. When considering unsupervised tasks with test-train distribution shifts, the probabilistic viewpoint helps address overconfidence and poor calibration of predictions. However, the direct introduction of Bayesian inference on top of neural network weights is still an arduous problem for multiple reasons, e.g., the curse of dimensionality or intractability issues. The Laplace approximation (LA) offers a solution here, as one may build Gaussian approximations of the posterior density of weights via second-order Taylor expansions in certain locations of the parameter space. In this work, we present a Bayesian autoencoder for unsupervised representation learning inspired by the LA. Our method implements iterative Laplace updates to obtain a novel variational lower bound of the autoencoder evidence. The vast computational burden of the second-order partial derivatives is avoided via approximations of the Hessian matrix. Empirically, we demonstrate the scalability and performance of the Laplacian autoencoder by providing well-calibrated uncertainties for out-of-distribution detection, geodesics for differential geometry, and missing data imputations.
    HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection. (arXiv:2206.15157v1 [cs.CV])
    Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
    A note on large deviations for interacting particle dynamics for finding mixed equilibria in zero-sum games. (arXiv:2206.15177v1 [stat.ML])
    Finding equilibrium points in continuous minimax games has become a key problem within machine learning, in part due to its connection to the training of generative adversarial networks. Because of existence and robustness issues, recent developments have shifted from pure equilibria to focusing on mixed equilibrium points. In this note we consider a method proposed by Domingo-Enrich et al. for finding mixed equilibria in two-layer zero-sum games. The method is based on entropic regularisation, and the two competing strategies are represented by two sets of interacting particles. We show that the sequence of empirical measures of the particle system satisfies a large deviation principle as the number of particles grows to infinity, and how this implies convergence of the empirical measure and the associated Nikaid\^o-Isoda error, complementing existing law of large numbers results.
    Practical Black Box Hamiltonian Learning. (arXiv:2206.15464v1 [quant-ph])
    We study the problem of learning the parameters for the Hamiltonian of a quantum many-body system, given limited access to the system. In this work, we build upon recent approaches to Hamiltonian learning via derivative estimation. We propose a protocol that improves the scaling dependence of prior works, particularly with respect to parameters relating to the structure of the Hamiltonian (e.g., its locality $k$). Furthermore, by deriving exact bounds on the performance of our protocol, we are able to provide a precise numerical prescription for theoretically optimal settings of hyperparameters in our learning protocol, such as the maximum evolution time (when learning with unitary dynamics) or minimum temperature (when learning with Gibbs states). Thanks to these improvements, our protocol is practical for large problems: we demonstrate this with a numerical simulation of our protocol on an 80-qubit system.
    Neural Networks can Learn Representations with Gradient Descent. (arXiv:2206.15144v1 [cs.LG])
    Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a two layer neural network outside the kernel regime by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime. Specifically, we consider the problem of learning polynomials which depend on only a few relevant directions, i.e. of the form $f^\star(x) = g(Ux)$ where $U: \mathbb{R}^d \to \mathbb{R}^r$ with $d \gg r$. When the degree of $f^\star$ is $p$, it is known that $n \asymp d^p$ samples are necessary to learn $f^\star$ in the kernel regime. Our primary result is that gradient descent learns a representation of the data which depends only on the directions relevant to $f^\star$. This results in an improved sample complexity of $n\asymp d^2 r + dr^p$. Furthermore, in a transfer learning setup where the data distributions in the source and target domain share the same representation $U$ but have different polynomial heads we show that a popular heuristic for transfer learning has a target sample complexity independent of $d$.
    LIDL: Local Intrinsic Dimension Estimation Using Approximate Likelihood. (arXiv:2206.14882v1 [stat.ML])
    Most of the existing methods for estimating the local intrinsic dimension of a data distribution do not scale well to high-dimensional data. Many of them rely on a non-parametric nearest neighbors approach which suffers from the curse of dimensionality. We attempt to address that challenge by proposing a novel approach to the problem: Local Intrinsic Dimension estimation using approximate Likelihood (LIDL). Our method relies on an arbitrary density estimation method as its subroutine and hence tries to sidestep the dimensionality challenge by making use of the recent progress in parametric neural methods for likelihood estimation. We carefully investigate the empirical properties of the proposed method, compare them with our theoretical predictions, and show that LIDL yields competitive results on the standard benchmarks for this problem and that it scales to thousands of dimensions. What is more, we anticipate this approach to improve further with the continuing advances in the density estimation literature.
    Causality-Based Multivariate Time Series Anomaly Detection. (arXiv:2206.15033v1 [cs.LG])
    Anomaly detection in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e.g., IT system operations or manufacturing industry. Previous approaches model the joint distribution without considering the underlying mechanism of multivariate time series, making them complicated and computationally expensive. In this paper, we formulate the anomaly detection problem from a causal perspective and view anomalies as instances that do not follow the regular causal mechanism that generates the multivariate data. We then propose a causality-based anomaly detection approach, which first learns the causal structure from data and then infers whether an instance is an anomaly relative to the local causal mechanism that generates each variable from its direct causes, whose conditional distribution can be directly estimated from data. In light of the modularity property of causal systems, the original problem is divided into a series of separate low-dimensional anomaly detection problems, so that the location of an anomaly can be directly identified. We evaluate our approach with both simulated and public datasets as well as a case study on real-world AIOps applications, showing its efficacy, robustness, and practical feasibility.
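    A minimal sketch of the scoring step, under heavy simplifying assumptions: the causal graph is taken as given rather than learned, and linear-Gaussian conditionals stand in for whatever estimator a full implementation would use; variable names and the toy data are illustrative.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 5000
    x0 = rng.normal(size=n)
    x1 = 2.0 * x0 + 0.1 * rng.normal(size=n)     # mechanism x0 -> x1
    x2 = -x1 + 0.1 * rng.normal(size=n)          # mechanism x1 -> x2
    X = np.column_stack([x0, x1, x2])
    parents = {0: [], 1: [0], 2: [1]}            # assumed causal structure

    def fit_mechanisms(X, parents):
        models = {}
        for i, pa in parents.items():
            if pa:
                reg = LinearRegression().fit(X[:, pa], X[:, i])
                resid = X[:, i] - reg.predict(X[:, pa])
            else:
                reg, resid = None, X[:, i] - X[:, i].mean()
            models[i] = (reg, resid.std())
        return models

    def anomaly_scores(x, X, parents, models):
        # per-variable |z|-score of x_i given its direct causes; modularity
        # localizes which mechanism is violated
        scores = {}
        for i, pa in parents.items():
            reg, sigma = models[i]
            mu = reg.predict(x[pa].reshape(1, -1))[0] if pa else X[:, i].mean()
            scores[i] = abs(x[i] - mu) / sigma
        return scores

    models = fit_mechanisms(X, parents)
    x_bad = np.array([0.0, 5.0, -5.0])           # x1 violates its mechanism
    print(anomaly_scores(x_bad, X, parents, models))   # large score only for x1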
    Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?. (arXiv:2206.14969v1 [cs.CL])
    Previous Part-Of-Speech (POS) induction models typically rely on independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependencies and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset and the universal treebank containing 10 diverse languages. Although modeling long-term dependencies should ideally help this task, our ablation study shows mixed trends across languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model's ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges. Our code is available at https://github.com/owenzx/MPoSM  ( 2 min )
    Machine Learning Approaches to Predict Breast Cancer: Bangladesh Perspective. (arXiv:2206.14972v1 [cs.LG])
    Breast cancer has become one of the most prominent causes of death in recent years. Among all malignancies, it is the most frequent and the leading cause of death for women globally. Manually diagnosing this disease requires considerable time and expertise, and because detection is time-consuming, the spread of the disease can be reduced by developing machine-based breast cancer predictions. In machine learning, a system can learn from prior instances and find hard-to-detect patterns in noisy or complicated datasets using various statistical, probabilistic, and optimization approaches. This work compares the classification accuracy, precision, sensitivity, and specificity of several machine learning algorithms on a newly collected dataset. Five approaches, namely Decision Tree, Random Forest, Logistic Regression, Naive Bayes, and XGBoost, were implemented to obtain the best performance on our dataset. The study focuses on finding the algorithm that can forecast breast cancer classes with maximum accuracy, evaluates the quality of each algorithm's classification in terms of efficiency and effectiveness, and compares the results with other published work in this domain. After implementing the models, this study achieved its best accuracy, 94%, with Random Forest and XGBoost.  ( 3 min )
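    For a sense of what such a comparison looks like in practice, here is a hedged sketch using scikit-learn's public WDBC breast cancer dataset as a stand-in for the paper's newly collected data, with GradientBoostingClassifier substituted for XGBoost to keep the dependencies standard; the split and seeds are illustrative.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "RandomForest": RandomForestClassifier(random_state=0),
        "LogisticRegression": LogisticRegression(max_iter=5000),
        "NaiveBayes": GaussianNB(),
        "GradientBoosting": GradientBoostingClassifier(random_state=0),
    }
    for name, model in models.items():
        y_hat = model.fit(X_tr, y_tr).predict(X_te)
        print(f"{name:20s} acc={accuracy_score(y_te, y_hat):.3f} "
              f"prec={precision_score(y_te, y_hat):.3f} "
              f"rec={recall_score(y_te, y_hat):.3f}")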
    Semantic Unfolding of StyleGAN Latent Space. (arXiv:2206.14892v1 [cs.CV])
    Generative adversarial networks (GANs) have proven to be surprisingly efficient for image editing by inverting and manipulating the latent code corresponding to an input real image. This editing property emerges from the disentangled nature of the latent space. In this paper, we observe that facial attribute disentanglement is not optimal, so facial editing that relies on linear attribute separation is flawed. We thus propose to improve semantic disentanglement with supervision. Our method consists of learning a proxy latent representation using normalizing flows, and we show that this leads to a more efficient space for face image editing.  ( 2 min )
    Stochastic Bilevel Distributed Optimization over a Network. (arXiv:2206.15025v1 [cs.LG])
    Bilevel optimization has been applied to a wide variety of machine learning models, and numerous stochastic bilevel optimization algorithms have been developed in recent years. However, most of them restrict their focus to the single-machine setting and are therefore incapable of handling distributed data. To address this issue, under the setting where all participants compose a network and perform peer-to-peer communication in this network, we develop two novel distributed stochastic bilevel optimization algorithms based on the gradient tracking communication mechanism and two different gradient estimators. We show that they achieve $O(\frac{1}{\epsilon^{2}(1-\lambda)^2})$ and $O(\frac{1}{\epsilon^{3/2}(1-\lambda)^2})$ convergence rates, respectively, to obtain an $\epsilon$-accurate solution, where $1-\lambda$ denotes the spectral gap of the communication network. To our knowledge, this is the first work achieving these theoretical results. Finally, we applied our algorithms to practical machine learning models, and the experimental results confirmed their efficacy.  ( 2 min )
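    The gradient-tracking mechanism the two algorithms build on is easy to sketch in isolation. The hedged toy below applies it to a single-level decentralized least-squares problem over a fixed ring network; the bilevel structure, stochastic gradient estimators, and tuned step sizes of the paper are omitted or replaced by illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, eta = 5, 3, 0.05
    A = rng.normal(size=(n, 10, d))                   # node-local data
    b = rng.normal(size=(n, 10))
    W = np.zeros((n, n))                              # doubly stochastic ring weights
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

    def grad(i, x):                                   # local gradient at node i
        return A[i].T @ (A[i] @ x - b[i]) / 10

    x = np.zeros((n, d))
    y = np.array([grad(i, x[i]) for i in range(n)])   # trackers start at local grads
    for _ in range(500):
        x_new = W @ x - eta * y                       # mix with neighbors, then step
        # trackers mix too, plus the change in local gradients ("tracking")
        y = W @ y + np.array([grad(i, x_new[i]) - grad(i, x[i]) for i in range(n)])
        x = x_new
    print("consensus spread:", np.abs(x - x.mean(axis=0)).max())
    print("global grad norm:", np.linalg.norm(sum(grad(i, x.mean(axis=0)) for i in range(n))))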
    Best of Both Worlds Model Selection. (arXiv:2206.14912v1 [cs.LG])
    We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learner's candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, but with the additional benefit of achieving high probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best of both world (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.  ( 2 min )
    Continuous-Time and Multi-Level Graph Representation Learning for Origin-Destination Demand Prediction. (arXiv:2206.15005v1 [cs.LG])
    Traffic demand forecasting by deep neural networks has attracted widespread interest in both academia and industry. Among such problems, pairwise Origin-Destination (OD) demand prediction is valuable but challenging due to several factors: (i) the large number of possible OD pairs, (ii) the implicitness of spatial dependence, and (iii) the complexity of traffic states. To address the above issues, this paper proposes a Continuous-time and Multi-level dynamic graph representation learning method for Origin-Destination demand prediction (CMOD). Firstly, a continuous-time dynamic graph representation learning framework is constructed, which maintains a dynamic state vector for each traffic node (metro stations or taxi zones). The state vectors keep historical transaction information and are continuously updated according to the most recent transactions. Secondly, a multi-level structure learning module is proposed to model the spatial dependency of station-level nodes. It can not only exploit relations between nodes adaptively from data, but also share messages and representations via cluster-level and area-level virtual nodes. Lastly, a cross-level fusion module is designed to integrate multi-level memories and generate comprehensive node representations for the final prediction. Extensive experiments are conducted on two real-world datasets from the Beijing Subway and New York Taxi, and the results demonstrate the superiority of our model against state-of-the-art approaches.  ( 3 min )
    Lookback for Learning to Branch. (arXiv:2206.14987v1 [cs.LG])
    The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound (B&B) solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.  ( 3 min )
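    To make the target-smoothing variant concrete, here is a hedged PyTorch sketch of a cross-entropy against a "lookback"-smoothed target; the mixing weight alpha and the tensor layout are illustrative assumptions rather than the paper's exact recipe.
    import torch
    import torch.nn.functional as F

    def lookback_smoothed_loss(logits, best_idx, parent_second_best_idx, alpha=0.1):
        """Cross-entropy against a smoothed target that keeps mass (1 - alpha) on
        strong branching's best variable and puts alpha on the parent node's
        second-best choice (the 'lookback' target)."""
        n, num_vars = logits.shape
        target = torch.zeros(n, num_vars)
        target[torch.arange(n), best_idx] = 1.0 - alpha
        target[torch.arange(n), parent_second_best_idx] += alpha
        return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    logits = torch.randn(4, 10, requires_grad=True)   # 4 B&B nodes, 10 candidates
    best = torch.tensor([2, 5, 0, 7])                 # strong branching's choice
    parent_2nd = torch.tensor([3, 5, 1, 7])           # parent's second-best choice
    loss = lookback_smoothed_loss(logits, best, parent_2nd)
    loss.backward()
    print(float(loss))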
    A Validity Perspective on Evaluating the Justified Use of Data-driven Decision-making Algorithms. (arXiv:2206.14983v1 [cs.LG])
    This work seeks to center validity considerations in deliberations around whether and how to build data-driven algorithms in high-stakes domains. Toward this end, we translate key concepts from validity theory to predictive algorithms. We describe common challenges in problem formulation and data issues that jeopardize the validity of predictive algorithms. We distill these issues into a series of high-level questions intended to promote and document reflections on the legitimacy of the predictive task and the suitability of the data. This contribution lays the foundation for co-designing a validity protocol, in collaboration with real-world stakeholders, including decision-makers, modelers, and members of potentially impacted communities, to critically evaluate the justifiability of specific designs and uses of data-driven algorithmic systems.  ( 2 min )
    Manifold Interpolating Optimal-Transport Flows for Trajectory Inference. (arXiv:2206.14928v1 [cs.LG])
    Here, we present a method called Manifold Interpolating Optimal-Transport Flow (MIOFlow) that learns stochastic, continuous population dynamics from static snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models, manifold learning, and optimal transport by training neural ordinary differential equations (Neural ODE) to interpolate between static population snapshots as penalized by optimal transport with manifold ground distance. Further, we ensure that the flow follows the geometry by operating in the latent space of an autoencoder that we call a geodesic autoencoder (GAE). In GAE the latent space distance between points is regularized to match a novel multiscale geodesic distance on the data manifold that we define. We show that this method is superior to normalizing flows, Schrödinger bridges and other generative models that are designed to flow from noise to data in terms of interpolating between populations. Theoretically, we link these trajectories with dynamic optimal transport. We evaluate our method on simulated data with bifurcations and merges, as well as scRNA-seq data from embryoid body differentiation, and acute myeloid leukemia treatment.  ( 2 min )
    On Non-Random Missing Labels in Semi-Supervised Learning. (arXiv:2206.14923v1 [cs.CV])
    Semi-Supervised Learning (SSL) is fundamentally a missing label problem, in which the label Missing Not At Random (MNAR) setting is more realistic and challenging than the widely adopted yet naive Missing Completely At Random assumption, where labeled and unlabeled data share the same class distribution. Different from existing SSL solutions that overlook the role of "class" in causing the non-randomness, e.g., users are more likely to label popular classes, we explicitly incorporate "class" into SSL. Our method is three-fold: 1) We propose Class-Aware Propensity (CAP), which exploits the unlabeled data to train an improved classifier using the biased labeled data. 2) To encourage the training of rare classes, whose models are low-recall but high-precision and thus discard too many pseudo-labeled data, we propose Class-Aware Imputation (CAI), which dynamically decreases (or increases) the pseudo-label assignment threshold for rare (or frequent) classes. 3) Overall, we integrate CAP and CAI into a Class-Aware Doubly Robust (CADR) estimator for training an unbiased SSL model. Under various MNAR settings and ablations, our method not only significantly outperforms existing baselines but also surpasses other label bias removal SSL methods. Please check our code at: https://github.com/JoyHuYY1412/CADR-FixMatch.  ( 2 min )
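    The class-aware threshold idea in CAI is simple to sketch. A minimal NumPy toy, assuming the exact schedule differs in the paper; here a class's pseudo-label threshold shrinks as its estimated frequency drops, and the names and constants are illustrative.
    import numpy as np

    def class_aware_thresholds(class_freq, base_tau=0.95, floor=0.6):
        # rarer classes get lower thresholds to encourage their training
        freq = class_freq / class_freq.max()
        return floor + (base_tau - floor) * freq

    def impute_pseudo_labels(probs, class_freq):
        tau = class_aware_thresholds(class_freq)        # per-class thresholds
        conf, pred = probs.max(axis=1), probs.argmax(axis=1)
        keep = conf >= tau[pred]                        # class-dependent cutoff
        return pred[keep], keep

    rng = np.random.default_rng(0)
    probs = rng.dirichlet(alpha=[1.0] * 3, size=8)      # fake classifier outputs
    freq = np.array([100.0, 50.0, 5.0])                 # class 2 is rare
    labels, mask = impute_pseudo_labels(probs, freq)
    print(class_aware_thresholds(freq), labels, mask)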
    Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation. (arXiv:2206.15047v1 [cs.LG])
    Ensembles of deep neural networks have demonstrated superior performance, but their heavy computational cost hinders applying them for resource-limited environments. It motivates distilling knowledge from the ensemble teacher into a smaller student network, and there are two important design choices for this ensemble distillation: 1) how to construct the student network, and 2) what data should be shown during training. In this paper, we propose a weight averaging technique where a student with multiple subnetworks is trained to absorb the functional diversity of ensemble teachers, but then those subnetworks are properly averaged for inference, giving a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the diversities of teachers can be better transferred to the student. Combining these two, our method significantly improves upon previous methods on various image classification tasks.  ( 2 min )
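    The inference-time collapse is the easy part to sketch: a hedged PyTorch toy that averages the parameters of several identically structured subnetworks into a single student (the diversifying perturbation strategy and the distillation loss itself are not shown, and the tiny architecture is illustrative).
    import copy
    import torch
    import torch.nn as nn

    def average_subnetworks(subnets):
        """Collapse M trained subnetworks into one student by averaging
        parameters key-by-key; inference then costs a single forward pass."""
        avg = copy.deepcopy(subnets[0])
        with torch.no_grad():
            state = avg.state_dict()
            for key in state:
                state[key] = torch.stack(
                    [s.state_dict()[key].float() for s in subnets]).mean(dim=0)
            avg.load_state_dict(state)
        return avg

    subnets = [nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
               for _ in range(3)]
    student = average_subnetworks(subnets)
    print(student(torch.randn(1, 8)))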
    Semi-Supervised Generative Adversarial Network for Stress Detection Using Partially Labeled Physiological Data. (arXiv:2206.14976v1 [cs.LG])
    Physiological measurement involves observing variables that reflect the normative functioning of human systems and subsystems, directly or indirectly. The measurements can be used to detect the affective state of a person, with aims such as improving human-computer interaction. There are several methods of collecting physiological data, but wearable sensors are a common, non-invasive tool for accurate readings. However, valuable information is hard to extract from raw physiological data, especially for affective state detection. Machine learning techniques are used to detect the affective state of a person from labeled physiological data. A clear problem with using labeled data is creating accurate labels: an expert is needed to analyze recordings of participants and mark sections with different states, such as stress and calm. While expensive, this method delivers a complete dataset of labeled data that can be used in any number of supervised algorithms. An interesting question arises from the expensive labeling: how can we reduce the cost while maintaining high accuracy? Semi-supervised learning (SSL) is a potential solution to this problem. These algorithms allow machine learning models to be trained with only a small subset of labeled data (unlike unsupervised methods, which use no labels), providing a way to avoid expensive labeling. This paper compares a fully supervised algorithm to an SSL algorithm on the public WESAD (Wearable Stress and Affect Detection) dataset for stress detection, and shows that semi-supervised algorithms are a viable method for inexpensive affective state detection systems with accurate results.  ( 3 min )
    Discrete Langevin Sampler via Wasserstein Gradient Flow. (arXiv:2206.14897v1 [cs.LG])
    Recently, a family of locally balanced (LB) samplers has demonstrated excellent performance at sampling and learning energy-based models (EBMs) in discrete spaces. However, the theoretical understanding of this success is limited. In this work, we show how LB functions give rise to LB dynamics corresponding to Wasserstein gradient flow in a discrete space. From first principles, previous LB samplers can then be seen as discretizations of the LB dynamics with respect to Hamming distance. Based on this observation, we propose a new algorithm, the Locally Balanced Jump (LBJ), by discretizing the LB dynamics with respect to simulation time. As a result, LBJ has a location-dependent "velocity" that allows it to make proposals with larger distances. Additionally, LBJ decouples each dimension into independent sub-processes, enabling convenient parallel implementation. We demonstrate the advantages of LBJ for sampling and learning in various binary and categorical distributions.  ( 2 min )
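    For background, the earlier locally balanced samplers that this paper reinterprets are compact to sketch. Below is a hedged NumPy toy with the common weight function g(t) = sqrt(t) on a small Ising-like energy; this sketches a single-bit-flip LB sampler, not the proposed LBJ algorithm, and the couplings and chain length are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 10
    J = rng.normal(size=(d, d)); J = (J + J.T) / 2   # random symmetric couplings

    def energy(x):
        return -0.5 * x @ J @ x                      # pi(x) proportional to exp(-energy)

    def flip_weights(x):
        # g(pi(x')/pi(x)) = exp(-(E(x') - E(x)) / 2) for each single-bit flip
        e0, ws = energy(x), np.empty(d)
        for i in range(d):
            x2 = x.copy(); x2[i] *= -1
            ws[i] = np.exp(-(energy(x2) - e0) / 2)
        return ws

    def lb_step(x):
        w = flip_weights(x)
        i = rng.choice(d, p=w / w.sum())             # locally balanced proposal
        x_new = x.copy(); x_new[i] *= -1
        # for g(t) = sqrt(t), the Metropolis-Hastings ratio collapses to W(x)/W(x')
        if rng.random() < min(1.0, w.sum() / flip_weights(x_new).sum()):
            return x_new
        return x

    x = rng.choice([-1.0, 1.0], size=d)
    for _ in range(500):
        x = lb_step(x)
    print("final state:", x, " energy:", energy(x))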
    Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise. (arXiv:2206.14947v1 [q-bio.NC])
    Electromyography (EMG) signals can be used as training data for machine learning models that classify various gestures. We seek to produce a model that classifies six different hand gestures from a limited number of samples and generalizes well to a wider audience, while comparing the effect of our feature extraction on model accuracy against more conventional methods, such as AR parameters computed on a sliding window across the channels of a signal. We appeal to a set of more elementary methods, such as random bounds on a signal, and aim to show the power these methods can carry in an online setting where EMG classification is being conducted, as opposed to more complicated methods such as the Fourier Transform. To augment our limited training data, we used a standard technique known as jitter, where random noise is added to each observation in a channel-wise manner. Once all datasets were produced using the above methods, we performed a grid search with Random Forest and XGBoost to ultimately create a high-accuracy model. High-accuracy classification of EMG signals is of particular importance to human-computer interfaces, and given the difficulty and cost of amassing biomedical data in high volume, it is valuable to have techniques that work with a small number of high-quality samples and with less expensive feature extraction methods that can reliably be carried out in an online application.  ( 3 min )
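    The jitter augmentation itself is a few lines; the sketch below is a hedged NumPy toy in which the array shapes and the variance range are illustrative assumptions.
    import numpy as np

    def jitter(emg, sigma_low=0.01, sigma_high=0.1, rng=None):
        """Add channel-wise Gaussian noise with a randomly drawn scale,
        producing extra synthetic trials from a small EMG dataset."""
        rng = rng or np.random.default_rng()
        trials, channels, samples = emg.shape
        # one noise scale per (trial, channel), broadcast across time samples
        sigma = rng.uniform(sigma_low, sigma_high, size=(trials, channels, 1))
        return emg + rng.normal(size=emg.shape) * sigma

    rng = np.random.default_rng(0)
    emg = rng.normal(size=(32, 8, 200))          # 32 trials, 8 channels, 200 samples
    augmented = np.concatenate([emg, jitter(emg, rng=rng)], axis=0)
    print(augmented.shape)                       # (64, 8, 200)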
    Towards Federated Long-Tailed Learning. (arXiv:2206.14988v1 [cs.LG])
    Data privacy and class imbalance are the norm rather than the exception in many machine learning tasks. Recent attempts have been launched to, on the one hand, address the problem of learning from pervasive private data and, on the other, to learn from long-tailed data. However, both issues often arise together in practical applications, and an effective method that simultaneously alleviates both is still under development. In this paper, we focus on learning with long-tailed (LT) data distributions under the context of the popular privacy-preserving federated learning (FL) framework. We characterize three scenarios with different local or global long-tailed data distributions in the FL framework and highlight the corresponding challenges. The preliminary results under different scenarios reveal that substantial future work is needed to better resolve the characterized federated long-tailed learning tasks.  ( 2 min )
    Solving Quantitative Reasoning Problems with Language Models. (arXiv:2206.14858v1 [cs.CL])
    Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.  ( 2 min )
    Causality for Inherently Explainable Transformers: CAT-XPLAIN. (arXiv:2206.14841v1 [cs.CV])
    There have been several post-hoc explanation approaches developed to explain pre-trained black-box neural networks. However, there is still a gap in research efforts toward designing neural networks that are inherently explainable. In this paper, we utilize a recently proposed instance-wise post-hoc causal explanation method to make an existing transformer architecture inherently explainable. Once trained, our model provides an explanation in the form of the top-$k$ regions in the input space of the given instance that contribute to its decision. We evaluate our method on binary classification tasks using three image datasets: MNIST, FMNIST, and CIFAR. Our results demonstrate that, compared to the causality-based post-hoc explainer model, our inherently explainable model achieves better explainability results while eliminating the need to train a separate explainer model. Our code is available at https://github.com/mvrl/CAT-XPLAIN.  ( 2 min )
    Randomized Coordinate Subgradient Method for Nonsmooth Optimization. (arXiv:2206.14981v1 [math.OC])
    Nonsmooth optimization finds wide applications in many engineering fields. In this work, we propose to utilize the Randomized Coordinate Subgradient method (RCS) for solving both nonsmooth convex and nonsmooth nonconvex (nonsmooth weakly convex) optimization problems. At each iteration, RCS randomly selects one block coordinate rather than all the coordinates to update. Motivated by practical applications, we consider a linearly bounded subgradients assumption for the objective function, which is much more general than the Lipschitz continuity assumption. Under this general assumption, we conduct a thorough convergence analysis of RCS in both the convex and nonconvex cases and establish both expected convergence rates and almost sure asymptotic convergence results. In order to derive these convergence results, we establish a convergence lemma and the relationship between the global metric subregularity properties of a weakly convex function and its Moreau envelope, which are fundamental and of independent interest. Finally, we conduct several experiments to show the possible superiority of RCS over the subgradient method.  ( 2 min )
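    The RCS iteration is simple to sketch. The NumPy toy below runs it on the nonsmooth convex objective f(x) = ||x - b||_1; the block partition and the 1/sqrt(t) step size are illustrative choices, not the theoretically optimal settings from the paper.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n_blocks = 12, 4
    b = rng.normal(size=d)
    blocks = np.array_split(np.arange(d), n_blocks)  # fixed block partition
    x = np.zeros(d)

    for t in range(1, 5001):
        blk = blocks[rng.integers(n_blocks)]         # sample one coordinate block
        g = np.sign(x[blk] - b[blk])                 # block subgradient of ||x - b||_1
        x[blk] = x[blk] - g / np.sqrt(t)             # update only that block
    print("f(x) = ||x - b||_1 =", np.abs(x - b).sum())   # should be small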
    Momentum Diminishes the Effect of Spectral Bias in Physics-Informed Neural Networks. (arXiv:2206.14862v1 [cs.LG])
    Physics-informed neural network (PINN) algorithms have shown promising results in solving a wide range of problems involving partial differential equations (PDEs). However, they often fail to converge to desirable solutions when the target function contains high-frequency features, due to a phenomenon known as spectral bias. In the present work, we exploit neural tangent kernels (NTKs) to investigate the training dynamics of PINNs evolving under stochastic gradient descent with momentum (SGDM), and demonstrate that SGDM significantly reduces the effect of spectral bias. We also examine why training a model via the Adam optimizer can accelerate convergence while reducing spectral bias. Moreover, our numerical experiments confirm that wide-enough networks using SGDM still converge to desirable solutions, even in the presence of high-frequency features. In fact, we show that the width of a network plays a critical role in convergence.  ( 2 min )
    AFAFed -- Protocol analysis. (arXiv:2206.14927v1 [cs.LG])
    In this paper, we design AFAFed, analyze its convergence properties, and address its implementation aspects. AFAFed is a novel Asynchronous Fair Adaptive Federated learning framework for stream-oriented IoT application environments, which are characterized by time-varying operating conditions, heterogeneous resource-limited devices (i.e., coworkers), non-i.i.d. local training data, and unreliable communication links. The key novelty of AFAFed is the synergic co-design of: (i) two sets of adaptively tuned tolerance thresholds and fairness coefficients at the coworkers and central server, respectively; and (ii) a distributed adaptive mechanism, which allows each coworker to adaptively tune its own communication rate. The convergence of AFAFed under (possibly) non-convex loss functions is guaranteed by a set of new analytical bounds, which formally unveil the impact on the resulting convergence rate of a number of Federated Learning (FL) parameters, such as the first and second moments of the per-coworker number of consecutive model updates, data skewness, communication packet-loss probability, and the maximum/minimum values of the (adaptively tuned) mixing coefficient used for model aggregation.  ( 2 min )
    A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback. (arXiv:2206.14906v1 [cs.LG])
    We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is $\mathcal{O}(\sqrt{TK} + \sqrt{dT\log K})$, where $T$ is the time horizon, $K$ is the number of arms, and $d$ is the fixed delay, whereas the stochastic regret guarantee is $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{d}{\Delta_{i}\log K}) + d K^{1/3}\log K\right)$, where $\Delta_i$ are the suboptimality gaps. We also present an extension of the algorithm to the case of arbitrary delays, which is based on an oracle knowledge of the maximal delay $d_{max}$ and achieves $\mathcal{O}(\sqrt{TK} + \sqrt{D\log K} + d_{max}K^{1/3} \log K)$ regret in the adversarial regime, where $D$ is the total delay, and $\mathcal{O}\left(\sum_{i \neq i^*}(\frac{1}{\Delta_i} \log(T) + \frac{\sigma_{max}}{\Delta_{i}\log K}) + d_{max}K^{1/3}\log K\right)$ regret in the stochastic regime, where $\sigma_{max}$ is the maximal number of outstanding observations. Finally, we present a lower bound that matches regret upper bound achieved by the skipping technique of Zimmert and Seldin [2020] in the adversarial setting.  ( 2 min )
    Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization. (arXiv:2206.14846v1 [cs.LG])
    Online influence maximization aims to maximize the influence spread of content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish an $\widetilde{O}(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic networks demonstrate the efficiency of our algorithm.  ( 2 min )
    Fairness via In-Processing in the Over-parameterized Regime: A Cautionary Tale. (arXiv:2206.14853v1 [cs.LG])
    The success of DNNs is driven by the counter-intuitive ability of over-parameterized networks to generalize, even when they perfectly fit the training data. In practice, test error often continues to decrease with increasing over-parameterization, a phenomenon referred to as double descent. This allows practitioners to instantiate large models without having to worry about over-fitting. Despite its benefits, however, prior work has shown that over-parameterization can exacerbate bias against minority subgroups. Several fairness-constrained DNN training methods have been proposed to address this concern. Here, we critically examine MinDiff, a fairness-constrained training procedure implemented within TensorFlow's Responsible AI Toolkit, that aims to achieve Equality of Opportunity. We show that although MinDiff improves fairness for under-parameterized models, it is likely to be ineffective in the over-parameterized regime. This is because an overfit model with zero training loss is trivially group-wise fair on training data, creating an "illusion of fairness" and thus turning off the MinDiff optimization (this applies to any disparity-based measure that depends on errors or accuracy; it does not apply to demographic parity). Within specified fairness constraints, under-parameterized MinDiff models can even have lower error compared to their over-parameterized counterparts (despite baseline over-parameterized models having lower error). We further show that MinDiff optimization is very sensitive to the choice of batch size in the under-parameterized regime. Thus, fair model training using MinDiff requires time-consuming hyper-parameter searches. Finally, we suggest using previously proposed regularization techniques, viz. L2, early stopping, and flooding, in conjunction with MinDiff to train fair over-parameterized models.  ( 3 min )
    Strong Lensing Source Reconstruction Using Continuous Neural Fields. (arXiv:2206.14820v1 [astro-ph.CO])
    From the nature of dark matter to the rate of expansion of our Universe, observations of distant galaxies distorted through strong gravitational lensing have the potential to answer some of the major open questions in astrophysics. Modeling galaxy-galaxy strong lensing observations presents a number of challenges, as the exact configuration of both the background source and foreground lens galaxy is unknown. A number of upcoming surveys anticipating high-resolution lensing images prompt a timely call for methods that can efficiently model lenses at their full complexity. In this work, we introduce a method that uses continuous neural fields to non-parametrically reconstruct the complex morphology of a source galaxy while simultaneously inferring a distribution over foreground lens galaxy configurations. We demonstrate the efficacy of our method through experiments on simulated data targeting high-resolution lensing images similar to those anticipated in near-future astrophysical surveys.  ( 2 min )
  • Open

    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v2 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class -- in our case, empirical Rademacher complexity -- by how much can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, orthogonal to classic compression via weight pruning.  ( 2 min )
    Rethinking Exponential Averaging of the Fisher. (arXiv:2204.04718v2 [cs.LG] UPDATED)
    In optimization for machine learning (ML), it is typical that curvature-matrix (CM) estimates rely on an exponential average (EA) of local estimates (giving EA-CM algorithms). This approach has little principled justification, but is very often used in practice. In this paper, we draw a connection between EA-CM algorithms and what we call a "Wake of Quadratic regularized models". The outlined connection allows us to understand what EA-CM algorithms are doing from an optimization perspective. Generalizing from the established connection, we propose a new family of algorithms, "KL-Divergence Wake-Regularized Models" (KLD-WRM). We give three different practical instantiations of KLD-WRM and show numerically that these outperform K-FAC on MNIST.  ( 2 min )
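    To fix ideas, here is a hedged NumPy toy of the EA-CM pattern the paper studies: a curvature-matrix proxy built from gradient outer products is exponentially averaged and used as a damped preconditioner. The objective, decay rate, and damping are illustrative assumptions, not any particular EA-CM method's settings.
    import numpy as np

    rng = np.random.default_rng(0)
    d, beta, damping = 5, 0.95, 0.1
    C = np.zeros((d, d))                       # running curvature-matrix estimate
    x = rng.normal(size=d)

    for t in range(300):
        g = 2 * x + 0.1 * rng.normal(size=d)   # noisy gradient of ||x||^2
        C = beta * C + (1 - beta) * np.outer(g, g)    # the exponential average (EA)
        # damped preconditioned step using the EA curvature estimate
        x -= 0.1 * np.linalg.solve(C + damping * np.eye(d), g)
    print("||x|| after EA-CM steps:", np.linalg.norm(x))  # near 0, up to gradient noise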
    A Latent Restoring Force Approach to Nonlinear System Identification. (arXiv:2109.10681v2 [stat.ML] UPDATED)
    Identification of nonlinear dynamic systems remains a significant challenge across engineering. This work suggests an approach based on Bayesian filtering to extract and identify the contribution of an unknown nonlinear term in the system, which can be seen as an alternative viewpoint on restoring-force-surface-type approaches. To achieve this identification, the contribution, namely the nonlinear restoring force, is initially modelled as a Gaussian process in time. That Gaussian process is converted into a state-space model and combined with the linear dynamic component of the system. Then, by inference of the filtering and smoothing distributions, the internal states of the system and the nonlinear restoring force can be extracted. In possession of these states, a nonlinear model can be constructed. The approach is demonstrated to be effective in both a simulated case study and on an experimental benchmark dataset.  ( 2 min )
    Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms. (arXiv:2203.02474v2 [stat.ML] UPDATED)
    Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the `compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the `lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a `lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.  ( 3 min )
    A note on Linear Bottleneck networks and their Transition to Multilinearity. (arXiv:2206.15058v1 [cs.LG])
    Randomly initialized wide neural networks transition to linear functions of weights as the width grows, in a ball of radius $O(1)$ around initialization. A necessary condition for this result is that all layers of the network are wide enough, i.e., all widths tend to infinity. However, the transition to linearity breaks down when this infinite width assumption is violated. In this work we show that linear networks with a bottleneck layer learn bilinear functions of the weights, in a ball of radius $O(1)$ around initialization. In general, for $B-1$ bottleneck layers, the network is a degree $B$ multilinear function of weights. Importantly, the degree only depends on the number of bottlenecks and not the total depth of the network.  ( 2 min )
    Reconstructing the Universe with Variational self-Boosted Sampling. (arXiv:2206.15433v1 [astro-ph.IM])
    Forward modeling approaches in cosmology have made it possible to reconstruct the initial conditions at the beginning of the Universe from observed survey data. However, the high dimensionality of the parameter space still poses a challenge to exploring the full posterior: traditional algorithms such as Hamiltonian Monte Carlo (HMC) are computationally inefficient because they generate correlated samples, and the performance of variational inference is highly dependent on the choice of divergence (loss) function. Here we develop a hybrid scheme, called variational self-boosted sampling (VBS), to mitigate the drawbacks of both algorithms by learning a variational approximation for the proposal distribution of Monte Carlo sampling and combining it with HMC. The variational distribution is parameterized as a normalizing flow and learnt with samples generated on the fly, while proposals drawn from it reduce the auto-correlation length in MCMC chains. Our normalizing flow uses Fourier space convolutions and element-wise operations to scale to high dimensions. We show that after a short initial warm-up and training phase, VBS generates better-quality samples than simple VI approaches and reduces the correlation length in the sampling phase by a factor of 10-50 over using only HMC to explore the posterior of initial conditions in 64$^3$ and 128$^3$ dimensional problems, with larger gains for high signal-to-noise data observations.  ( 3 min )
    Transfer Learning with Deep Tabular Models. (arXiv:2206.15306v1 [cs.LG])
    Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they learn reusable features and are easily fine-tuned in new domains. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we demonstrate that upstream data gives tabular neural networks a decisive advantage over widely used GBDT models. We propose a realistic medical diagnosis benchmark for tabular transfer learning, and we present a how-to guide for using upstream data to boost performance with a variety of tabular neural network architectures. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications. Our code is available at https://github.com/LevinRoman/tabular-transfer-learning .  ( 2 min )
    Capturing Shape Information with Multi-Scale Topological Loss Terms for 3D Reconstruction. (arXiv:2203.01703v2 [cs.CV] UPDATED)
    Reconstructing 3D objects from 2D images is both challenging for our brains and machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method uses cubical complexes to calculate topological features of 3D volume data and employs an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.  ( 3 min )
    SOSP: Efficiently Capturing Global Correlations by Second-Order Structured Pruning. (arXiv:2110.11395v2 [cs.LG] UPDATED)
    Pruning neural networks reduces inference time and memory costs. On standard hardware, these benefits will be especially prominent if coarse-grained structures, like feature maps, are pruned. We devise two novel saliency-based methods for second-order structured pruning (SOSP) which include correlations among all structures and layers. Our main method SOSP-H employs an innovative second-order approximation, which enables saliency evaluations by fast Hessian-vector products. SOSP-H thereby scales like a first-order method despite taking into account the full Hessian. We validate SOSP-H by comparing it to our second method SOSP-I that uses a well-established Hessian approximation, and to numerous state-of-the-art methods. While SOSP-H performs on par or better in terms of accuracy, it has clear advantages in terms of scalability and efficiency. This allowed us to scale SOSP-H to large-scale vision tasks, even though it captures correlations across all layers of the network. To underscore the global nature of our pruning methods, we evaluate their performance not only by removing structures from a pretrained network, but also by detecting architectural bottlenecks. We show that our algorithms allow to systematically reveal architectural bottlenecks, which we then remove to further increase the accuracy of the networks.  ( 3 min )
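    The fast Hessian-vector product underpinning SOSP-H is standard autograd machinery and is easy to sketch. Below is a hedged PyTorch toy that scores a single candidate structure (one output unit of the first layer) with the identity Hv = d/dw <grad, v>; the tiny model, data, and second-order saliency expression are illustrative, not the paper's full pipeline.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
    x, y = torch.randn(16, 4), torch.randn(16, 2)
    loss = nn.functional.mse_loss(model(x), y)

    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # v masks out one candidate structure: output unit 0 of the first layer
    v = [torch.zeros_like(p) for p in params]
    v[0][0, :] = params[0][0, :].detach()        # weights of that unit
    v[1][0] = params[1][0].detach()              # its bias

    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    Hv = torch.autograd.grad(gv, params)         # one fast Hessian-vector product
    # second-order Taylor estimate of the loss change when the structure is
    # removed (delta = -v): -g^T v + 0.5 * v^T H v
    saliency = (-gv + 0.5 * sum((h * vi).sum() for h, vi in zip(Hv, v))).item()
    print("saliency of unit 0:", saliency)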
    Counterfactual Inference of Second Opinions. (arXiv:2203.08653v2 [cs.LG] UPDATED)
    Automated decision support systems that are able to infer second opinions from experts can potentially facilitate a more efficient allocation of resources; they can help decide when and from whom to seek a second opinion. In this paper, we look at the design of this type of support systems from the perspective of counterfactual inference. We focus on a multiclass classification setting and first show that, if experts make predictions on their own, the underlying causal mechanism generating their predictions needs to satisfy a desirable set invariant property. Further, we show that, for any causal mechanism satisfying this property, there exists an equivalent mechanism where the predictions by each expert are generated by independent sub-mechanisms governed by a common noise. This motivates the design of a set invariant Gumbel-Max structural causal model where the structure of the noise governing the sub-mechanisms underpinning the model depends on an intuitive notion of similarity between experts which can be estimated from data. Experiments on both synthetic and real data show that our model can be used to infer second opinions more accurately than its non-causal counterpart.  ( 2 min )
    Verification and search algorithms for causal DAGs. (arXiv:2206.15374v1 [cs.LG])
    We study two problems related to recovering causal graphs from interventional data: (i) $\textit{verification}$, where the task is to check if a purported causal graph is correct, and (ii) $\textit{search}$, where the task is to recover the correct causal graph. For both, we wish to minimize the number of interventions performed. For the first problem, we give a characterization of a minimal sized set of atomic interventions that is necessary and sufficient to check the correctness of a claimed causal graph. Our characterization uses the notion of $\textit{covered edges}$, which enables us to obtain simple proofs and also easily reason about earlier results. We also generalize our results to the settings of bounded size interventions and node-dependent interventional costs. For all the above settings, we provide the first known provable algorithms for efficiently computing (near)-optimal verifying sets on general graphs. For the second problem, we give a simple adaptive algorithm based on graph separators that produces an atomic intervention set which fully orients any essential graph while using $\mathcal{O}(\log n)$ times the optimal number of interventions needed to $\textit{verify}$ (verifying size) the underlying DAG on $n$ vertices. This approximation is tight as $\textit{any}$ search algorithm on an essential line graph has worst case approximation ratio of $\Omega(\log n)$ with respect to the verifying size. With bounded size interventions, each of size $\leq k$, our algorithm gives an $\mathcal{O}(\log n \cdot \log \log k)$ factor approximation. Our result is the first known algorithm that gives a non-trivial approximation guarantee to the verifying size on general unweighted graphs and with bounded size interventions.  ( 3 min )
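    The covered-edge notion at the heart of the verification characterization is easy to make concrete. A minimal sketch, assuming the DAG is given as parent sets; the example graph is illustrative.
    def covered_edges(parents):
        """An edge u -> v is covered iff pa(v) = pa(u) + {u}; reversing a covered
        edge yields a Markov-equivalent DAG, which is the intuition behind why
        these edges drive the verification characterization."""
        edges = [(u, v) for v, pas in parents.items() for u in pas]
        return [(u, v) for (u, v) in edges if parents[v] == parents[u] | {u}]

    # toy DAG: a -> b -> c together with a -> c
    parents = {"a": set(), "b": {"a"}, "c": {"a", "b"}}
    print(covered_edges(parents))   # [('a', 'b'), ('b', 'c')]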
    Towards out of distribution generalization for problems in mechanics. (arXiv:2206.14917v1 [stat.ML])
    There has been a massive increase in research interest towards applying data driven methods to problems in mechanics. While traditional machine learning (ML) methods have enabled many breakthroughs, they rely on the assumption that the training (observed) data and testing (unseen) data are independent and identically distributed (i.i.d). Thus, traditional ML approaches often break down when applied to real world mechanics problems with unknown test environments and data distribution shifts. In contrast, out-of-distribution (OOD) generalization assumes that the test data may shift (i.e., violate the i.i.d. assumption). To date, multiple methods have been proposed to improve the OOD generalization of ML methods. However, because of the lack of benchmark datasets for OOD regression problems, the efficiency of these OOD methods on regression problems, which dominate the mechanics field, remains unknown. To address this, we investigate the performance of OOD generalization methods for regression problems in mechanics. Specifically, we identify three OOD problems: covariate shift, mechanism shift, and sampling bias. For each problem, we create two benchmark examples that extend the Mechanical MNIST dataset collection, and we investigate the performance of popular OOD generalization methods on these mechanics-specific regression problems. Our numerical experiments show that in most cases, while the OOD generalization algorithms perform better compared to traditional ML methods on these OOD problems, there is a compelling need to develop more robust OOD generalization methods that are effective across multiple OOD scenarios. Overall, we expect that this study, as well as the associated open access benchmark datasets, will enable further development of OOD generalization methods for mechanics specific regression problems.  ( 3 min )
    Meta-analysis of heterogeneous data: integrative sparse regression in high-dimensions. (arXiv:1912.11928v2 [stat.ME] UPDATED)
    We consider the task of meta-analysis in high-dimensional settings in which the data sources are similar but non-identical. To borrow strength across such heterogeneous datasets, we introduce a global parameter that emphasizes interpretability and statistical efficiency in the presence of heterogeneity. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. For high-dimensional linear model settings, we demonstrate the superiority of our identification restrictions in adapting to a previously seen data distribution as well as predicting for a new/unseen data distribution. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset involving several different cancer cell-lines.  ( 2 min )
    A note on large deviations for interacting particle dynamics for finding mixed equilibria in zero-sum games. (arXiv:2206.15177v1 [stat.ML])
Finding equilibrium points in continuous minimax games has become a key problem within machine learning, in part due to its connection to the training of generative adversarial networks. Because of existence and robustness issues, recent developments have shifted from pure equilibria to focusing on mixed equilibria. In this note we consider a method proposed by Domingo-Enrich et al. for finding mixed equilibria in two-layer zero-sum games. The method is based on entropic regularisation and the two competing strategies are represented by two sets of interacting particles. We show that the sequence of empirical measures of the particle system satisfies a large deviation principle as the number of particles grows to infinity, and how this implies convergence of the empirical measure and the associated Nikaidô-Isoda error, complementing existing law of large numbers results.  ( 2 min )
    Interpretable Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory Models. (arXiv:2206.15316v1 [cs.LG])
We propose a novel anomaly detection method for echocardiogram videos. The introduced method takes advantage of the periodic nature of the heart cycle to learn different variants of a variational latent trajectory model (TVAE). The models are trained on the healthy samples of an in-house dataset of infant echocardiogram videos consisting of multiple chamber views to learn a normative prior of the healthy population. During inference, maximum a posteriori (MAP) based anomaly detection is performed to detect out-of-distribution samples in our dataset. The proposed method reliably identifies severe congenital heart defects, such as Ebstein's Anomaly or Shone's complex. Moreover, it achieves superior performance over MAP-based anomaly detection with standard variational autoencoders on the task of detecting pulmonary hypertension and right ventricular dilation. Finally, we demonstrate that the proposed method provides interpretable explanations of its output through heatmaps which highlight the regions corresponding to anomalous heart structures.  ( 2 min )
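For intuition only, a heavily simplified version of MAP-based anomaly scoring with a latent-variable model is sketched below; `model`, its `latent_dim` attribute, and its `decode` method are assumed stand-ins, and the Gaussian likelihood with a standard-normal prior is an illustrative choice rather than the paper's TVAE.

```python
# Sketch: score a sample by the negative log posterior of its MAP latent
# code; high scores suggest the sample is out-of-distribution.
import torch

def anomaly_score(model, x, steps=100, lr=1e-2):
    z = torch.zeros(model.latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                          # gradient-based MAP estimate
        opt.zero_grad()
        nll = ((model.decode(z) - x) ** 2).sum()    # Gaussian likelihood term
        loss = nll + 0.5 * (z ** 2).sum()           # standard-normal prior term
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (((model.decode(z) - x) ** 2).sum() + 0.5 * (z ** 2).sum()).item()
```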
    Chained Generalisation Bounds. (arXiv:2203.00977v2 [stat.ML] UPDATED)
    This work discusses how to derive upper bounds for the expected generalisation error of supervised learning algorithms by means of the chaining technique. By developing a general theoretical framework, we establish a duality between generalisation bounds based on the regularity of the loss function, and their chained counterparts, which can be obtained by lifting the regularity assumption from the loss onto its gradient. This allows us to re-derive the chaining mutual information bound from the literature, and to obtain novel chained information-theoretic generalisation bounds, based on the Wasserstein distance and other probability metrics. We show on some toy examples that the chained generalisation bound can be significantly tighter than its standard counterpart, particularly when the distribution of the hypotheses selected by the algorithm is very concentrated. Keywords: Generalisation bounds; Chaining; Information-theoretic bounds; Mutual information; Wasserstein distance; PAC-Bayes.  ( 2 min )
    Business analytics meets artificial intelligence: Assessing the demand effects of discounts on Swiss train tickets. (arXiv:2105.01426v4 [econ.GN] UPDATED)
We assess the demand effects of discounts on train tickets issued by the Swiss Federal Railways, the so-called 'supersaver tickets', based on machine learning, a subfield of artificial intelligence. Considering a survey-based sample of buyers of supersaver tickets, we investigate which customer- or trip-related characteristics (including the discount rate) predict buying behavior, namely: booking a trip otherwise not realized by train, buying a first- rather than second-class ticket, or rescheduling a trip (e.g., away from rush hours) when being offered a supersaver ticket. Predictive machine learning suggests that customer's age, demand-related information for a specific connection (like departure time and utilization), and the discount level permit forecasting buying behavior to a certain extent. Furthermore, we use causal machine learning to assess the impact of the discount rate on rescheduling a trip, which seems relevant in the light of capacity constraints at rush hours. Assuming that (i) the discount rate is quasi-random conditional on our rich set of characteristics and (ii) the buying decision increases weakly monotonically in the discount rate, we identify the discount rate's effect among 'always buyers', who would have traveled even without a discount, based on our survey that asks about customer behavior in the absence of discounts. We find that on average, increasing the discount rate by one percentage point increases the share of rescheduled trips by 0.16 percentage points among always buyers. Investigating effect heterogeneity across observables suggests that the effects are higher for leisure travelers and during peak hours, when controlling for several other characteristics.  ( 3 min )
    Provably Efficient Reinforcement Learning for Online Adaptive Influence Maximization. (arXiv:2206.14846v1 [cs.LG])
Online influence maximization aims to maximize the influence spread of a content in a social network with an unknown network model by selecting a few seed nodes. Recent studies followed a non-adaptive setting, where the seed nodes are selected before the start of the diffusion process and network parameters are updated when the diffusion stops. We consider an adaptive version of the content-dependent online influence maximization problem, where the seed nodes are sequentially activated based on real-time feedback. In this paper, we formulate the problem as an infinite-horizon discounted MDP under a linear diffusion process and present a model-based reinforcement learning solution. Our algorithm maintains a network model estimate and selects seed users adaptively, exploring the social network while improving the optimal policy optimistically. We establish a $\widetilde O(\sqrt{T})$ regret bound for our algorithm. Empirical evaluations on synthetic networks demonstrate the efficiency of our algorithm.  ( 2 min )
    Shifts 2.0: Extending The Dataset of Real Distributional Shifts. (arXiv:2206.15407v1 [cs.LG])
    Distributional shift, or the mismatch between training and deployment data, is a significant obstacle to the usage of machine learning in high-stakes industrial applications, such as autonomous driving and medicine. This creates a need to be able to assess how robustly ML models generalize as well as the quality of their uncertainty estimates. Standard ML baseline datasets do not allow these properties to be assessed, as the training, validation and test data are often identically distributed. Recently, a range of dedicated benchmarks have appeared, featuring both distributionally matched and shifted data. Among these benchmarks, the Shifts dataset stands out in terms of the diversity of tasks as well as the data modalities it features. While most of the benchmarks are heavily dominated by 2D image classification tasks, Shifts contains tabular weather forecasting, machine translation, and vehicle motion prediction tasks. This enables the robustness properties of models to be assessed on a diverse set of industrial-scale tasks and either universal or directly applicable task-specific conclusions to be reached. In this paper, we extend the Shifts Dataset with two datasets sourced from industrial, high-risk applications of high societal importance. Specifically, we consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels. Both tasks feature ubiquitous distributional shifts and a strict safety requirement due to the high cost of errors. These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations. In this work, we provide a description of the dataset and baseline results for both tasks.  ( 3 min )
    Universal and data-adaptive algorithms for model selection in linear contextual bandits. (arXiv:2111.04688v2 [cs.LG] UPDATED)
    Model selection in contextual bandits is an important complementary problem to regret minimization with respect to a fixed model class. We consider the simplest non-trivial instance of model-selection: distinguishing a simple multi-armed bandit problem from a linear contextual bandit problem. Even in this instance, current state-of-the-art methods explore in a suboptimal manner and require strong "feature-diversity" conditions. In this paper, we introduce new algorithms that a) explore in a data-adaptive manner, and b) provide model selection guarantees of the form $\mathcal{O}(d^{\alpha} T^{1- \alpha})$ with no feature diversity conditions whatsoever, where $d$ denotes the dimension of the linear model and $T$ denotes the total number of rounds. The first algorithm enjoys a "best-of-both-worlds" property, recovering two prior results that hold under distinct distributional assumptions, simultaneously. The second removes distributional assumptions altogether, expanding the scope for tractable model selection. Our approach extends to model selection among nested linear contextual bandits under some additional assumptions.  ( 2 min )
    Prediction of Dilatory Behavior in eLearning: A Comparison of Multiple Machine Learning Models. (arXiv:2206.15079v1 [stat.ML])
Procrastination, the irrational delay of tasks, is a common occurrence in online learning. Potential negative consequences include a higher risk of drop-outs, increased stress, and reduced mood. Due to the rise of learning management systems and learning analytics, indicators of such behavior can be detected, enabling predictions of future procrastination and other dilatory behavior. However, research focusing on such predictions is scarce. Moreover, studies involving different types of predictors and comparisons between the predictive performance of various methods are virtually non-existent. In this study, we aim to fill these research gaps by analyzing the performance of multiple machine learning algorithms when predicting the delayed or timely submission of online assignments in a higher education setting with two categories of predictors: subjective, questionnaire-based variables and objective, log-data-based indicators extracted from a learning management system. The results show that models with objective predictors consistently outperform models with subjective predictors, and a combination of both variable types performs slightly better. For each of these three options, a different approach prevailed (Gradient Boosting Machines for the subjective, Bayesian multilevel models for the objective, and Random Forest for the combined predictors). We conclude that careful attention should be paid to the selection of predictors and algorithms before implementing such models in learning management systems.  ( 3 min )
    Which Minimizer Does My Neural Network Converge To?. (arXiv:2011.02408v2 [stat.ML] UPDATED)
    The loss surface of an overparameterized neural network (NN) possesses many global minima of zero training error. We explain how common variants of the standard NN training procedure change the minimizer obtained. First, we make explicit how the size of the initialization of a strongly overparameterized NN affects the minimizer and can deteriorate its final test performance. We propose a strategy to limit this effect. Then, we demonstrate that for adaptive optimization such as AdaGrad, the obtained minimizer generally differs from the gradient descent (GD) minimizer. This adaptive minimizer is changed further by stochastic mini-batch training, even though in the non-adaptive case, GD and stochastic GD result in essentially the same minimizer. Lastly, we explain that these effects remain relevant for less overparameterized NNs. While overparameterization has its benefits, our work highlights that it induces sources of error absent from underparameterized models.  ( 2 min )
    Federated Over-Air Subspace Tracking from Incomplete and Corrupted Data. (arXiv:2002.12873v4 [cs.LG] UPDATED)
In this work we study the problem of Subspace Tracking with missing data (ST-miss) and outliers (Robust ST-miss). We propose a novel algorithm and provide a guarantee for both these problems. Unlike past work on this topic, the current work does not impose the piecewise-constant subspace change assumption. Additionally, the proposed algorithm is much simpler (uses fewer parameters) than our previous work. We further extend our approach and its analysis to provably solving these problems when the data is federated and when the over-air data communication modality is used for information exchange between the $K$ peer nodes and the center. We validate our theoretical claims with extensive numerical experiments.  ( 2 min )
    Wasserstein GANs with Gradient Penalty Compute Congested Transport. (arXiv:2109.00528v2 [cs.LG] UPDATED)
    Wasserstein GANs with Gradient Penalty (WGAN-GP) are a very popular method for training generative models to produce high quality synthetic data. While WGAN-GP were initially developed to calculate the Wasserstein 1 distance between generated and real data, recent works (e.g. [23]) have provided empirical evidence that this does not occur, and have argued that WGAN-GP perform well not in spite of this issue, but because of it. In this paper we show for the first time that WGAN-GP compute the minimum of a different optimal transport problem, the so-called congested transport [7]. Congested transport determines the cost of moving one distribution to another under a transport model that penalizes congestion. For WGAN-GP, we find that the congestion penalty has a spatially varying component determined by the sampling strategy used in [12] which acts like a local speed limit, making congestion cost less in some regions than others. This aspect of the congested transport problem is new, in that the congestion penalty turns out to be unbounded and depends on the distributions to be transported, and so we provide the necessary mathematical proofs for this setting. One facet of our discovery is a formula connecting the gradient of solutions to the optimization problem in WGAN-GP to the time averaged momentum of the optimal mass flow. This is in contrast to the gradient of Kantorovich potentials for the Wasserstein 1 distance, which is just the normalized direction of flow. Based on this and other considerations, we speculate on how our results explain the observed performance of WGAN-GP. Beyond applications to GANs, our theorems also point to the possibility of approximately solving large scale congested transport problems using neural network techniques.  ( 3 min )
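For readers who want the object being analyzed, the penalty itself is computed along random interpolates between real and generated samples; a standard sketch (following Gulrajani et al.'s WGAN-GP formulation, which is presumably the sampling strategy the abstract's reference [12] refers to) is:

```python
# Standard WGAN-GP gradient penalty: penalize critic gradients whose norm
# deviates from 1 at points sampled on lines between real and fake data.
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    norms = grads.flatten(1).norm(2, dim=1)
    return lam * ((norms - 1.0) ** 2).mean()
```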
    Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra. (arXiv:2206.15397v1 [cs.LG])
    K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very time-consuming (or even prohibitive) when these factors are large. In this paper, we theoretically show that, owing to the exponential-average construction paradigm of the Kronecker factors that is typically used, their eigen-spectrum must decay. We show numerically that in practice this decay is very rapid, leading to the idea that we could save substantial computation by only focusing on the first few eigen-modes when inverting the Kronecker-factors. Randomized Numerical Linear Algebra provides us with the necessary tools to do so. Numerical results show we obtain $\approx2.5\times$ reduction in per-epoch time and $\approx3.3\times$ reduction in time to target accuracy. We compare our proposed K-FAC sped-up versions with a more computationally efficient NG implementation, SENG, and observe we perform on par with it.  ( 2 min )
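The core computational idea can be sketched in a few lines (an illustration of truncating to the leading eigenmodes, not the authors' implementation; the damping constant and rank are illustrative):

```python
# Approximate the inverse of a damped Kronecker factor from its top-k
# eigenmodes, obtained via randomized SVD instead of a full
# eigendecomposition; the fast-decaying spectrum makes truncation cheap.
import numpy as np
from sklearn.utils.extmath import randomized_svd

def approx_inverse(K, k=32, damping=1e-3):
    # K: symmetric PSD Kronecker factor
    U, s, _ = randomized_svd(K, n_components=k, random_state=0)
    low_rank = U @ np.diag(1.0 / (s + damping)) @ U.T
    tail = (np.eye(K.shape[0]) - U @ U.T) / damping   # damped identity on the rest
    return low_rank + tail
```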
    Best of Both Worlds Model Selection. (arXiv:2206.14912v1 [cs.LG])
We study the problem of model selection in bandit scenarios in the presence of nested policy classes, with the goal of obtaining simultaneous adversarial and stochastic ("best of both worlds") high-probability regret guarantees. Our approach requires that each base learner comes with a candidate regret bound that may or may not hold, while our meta algorithm plays each base learner according to a schedule that keeps the base learners' candidate regret bounds balanced until they are detected to violate their guarantees. We develop careful mis-specification tests specifically designed to blend the above model selection criterion with the ability to leverage the (potentially benign) nature of the environment. We recover the model selection guarantees of the CORRAL algorithm for adversarial environments, but with the additional benefit of achieving high-probability regret bounds, specifically in the case of nested adversarial linear bandits. More importantly, our model selection results also hold simultaneously in stochastic environments under gap assumptions. These are the first theoretical results that achieve best of both worlds (stochastic and adversarial) guarantees while performing model selection in (linear) bandit scenarios.  ( 2 min )
Learning Nonparametric Ordinary Differential Equations: Application to Sparse and Noisy Data. (arXiv:2206.15215v1 [stat.ML])
Learning nonparametric systems of Ordinary Differential Equations (ODEs) $\dot x = f(t,x)$ from noisy and sparse data is an emerging machine learning topic. We use the well-developed theory of Reproducing Kernel Hilbert Spaces (RKHS) to define candidates for $f$ for which the solution of the ODE exists and is unique. Learning $f$ consists of solving a constrained optimization problem in an RKHS. We propose a penalty method that iteratively uses the Representer theorem and Euler approximations to provide a numerical solution. We prove a generalization bound for the $L^2$ distance between $x$ and its estimator. Experiments are provided for the FitzHugh-Nagumo oscillator and for the prediction of the Amyloid level in the cortex of aging subjects. In both cases, we show competitive results when compared with the state of the art.  ( 2 min )
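A drastically simplified sketch of the flavor of this approach (kernel ridge regression on Euler finite-difference derivative estimates, for an autonomous $f(x)$; the paper's constrained penalty method with Representer-theorem iterations is richer than this):

```python
# Toy version: estimate f in x' = f(x) from a sampled trajectory by
# regressing Euler derivative estimates on states, with an RBF kernel
# so that the estimate of f lives in an RKHS.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

t = np.linspace(0.0, 10.0, 200)
x = np.sin(t)                               # toy observed trajectory
dx = (x[1:] - x[:-1]) / (t[1:] - t[:-1])    # Euler derivative estimates

f_hat = KernelRidge(kernel="rbf", alpha=1e-2, gamma=1.0)
f_hat.fit(x[:-1].reshape(-1, 1), dx)
```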
    Fair Policy Targeting. (arXiv:2005.12395v3 [econ.EM] UPDATED)
    One of the major concerns of targeting interventions on individuals in social welfare programs is discrimination: individualized treatments may induce disparities across sensitive attributes such as age, gender, or race. This paper addresses the question of the design of fair and efficient treatment allocation rules. We adopt the non-maleficence perspective of first do no harm: we select the fairest allocation within the Pareto frontier. We cast the optimization into a mixed-integer linear program formulation, which can be solved using off-the-shelf algorithms. We derive regret bounds on the unfairness of the estimated policy function and small sample guarantees on the Pareto frontier under general notions of fairness. Finally, we illustrate our method using an application from education economics.  ( 2 min )
  • Open

    Phi Phi
I was reading something this afternoon and ran across φ(φ(m)) and thought that was unusual. I often run across φ(m), the number of positive integers less than m and relatively prime to m, but don’t often see Euler’s phi function iterated. Application of φ∘φ This section will give an example of a theorem where φ(φ(m)) […] Phi Phi first appeared on John D. Cook.  ( 5 min )

  • Open

    [R] Proprietary ML model in research paper
    I am writing a research paper, and in it I use a proprietary ML model I made. I want to show the model's results and I can explain how it works, but I don't want to explicitly provide the model/its code. Is that commonplace in research papers or must I include specifics to show validity? submitted by /u/Typical-Ad-7443 [link] [comments]  ( 87 min )
    [D][P] Ideas about how to model from a dataset with columns containing arrays of data?
Hello. I have built a dataset that contains results of experiments I have been doing on some physical materials. Each row contains summary data for each piece, like width, height, weight, etc. Then I have several columns whose values are arrays. Each one of these columns contains a list of tuples, for example (162636363, 1373.8377). The first number is a timestamp, the second one the magnitude of a force applied to the material (or, for instance, the position where the force was applied, contact duration, etc.). We have hundreds or even thousands of tuples in each column. So, all columns represent measurements of the experiments done on a particular material. We are recording when the material is damaged, since we want to predict its lifetime when the material is exposed to repetitive forces. I'm wondering what to do with those array values. One option is to sort the tuple lists by timestamp and then treat the readings as a vector of a predefined dimension. But I have never fed this kind of data to a boosted tree model/framework like XGBoost. The only experience I had feeding long vectors to a model was when doing some NLP; in that case the vectors were representations of words. Do you think a vector made of all my experiments on a material can be treated as an embedding in a way? If so, what is the recommended way to proceed with this data in the modeling stage? Time series perhaps? I'd appreciate your ideas and comments. Thanks!! submitted by /u/iblysa [link] [comments]  ( 87 min )
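One hedged way to start (before reaching for sequence models): collapse each variable-length list of (timestamp, magnitude) tuples into fixed-length summary features that a tree model like XGBoost can consume. The column name and feature choices below are hypothetical.

```python
# Turn a list of (timestamp, magnitude) tuples into fixed-length summary
# features; assumes each list is non-empty.
import numpy as np
import pandas as pd

def summarize_events(events):
    events = sorted(events)                  # sort by timestamp
    mags = np.array([m for _, m in events], dtype=float)
    ts = [t for t, _ in events]
    gaps = np.diff(ts) if len(ts) > 1 else np.array([0.0])
    return {
        "n_events": len(mags),
        "mag_mean": mags.mean(), "mag_max": mags.max(), "mag_std": mags.std(),
        "gap_mean": float(gaps.mean()),      # time between loadings
    }

# df["force_events"] holds the tuple lists; expand into feature columns:
# feats = pd.DataFrame(df["force_events"].map(summarize_events).tolist())
```

If the summary features throw away too much, the fixed-length-vector idea you mention (resampled onto a common grid) or explicit time-series / survival models would be the next things to try.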
    [D] Usage of the [class] token in ViT
So I've read up on ViT, and while it's an impressive architecture, I seem to notice that they are using a [class] token to get the actual class from an input image (see the figure below). [Figure: Architecture of ViT] While I know that it's standard to use an extra token in this fashion, since the encoder spits out one embedding for every input token (or patch in this case), I was wondering why we don't simply concatenate all the embeddings before feeding them into the MLP head (of an appropriate size)? It seems to me like we are discarding a lot of information here that could be helpful in the classification task. It's true, in theory, that the attention should take care of that, but do you know of any papers where this concatenation strategy has been tried? Does it even make sense? Cheers! submitted by /u/MurlocXYZ [link] [comments]  ( 87 min )
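Here is a minimal sketch of the alternatives the post describes: dropping the [class] token and feeding all patch embeddings to the head, either concatenated or mean-pooled (shapes illustrative; mean pooling is, if memory serves, the GAP variant ablated in the ViT paper):

```python
# Two class-token-free readouts over ViT encoder outputs.
import torch
import torch.nn as nn

n_patches, dim, n_classes = 196, 768, 1000
tokens = torch.randn(8, n_patches, dim)          # encoder output, no [class]

concat_head = nn.Linear(n_patches * dim, n_classes)
logits_concat = concat_head(tokens.flatten(1))   # keeps every embedding

pool_head = nn.Linear(dim, n_classes)
logits_pool = pool_head(tokens.mean(dim=1))      # global average pooling
```

The concatenation head costs n_patches times more parameters and ties the model to a fixed patch count, which is one practical reason it is rarely used.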
    [D] Creating a neural network for my daughter's sake. Need advice on acronym.
Hi, very long time lurker here. I'm planning to propose an end-to-end architecture for my daughter's sake. Data is biomedical and any CNN is well capable of classifying it at over 95% accuracy (easy data, u know!). However, I need to come up with an acronym to fit my daughter's name. Her name is DURU and here is what I came up with: D- Deep (Deep like you know, deep learning) U- Unified (I may use multiple models to form an ensemble or feature concat, which will make it unified) R- Residual (I may use residual connections between CNN blocks. Though not flashy right now.) R- Recommender (Could use the recommender keyword, since I'm putting down sort of a Computer Aided Diagnosis Framework thingy) R- Another R thing is welcome. U - I need another U and I'm totally out of words. Three letters is all I came up with. Couldn't find a word for the 4th letter that makes sense. U-net? I'm not segmenting anything. But if it was a segmentation dataset I might have come up with DUR-UNet, which would make sense. I need a final keyword starting with U which is applicable to CNNs. It could be a minor trick to cope with overfitting, a loss function, an activation function, etc. It could also be a filler term like Unified. Hope we can come up with a solution. submitted by /u/cltexe [link] [comments]  ( 87 min )
    [R] Introducing causal inference in the energy-efficient building design process
I am very excited to share our latest research: causal inference in the scenario of an energy-efficient building design, to answer "what-if" questions during the design process. Abs: "What-if" questions are intuitively generated and commonly asked during the design process. Engineers and architects inherently need to make design decisions, progressing from one phase to another. They either use empirical domain experience, simulations, or data-driven methods to provide consequential feedback. We take an example from the interdisciplinary domain of energy-efficient building design to argue that the current methods for decision support have four limitations: 1. Less carefully inspected parametric independence raises the risks of biased results and spurious relationships. 2. The integration …  ( 88 min )
[P] How to improve performance of face recognition using dlib?
I am using dlib.get_frontal_face_detector(), and for large images (several MB) it takes a lot of time to detect a face. What are the ways to increase the speed of face detection without sacrificing accuracy? I cannot use GPU/CUDA, sadly... submitted by /u/glorsh66 [link] [comments]  ( 86 min )
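One common speedup worth trying (a sketch, not guaranteed lossless for small faces): run the detector on a downscaled copy and rescale the returned boxes. The HOG-based detector's cost grows with pixel count, so multi-megapixel photos dominate runtime.

```python
# Detect on a downscaled image, then map boxes back to full resolution.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def detect_faces(img, max_side=800):
    scale = max_side / max(img.shape[:2])
    small = cv2.resize(img, None, fx=scale, fy=scale) if scale < 1 else img
    rects = detector(small, 0)                 # 0 = no upsampling, faster
    s = 1.0 / scale if scale < 1 else 1.0
    return [dlib.rectangle(int(r.left() * s), int(r.top() * s),
                           int(r.right() * s), int(r.bottom() * s))
            for r in rects]
```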
    [D] Moody Actor Critic
Generally, actor-critic algorithms have one neural net with a linear head for each output: one to give the policy and one to give the value. But humans change decisions and how they think based on their mood. I wanted to incorporate this into a standard actor-critic like A2C/A3C. I wanted to add another actor to this architecture that represented a certain mood, where its objective was not to maximize the reward but something else that I have in mind. I don't see any such literature in the field and I don't know how to add more actors. Is it not possible to have multiple actors with one critic? Has this been passed on by the community for a lack of potential? submitted by /u/darthsocker [link] [comments]  ( 87 min )
[P] Upgini 1.0 is released (a Python library for data search through AutoML)
Upgini is a simple feature search & enrichment library in Python. With Upgini, you spend less time on external data search and feature engineering, which will be done for you automatically. Just use your labeled dataset to initiate a search through thousands of features and data sources, including public datasets and scraped data shared by the data science community. Only the relevant features that improve the prediction power of your ML model are returned. Motivation: for most supervised ML models, external data & features boost accuracy significantly better than any hyperparameter tuning. But the lack of automated and time-efficient search tools for external data blocks massive adoption of external features in ML pipelines. We want to radically simplify feature search and delivery for ML pipelines to make external data a standard approach, like hyperparameter tuning for machine learning nowadays. Mission: Democratize access to data sources for the data science community. 📊 Data coverage and statistics Total: 239 countries and up to 41 years of history. More info about the library. To install Upgini from PyPI run pip install -U upgini Full release notes: https://github.com/upgini/upgini Try the online demo at Colab. submitted by /u/AnnualLimp1418 [link] [comments]  ( 87 min )
    [P] Albumentations 1.2 is released (a Python library for image augmentation)
The new release of a fast and flexible library for image augmentation includes: New augmentations: UnsharpMask sharpens the input image using Unsharp Masking processing and overlays the result with the original image. PixelDropout randomly replaces pixels with the passed value. RingingOvershoot creates ringing or overshoot artifacts by convolving the image with a 2D sinc filter. AdvancedBlur blurs the input image using a Generalized Normal filter with randomly selected parameters. It also adds multiplicative noise to generated kernel before convolution. Improvements and bug fixes Fixed all np.random use cases to prevent identical values when using multiprocessing. Also, we fixed corner cases and made improvements for many augmentations. Release notes Full release notes are available at https://github.com/albumentations-team/albumentations/releases/tag/1.2.0 Installation As always, you can install the latest version of the library by running: pip install -U albumentations submitted by /u/alexparinov [link] [comments]  ( 87 min )
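A quick usage sketch of the four new transforms named above (parameter values are illustrative; check the release notes for the exact defaults):

```python
# Compose the new 1.2 augmentations into one pipeline.
import albumentations as A

aug = A.Compose([
    A.UnsharpMask(p=0.5),
    A.PixelDropout(dropout_prob=0.01, p=0.5),
    A.RingingOvershoot(p=0.3),
    A.AdvancedBlur(p=0.3),
])
# augmented = aug(image=image)["image"]
```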
    [P] Sharing an Interactive Research Demo on the Cloud
    I am curious to hear what you usually use to develop interactive versions of your research models! And, if you have any, I'd be excited to see some examples for inspiration 😊. On that note, about 2 weeks ago, I shared an article on developing a Super-Resolution GAN Research Demo in Lightning on r/MachineLearning: Bottom-up look at the new Lightning Framework for building anything from production-ready ML systems to research demos Running research demos locally is not super useful by itself (unless you maybe do that live at a poster session -- I still have painful memories of doing that with privacy GAN demo in Flask), so this is a follow-up article on deploying the App on the Cloud: Sharing Deep Learning Research Models with Lightning Part 2: Leveraging the Cloud. Also, curious to hear what you think! submitted by /u/seraschka [link] [comments]  ( 88 min )
    [N][R][CfP] Workshop on Artificial Intelligence for Strategy Games @ AIIDE 22
    Hello Everyone! My name is Derek, and I am a co-chair for the Workshop on AI for Strategy Games at AIIDE this year. I wanted to share some info about the workshop for those that may be interested in discussing the future of AI for strategy games or looking to publish/get feedback on any work research you are doing with strategy games. Feel free to message me if you have any questions! Workshop website: https://skatgame.net/mburo/aiide22ws/ Submission deadline: July 29, 2022 Topics This workshop welcomes original research contributions, position papers, competition AI system descriptions, and post-mortem game analyses in the area of AI for strategy games --- including modern video strategy games (such as FPS and RTS games), and turn based games and puzzles. Topics include, but are not r…  ( 89 min )
    [D] What is considered a "large" model?
Curious about the usage of the word "large" in the research community and in papers as a descriptor. About 3 years ago, BERT-Large was considered large at 345 million parameters. Today we have an 11B-parameter T5 model and larger. When describing models in papers, is there consensus as to what we consider a "large" model, or a set of categories to describe models based on their size? submitted by /u/certain_entropy [link] [comments]  ( 88 min )
[D] Are there still any SOTA architectures trainable from scratch for a student?
    When I say "SOTA" I'm talking about recent architectures like ViT, BERT, GPT-like models.. Is it possible to train any of these from scratch (no pre-trained checkpoint) with low resources (Colab, Colab pro) ? submitted by /u/Silver_Doughnut_8175 [link] [comments]  ( 91 min )
    [D] Loss Function, Uncertainty
Hello members, so my question is: suppose we have a model or architecture with an image classifier at the end, trained on MNIST images. We need to train the model such that when an image is passed through the classifier, it outputs its results with some uncertainty in its predictions. We need to use that uncertainty to develop a loss function to train the whole model, as we can't use the true labels of the images. Any resources or ideas related to the above that could be helpful, please share with me. Any suggestions will be appreciated. Thanks submitted by /u/Anonymous_Guy_12 [link] [comments]  ( 86 min )
    [D] Algorithms for Anomaly Detection
Hi guys, I am dealing with 1000s of devices distributed over the whole world. These devices log and upload events (e.g. various kinds of device faults), including a time stamp, to a database. My task now is to analyze these time series of events, detect anomalies, and then automatically send notifications about these anomalies. Anomalies I want to detect may include things like: - sudden spikes in the number of events - sudden changes of the type of events - long-term drift of the number of events - etc. etc. Any advice on suitable algorithms for this kind of problem and/or relevant literature would be highly appreciated. Thanks! 👍 submitted by /u/RafiRafiRafiRafi [link] [comments]  ( 93 min )
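A simple baseline worth trying before anything fancy: rolling z-scores on resampled event counts per device, flagging windows far outside the recent mean. Thresholds and window sizes below are illustrative.

```python
# Flag count spikes with a rolling z-score over hourly event counts.
import pandas as pd

def spike_alerts(event_times, freq="1H", window=48, z_thresh=4.0):
    # event_times: iterable of timestamps for one device
    counts = pd.Series(1, index=pd.DatetimeIndex(event_times)).resample(freq).sum()
    mu = counts.rolling(window, min_periods=8).mean()
    sd = counts.rolling(window, min_periods=8).std()
    z = (counts - mu) / sd
    return counts[z > z_thresh]        # timestamps of anomalous spikes
```

Changes in event-type mix can be handled the same way per event type, and slow drift shows up if you compare the rolling mean against a much longer baseline window. For literature, searching for anomaly detection in event streams or count time series is a good starting point.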
    [P] [R] Automated Essay Scoring Systems for other languages
Hey guys, working on an AES project. Just wanted to know if there exists an AES system that can be trained on languages like Swahili, Arabic, Hindi, etc. (languages with almost no AES studies done). It would be very helpful if you could guide me through this; any other tips/pointers towards this task are much appreciated, and I would love it if someone could point me in the right direction. submitted by /u/NeoKoseii [link] [comments]  ( 86 min )
    [R] RankSEG: A Consistent Ranking-based Framework for Segmentation
    I am very excited to share our latest research: a new framework RankSEG on (image) segmentation. Abs: In this paper, we establish a theoretical foundation of segmentation with respect to the Dice/IoU metrics, including the Bayes rule and Dice/IoU-calibration, analogous to classification-calibration or Fisher consistency in classification. We prove that the existing thresholding-based framework with most operating losses are NOT consistent with respect to the Dice/IoU metrics, and thus may lead to a suboptimal solution. To address this pitfall, we propose a novel consistent ranking-based framework, namely RankDice/RankIoU, inspired by plug-in rules of the Bayes segmentation rule. Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in lar…  ( 88 min )
    [D] Why are transformers still being used?
We already have architecture(s) which are supposed to fix one of the biggest issues with transformers, namely that they scale quadratically with input size. The Performer scales linearly, which should allow for much bigger context windows, yet looking at recent large language models from major players, all of them seem to be using the old transformer save for some minor improvements. The only exception was Flamingo, which had to use a Perceiver because images are huge. So why haven't we ditched the transformer yet? submitted by /u/DickMan64 [link] [comments]  ( 92 min )
    [N] Introducing Anomalib: A library for benchmarking, developing and deploying deep learning anomaly detection algorithms by Intel
Anomalib is a machine learning library developed by AI researchers from Intel which implements state-of-the-art algorithms for anomaly detection. Anomaly detection is a popular use case in the industrial sector, and such algorithms can help provide real-time feedback to manufacturers on how well their production lines are performing. Anomaly detection is a challenging problem, often due to a biased dataset. Anomalous images can be scarce, therefore these algorithms are trained on good images in an unsupervised fashion. By learning normality, upon inference the models can detect whether images are anomalous or not. Anomalib was built with a PyTorch Lightning backbone and offers an easy way to deploy the models with OpenVINO for inference speedup. Link to the github repo: https://github.com/openvinotoolkit/anomalib Link to a tutorial on how to train your custom dataset with anomalib: https://github.com/openvinotoolkit/anomalib/tree/development/docs/blog/001-train-custom-dataset Please feel free to check out the repo and give us your feedback submitted by /u/alder-ice [link] [comments]  ( 87 min )
    [D] On advisors and PhD students
I think the answer to this question depends heavily on the area at hand. That is why I am asking here, even though this question has been asked elsewhere a gazillion times. How much does your advisor help/contribute? How often do you meet? I am especially interested in people who have published papers. Who proposed the problem, and then found a solution? How much of that solution was joint work vs. either of you submitting ideas to the other and being approved or rejected? How satisfied/dissatisfied do you feel with respect to your advisor? Have you had multiple advisors? If so, how do they compare? Let me start by sharing my experience. I always take the initiative when organizing a meeting with my advisor; if I didn't say anything, we probably wouldn't meet. I send him biweekly emails with my progress. Usually this entails a write-up explaining my ideas and their development. I think he skims through it, but he definitely does not read it carefully or go through the details. When we have a meeting I generally have to explain the content of the write-up. In terms of the content itself, he tells me whether the ideas/problem seem sound or not, but does not propose improvements. Sometimes, he proposes other ideas that would imply a significant shift of my current work, which honestly I tend to reject because I have already invested a great deal of time in my ideas and I am more emotionally attached to them (I know this latter point isn't good practice). Overall, I don't know how to feel because I don't really know what's generally expected. If I had to choose, however, I'd say I feel mildly satisfied. What's your experience? submitted by /u/carlml [link] [comments]  ( 100 min )
    [Discussion] Regarding Long Term Memory in NLP Models
Does anyone know if there exists an NLP model, like LaMDA, that takes every conversation and attempts to update its weights in order to incorporate it into its training? My thought process would be: instead of using attention and a subsection of the conversation to generate a response, it takes everything. Basically, everything gets backpropagated and adjusts the weights. This way the model might begin to "remember" its previous conversations. This may be a stretch and perhaps I am missing something fundamental, but it seems like an interesting experiment. I'd love to continue this conversation and elaborate more in the comments. submitted by /u/gabe415160 [link] [comments]  ( 84 min )
    [D] emerging fields of ML that will help mankind
    Hey all, This is a question I've been asking myself lately; what are the fields of ML which show the most promise in helping mankind in non-frivolous ways (e.g. not animojis)? A few years ago I remember an article describing how one of Microsoft's object detection services helped give 'sight' to the blind by describing what's in the room around them through its computer vision. Another one that I found inspiring was about assisting those with locked-in-syndrome by mapping their brain waves and/or eye movement to certain images, words or letters (I forget which). submitted by /u/lituga [link] [comments]  ( 84 min )
  • Open

    DARK DERELICT CITY | RAW UNSCALED | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 84 min )
This is what happens when you allow a chatbot to be trained by the public.
    submitted by /u/LeglessLoach [link] [comments]  ( 83 min )
    AIs that run on language models aren't so much intelligences as world-generators. Their inner workings are mysterious; humans will need to intuit their outputs and become mystics -- called "prompt engineers" -- as a result
    submitted by /u/cold-depths [link] [comments]  ( 85 min )
    Made with Dalle-2 A.i
    submitted by /u/OneFinding1429 [link] [comments]  ( 84 min )
    pixelz.ai updates 👇🏽
    submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    How well would it work if artificial intelligence were used to summarize and rewrite a non-fiction book and have the AI automatically remove all of the author's personal experiences?
Then you would have a whole new book, which would be better to read. Wouldn't that also pass the copyright from the author to the developer of the AI? Second question: Is there an AI that has a really realistic voice, which you can use to make audiobooks? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 85 min )
Looking for admins for a Discord-based non-profit project. No time commitment, just want to have some AI experts to help answer community member questions! DM for more info :)
    submitted by /u/Accomplished_Head5 [link] [comments]  ( 85 min )
    Steph Curry's former coach says AI can help train the next NBA champions
    submitted by /u/estasfuera [link] [comments]  ( 85 min )
    What happens if we put a ‘sentient’ AI inside of a lab-grown brain?
    submitted by /u/estasfuera [link] [comments]  ( 85 min )
    Open-source language AI challenges big tech’s models
    submitted by /u/bperki8 [link] [comments]  ( 85 min )
Google's latest image AI beats Imagen (Google's 4-week-old image AI), which itself beats DALL-E 2.
    submitted by /u/Hallowmew [link] [comments]  ( 86 min )
AI redraws the TikTok logo (https://www.craiyon.com/)
    submitted by /u/Various_Yoghurt1859 [link] [comments]  ( 84 min )
    Computing machinery and intelligence
I was just reading the research paper "Computing Machinery and Intelligence", a paper that Alan Turing published in 1950, and I did not quite understand a section of the paper where he wrote about the parameters of the machine to be considered in the "imitation game". It would be of great help if anyone could explain these parameters, especially the third parameter. [screenshot of the relevant passage] submitted by /u/Huckleberry-4915 [link] [comments]  ( 87 min )
    Altis AI Personal Trainer gamifies movement instruction
    submitted by /u/NinaMJ [link] [comments]  ( 84 min )
AI will not destroy humanity but rather save it!
    submitted by /u/MufBoiLegend420 [link] [comments]  ( 85 min )
    World’s Top 50 Innovators 2022
    submitted by /u/chelsea_bear [link] [comments]  ( 85 min )
    Codeformer - Face Image Restoration model
    submitted by /u/imapurplemango [link] [comments]  ( 85 min )
    "Einstein" - Created on Pixelz.ai
submitted by /u/pixelz_ai [link] [comments]  ( 84 min )
    What voice-changing apps are available right now?
    I know there's one or two options for real-time voice changing that don't sound so convincing (that is, they sound robotic). I was wondering if there's anything that might sound better but doesn't operate in real time? I plan on voicing male and female characters in a video and I have plenty of time to edit the voice clips together but I need them to sound convincing. Free stuff is preferred but I'd consider paying money if that's the only way to get good results. submitted by /u/outsm0ked [link] [comments]  ( 86 min )
  • Open

    Are exploration and credit assignment independent? What's your opinion?
    submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 85 min )
    Optimal State-Value Function vs Optimal Action-Value Function
In Sutton's book, page 63, there is this proof/statement: [screenshot of the equations] Can anyone explain or point out some reference that explains: Why is it that v*(s) = max_a q*(s,a)? How can I get Eq. (3.18)? Thank you! submitted by /u/rlopes404 [link] [comments]  ( 86 min )
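For anyone landing here, a sketch of the standard argument in what should be Sutton & Barto's notation (worth double-checking against the book):

```latex
% An optimal policy acts greedily in every state, so the optimal state
% value equals the value of the best available action:
\[
  v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a).
\]
% Expanding q_* one step (conditioning on the next reward and state)
% then gives the Bellman optimality equation, i.e. Eq. (3.18):
\[
  v_*(s)
  = \max_a \mathbb{E}\bigl[ R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a \bigr]
  = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr].
\]
```

Intuitively: if v*(s) were less than max_a q*(s,a), the policy could be improved by taking the maximizing action in s, contradicting optimality.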
    Ideal size of the visual observation
Hi, I am using the MPE (https://github.com/openai/multiagent-particle-envs) and I'm planning to use a visual observation. I was wondering, what size should it be? I assume that if it is too large and the agents are only a few, I am wasting lots of compute for nothing, and the observation also gets much noisier. But how do I find the best size? 60x60x3, for example? submitted by /u/No_Possibility_7588 [link] [comments]  ( 85 min )
  • Open

    Identifying Disfluencies in Natural Speech
    Posted by Dan Walker and Dan Liebling, Software Engineers, Google Research People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus: But that's it's not, it's not, it's, uh, it's a word play on what you just said. It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Remo…  ( 27 min )
    Minerva: Solving Quantitative Reasoning Problems with Language Models
    Posted by Ethan Dyer and Guy Gur-Ari, Research Scientists, Google Research, Blueshift Team Language models have demonstrated remarkable performance on a variety of natural language tasks — indeed, a general lesson from many works, including BERT, GPT-3, Gopher, and PaLM, has been that neural networks trained on diverse data at large scale in an unsupervised way can perform well on a variety of tasks. Quantitative reasoning is one area in which language models still fall far short of human-level performance. Solving mathematical and scientific questions requires a combination of skills, including correctly parsing a question with natural language and mathematical notation, recalling relevant formulas and constants, and generating step-by-step solutions involving numerical calculations and…  ( 25 min )
  • Open

    The Riemann Hypothesis in One Picture
    I wrote this article for machine learning and analytic professionals in general. Actually, I describe a new visual, simple, intuitive method for supervised classification. It involves synthetic data and explainable AI. But at the same time, I describe in layman’s terms the Riemann Hypothesis (RH). Also, I offer a new perspective on the subject for… Read More »The Riemann Hypothesis in One Picture The post The Riemann Hypothesis in One Picture appeared first on Data Science Central.  ( 21 min )
  • Open

    Secure Amazon SageMaker Studio presigned URLs Part 1: Foundational infrastructure
    You can access Amazon SageMaker Studio notebooks from the Amazon SageMaker console via AWS Identity and Access Management (IAM) authenticated federation from your identity provider (IdP), such as Okta. When a Studio user opens the notebook link, Studio validates the federated user’s IAM policy to authorize access, and generates and resolves the presigned URL for […]  ( 6 min )
    Secure Amazon SageMaker Studio presigned URLs Part 2: Private API with JWT authentication
    In part 1 of this series, we demonstrated how to resolve an Amazon SageMaker Studio presigned URL from a corporate network using Amazon private VPC endpoints without traversing the internet. In this post, we will continue to build on top of the previous solution to demonstrate how to build a private API Gateway via Amazon API […]  ( 7 min )
  • Open

    Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE
    Some things are easy as A, B, C. But when it comes to autonomous vehicles, the key may be in one, two, three. Faction, a Bay Area-based startup and NVIDIA Inception member, is preparing to debut its business-to-business autonomous delivery service, accelerating its commercial deployment with three-wheel production electric vehicles purpose-built for driverless services. In Read article > The post Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE appeared first on NVIDIA Blog.  ( 5 min )
    The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More
    Turn the TV on. GeForce NOW is leveling up gaming in the living room. The Samsung Gaming Hub launched today, delivering GeForce NOW natively on 2022 Samsung Smart TVs. Plus, the SHIELD Software Experience Upgrade 9.1 is now rolling out to all NVIDIA SHIELD TVs, delivering new gaming features that improve GeForce NOW. Great living Read article > The post The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More appeared first on NVIDIA Blog.  ( 8 min )
  • Open

    Speculation on new SI prefixes
The SI prefixes giga and tera were adopted in 1960. The prefixes exa and peta were adopted in 1975, and zetta and yotta were adopted in 1991. Following this 15-year cadence, we should have adopted a few more prefixes by now. If we ever do introduce new prefixes, what might they be? The latest prefixes […] Speculation on new SI prefixes first appeared on John D. Cook.  ( 5 min )
  • Open

    byteLAKE’s CFD Suite (AI-accelerated CFD) — recommended hardware for AI training at the Edge (1/3)
    Blog post miniseries summarizing byteLAKE’s recommendation about hardware platforms to perform CFD Suite’s AI Training at the Edge.  ( 15 min )
  • Open

    FIGS: Attaining XGBoost-level performance with the interpretability and speed of CART
    FIGS (Fast Interpretable Greedy-tree Sums): A method for building interpretable models by simultaneously growing an ensemble of decision trees in competition with one another. Recent machine-learning advances have led to increasingly complex predictive models, often at the cost of interpretability. We often need interpretability, particularly in high-stakes applications such as in clinical decision-making; interpretable models help with all kinds of things, such as identifying errors, leveraging domain knowledge, and making speedy predictions. In this blog post we’ll cover FIGS, a new method for fitting an interpretable model that takes the form of a sum of trees. Real-world experiments and theoretical results show that FIGS can effectively adapt to a wide range of structure in data, ach…  ( 3 min )
  • Open

    How do I train a neural network on a small dataset?
I have a dataset with 7 input features and 1 output. The length of the dataset is just 260, which is small. How can I train a neural network with the help of Keras and achieve accuracy of over 80%? What should the architecture of the deep neural network be? submitted by /u/mono1110 [link] [comments]  ( 85 min )
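A hedged starting point for ~260 rows (assuming a binary output; the placeholder data below stands in for yours): keep the network tiny, regularize hard, and evaluate with k-fold cross-validation rather than a single split. Whether 80% is reachable depends on the data, not just the architecture.

```python
# Small, heavily regularized Keras model evaluated with 5-fold CV.
import numpy as np
from sklearn.model_selection import KFold
from tensorflow import keras

def make_model():
    return keras.Sequential([
        keras.layers.Input(shape=(7,)),
        keras.layers.Dense(16, activation="relu",
                           kernel_regularizer=keras.regularizers.l2(1e-3)),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),   # binary output assumed
    ])

X = np.random.rand(260, 7)                 # placeholder for your features
y = np.random.randint(0, 2, 260)           # placeholder for your labels
scores = []
for tr, va in KFold(5, shuffle=True, random_state=0).split(X):
    model = make_model()
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    stop = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
    model.fit(X[tr], y[tr], epochs=200, verbose=0,
              validation_data=(X[va], y[va]), callbacks=[stop])
    scores.append(model.evaluate(X[va], y[va], verbose=0)[1])
print(f"CV accuracy: {np.mean(scores):.3f}")
```

With this little data, gradient-boosted trees or logistic regression are often just as strong and worth benchmarking first.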
  • Open

    Building explainability into the components of machine-learning models
    Researchers develop tools to help data scientists make the features used in machine-learning models more understandable for end users.  ( 7 min )
  • Open

    Generalized Permutants and Graph GENEOs. (arXiv:2206.14798v1 [math.CO])
    In this paper we establish a bridge between Topological Data Analysis and Geometric Deep Learning, adapting the topological theory of group equivariant non-expansive operators (GENEOs) to act on the space of all graphs weighted on vertices or edges. This is done by showing how the general concept of GENEO can be used to transform graphs and to give information about their structure. This requires the introduction of the new concepts of generalized permutant and generalized permutant measure and the mathematical proof that these concepts allow us to build GENEOs between graphs. An experimental section concludes the paper, illustrating the possible use of our operators to extract information from graphs. This paper is part of a line of research devoted to developing a compositional and geometric theory of GENEOs for Geometric Deep Learning.  ( 2 min )
    DrumGAN VST: A Plugin for Drum Sound Analysis/Synthesis With Autoencoding Generative Adversarial Networks. (arXiv:2206.14723v1 [cs.SD])
    In contemporary popular music production, drum sound design is commonly performed by cumbersome browsing and processing of pre-recorded samples in sound libraries. One can also use specialized synthesis hardware, typically controlled through low-level, musically meaningless parameters. Today, the field of Deep Learning offers methods to control the synthesis process via learned high-level features and allows generating a wide variety of sounds. In this paper, we present DrumGAN VST, a plugin for synthesizing drum sounds using a Generative Adversarial Network. DrumGAN VST operates on 44.1 kHz sample-rate audio, offers independent and continuous instrument class controls, and features an encoding neural network that maps sounds into the GAN's latent space, enabling resynthesis and manipulation of pre-existing drum sounds. We provide numerous sound examples and a demo of the proposed VST plugin.  ( 2 min )
    Private Graph Extraction via Feature Explanations. (arXiv:2206.14724v1 [cs.LG])
    Privacy and interpretability are two of the important ingredients for achieving trustworthy machine learning. We study the interplay of these two aspects in graph machine learning through graph reconstruction attacks. The goal of the adversary here is to reconstruct the graph structure of the training data given access to model explanations. Based on the different kinds of auxiliary information available to the adversary, we propose several graph reconstruction attacks. We show that additional knowledge of post-hoc feature explanations substantially increases the success rate of these attacks. Further, we investigate in detail the differences between attack performance with respect to three different classes of explanation methods for graph neural networks: gradient-based, perturbation-based, and surrogate model-based methods. While gradient-based explanations reveal the most in terms of the graph structure, we find that these explanations do not always score high in utility. For the other two classes of explanations, privacy leakage increases with an increase in explanation utility. Finally, we propose a defense based on a randomized response mechanism for releasing the explanations which substantially reduces the attack success rate. Our anonymized code is available.  ( 2 min )
    On Monocular Depth Estimation and Uncertainty Quantification using Classification Approaches for Regression. (arXiv:2202.12369v2 [cs.CV] UPDATED)
    Monocular depth is important in many tasks, such as 3D reconstruction and autonomous driving. Deep learning based models achieve state-of-the-art performance in this field. A set of novel approaches for estimating monocular depth consists of transforming the regression task into a classification one. However, there is a lack of detailed descriptions and comparisons for Classification Approaches for Regression (CAR) in the community and no in-depth exploration of their potential for uncertainty estimation. To this end, this paper will introduce a taxonomy and summary of CAR approaches, a new uncertainty estimation solution for CAR, and a set of experiments on depth accuracy and uncertainty quantification for CAR-based models on KITTI dataset. The experiments reflect the differences in the portability of various CAR methods on two backbones. Meanwhile, the newly proposed method for uncertainty estimation can outperform the ensembling method with only one forward propagation.
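To make the CAR idea concrete, a minimal sketch (bin range and count are illustrative; this is the generic discretize-then-expectation readout, not any specific paper's head):

```python
# Classification-for-regression readout: softmax over depth bins, depth as
# the probability-weighted bin centers, entropy as a cheap uncertainty proxy.
import torch

bin_centers = torch.linspace(1.0, 80.0, steps=128)   # KITTI-like range, meters

def depth_and_uncertainty(logits):
    # logits: (batch, 128, H, W) per-pixel scores over depth bins
    p = logits.softmax(dim=1)
    depth = (p * bin_centers.view(1, -1, 1, 1)).sum(dim=1)   # expectation
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1)      # uncertainty proxy
    return depth, entropy
```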
    Ultra-sensitive Flexible Sponge-Sensor Array for Muscle Activities Detection and Human Limb Motion Recognition. (arXiv:2205.03238v2 [eess.SP] UPDATED)
    Human limb motion tracking and recognition play an important role in medical rehabilitation training, lower limb assistance, prosthetics design for amputees, feedback control for assistive robots, etc. Lightweight wearable sensors, including inertial sensors, surface electromyography sensors, and flexible strain/pressure sensors, are promising candidates to become the next-generation human motion capture devices. Herein, we present a wireless wearable device consisting of a sixteen-channel flexible sponge-based pressure sensor array to recognize various human lower limb motions by detecting contours on the human skin caused by calf gastrocnemius muscle actions. Each sensing element is a round porous structure of thin carbon nanotube/polydimethylsiloxane nanocomposites with a diameter of 4 mm and thickness of about 400 {\mu}m. Ten human subjects were recruited to perform ten different lower limb motions while wearing the developed device. The motion classification result with the support vector machine method shows a macro-recall of about 97.3% for all ten motions tested. This work demonstrates a portable wearable muscle activity detection device with a lower limb motion recognition application, which can potentially be used in assistive robot control, healthcare, sports monitoring, etc.
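    The classification stage described above can be sketched with a standard SVM pipeline; the synthetic features below merely stand in for the device's 16-channel pressure readings, and the `recall_macro` scorer mirrors the macro-recall metric reported in the abstract:

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        # Synthetic stand-ins for 16-channel pressure features and 10 motion classes;
        # the real features and windowing are specific to the device in the paper.
        rng = np.random.default_rng(0)
        X = rng.standard_normal((500, 16))
        y = rng.integers(0, 10, size=500)

        clf = SVC(kernel="rbf")
        scores = cross_val_score(clf, X, y, cv=5, scoring="recall_macro")
        print(scores.mean())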
    Variational Bayesian inference for CP tensor completion with side information. (arXiv:2206.12486v2 [cs.LG] UPDATED)
    We propose a message passing algorithm, based on variational Bayesian inference, for low-rank tensor completion with automatic rank determination in the canonical polyadic format when additional side information (SI) is given. The SI comes in the form of low-dimensional subspaces that contain the fiber spans of the tensor (columns, rows, tubes, etc.). We validate the regularization properties induced by SI with extensive numerical experiments on synthetic and real-world data and present results on tensor recovery and rank determination. The results show that the number of samples required for successful completion is significantly reduced in the presence of SI. We also discuss the origin of a bump in the phase transition curves that exists when the dimensionality of SI is comparable with that of the tensor.
    Fast algorithm for overcomplete order-3 tensor decomposition. (arXiv:2202.06442v2 [cs.LG] UPDATED)
    We develop the first fast spectral algorithm to decompose a random third-order tensor over $\mathbb{R}^d$ of rank up to $O(d^{3/2}/\text{polylog}(d))$. Our algorithm only involves simple linear algebra operations and can recover all components in time $O(d^{6.05})$ under the current matrix multiplication time. Prior to this work, comparable guarantees could only be achieved via sum-of-squares [Ma, Shi, Steurer 2016]. In contrast, fast algorithms [Hopkins, Schramm, Shi, Steurer 2016] could only decompose tensors of rank at most $O(d^{4/3}/\text{polylog}(d))$. Our algorithmic result rests on two key ingredients: a clean lifting of the third-order tensor to a sixth-order tensor, which can be expressed in the language of tensor networks, and a careful decomposition of the tensor network into a sequence of rectangular matrix multiplications, which allows us to give a fast implementation of the algorithm.
    Simulate Time-integrated Coarse-grained Molecular Dynamics with Geometric Machine Learning. (arXiv:2204.10348v2 [cs.LG] UPDATED)
    Molecular dynamics (MD) simulation is the workhorse of various scientific domains but is limited by high computational cost. Learning-based force fields have made major progress in accelerating ab-initio MD simulation but are still not fast enough for many real-world applications that require long-time MD simulation. In this paper, we adopt a different machine learning approach in which we coarse-grain a physical system using graph clustering and model the system evolution with a very large time-integration step using graph neural networks. A novel score-based GNN refinement module resolves the long-standing challenge of long-time simulation instability. Despite being trained only on short MD trajectory data, our learned simulator can generalize to unseen novel systems and simulate for much longer than the training trajectories. Properties requiring 10-100 ns level long-time dynamics can be accurately recovered at several orders of magnitude higher speed than classical force fields. We demonstrate the effectiveness of our method on two realistic complex systems: (1) single-chain coarse-grained polymers in implicit solvent; (2) multi-component Li-ion polymer electrolyte systems.
    A Learnable Variational Model for Joint Multimodal MRI Reconstruction and Synthesis. (arXiv:2204.03804v2 [eess.IV] UPDATED)
    Generating multi-contrast/multi-modal MRI of the same anatomy enriches diagnostic information but is limited in practice due to excessive data acquisition time. In this paper, we propose a novel deep-learning model for joint reconstruction and synthesis of multi-modal MRI using incomplete k-space data of several source modalities as inputs. The output of our model includes reconstructed images of the source modalities and a high-quality image synthesized in the target modality. Our proposed model is formulated as a variational problem that leverages several learnable modality-specific feature extractors and a multimodal synthesis module. We propose a learnable optimization algorithm to solve this model, which induces a multi-phase network whose parameters can be trained using multi-modal MRI data. Moreover, a bilevel-optimization framework is employed for robust parameter training. We demonstrate the effectiveness of our approach using extensive numerical experiments.
    Uniform Convergence Rates for Lipschitz Learning on Graphs. (arXiv:2111.12370v2 [math.NA] UPDATED)
    Lipschitz learning is a graph-based semi-supervised learning method where one extends labels from a labeled to an unlabeled data set by solving the infinity Laplace equation on a weighted graph. In this work we prove uniform convergence rates for solutions of the graph infinity Laplace equation as the number of vertices grows to infinity. Their continuum limits are absolutely minimizing Lipschitz extensions with respect to the geodesic metric of the domain where the graph vertices are sampled from. We work under very general assumptions on the graph weights, the set of labeled vertices, and the continuum domain. Our main contribution is that we obtain quantitative convergence rates even for very sparsely connected graphs, as they typically appear in applications like semi-supervised learning. In particular, our framework allows for graph bandwidths down to the connectivity radius. For proving this we first show a quantitative convergence statement for graph distance functions to geodesic distance functions in the continuum. Using the "comparison with distance functions" principle, we can pass these convergence statements to infinity harmonic functions and absolutely minimizing Lipschitz extensions.
    Overcoming Oscillations in Quantization-Aware Training. (arXiv:2203.11086v2 [cs.LG] UPDATED)
    When training neural networks with simulated quantization, we observe that quantized weights can, rather unexpectedly, oscillate between two grid-points. The importance of this effect and its impact on quantization-aware training (QAT) are not well understood or investigated in the literature. In this paper, we delve deeper into the phenomenon of weight oscillations and show that it can lead to a significant accuracy degradation due to wrongly estimated batch-normalization statistics during inference and increased noise during training. These effects are particularly pronounced in low-bit ($\leq$ 4-bits) quantization of efficient networks with depth-wise separable layers, such as MobileNets and EfficientNets. In our analysis we investigate several previously proposed QAT algorithms and show that most of these are unable to overcome oscillations. Finally, we propose two novel QAT algorithms to overcome oscillations during training: oscillation dampening and iterative weight freezing. We demonstrate that our algorithms achieve state-of-the-art accuracy for low-bit (3 & 4 bits) weight and activation quantization of efficient architectures, such as MobileNetV2, MobileNetV3, and EfficientNet-lite on ImageNet. Our source code is available at https://github.com/qualcomm-ai-research/oscillations-qat.
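    The oscillation phenomenon is easy to reproduce in miniature. In the sketch below (a toy example of our own, not the paper's dampening or freezing algorithms), a latent weight sits near a quantization grid boundary, and its fake-quantized value flips between the two adjacent grid points under straight-through-estimator updates:

        import torch

        def fake_quant(w, step=0.1):
            return torch.round(w / step) * step        # simulated (fake) quantization

        w = torch.tensor([0.149], requires_grad=True)  # latent weight near a grid boundary
        target = torch.tensor([0.15])                  # sits between grid points 0.1 and 0.2
        opt = torch.optim.SGD([w], lr=0.05)

        for t in range(8):
            wq = w + (fake_quant(w) - w).detach()      # straight-through estimator
            loss = (wq - target).pow(2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
            print(t, float(w), float(fake_quant(w)))   # quantized value flips 0.1 <-> 0.2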
    Order Constraints in Optimal Transport. (arXiv:2110.07275v2 [cs.LG] UPDATED)
    Optimal transport is a framework for comparing measures whereby a cost is incurred for transporting one measure to another. Recent works have aimed to improve optimal transport plans through the introduction of various forms of structure. We introduce novel order constraints into the optimal transport formulation to allow for the incorporation of structure. We define an efficient method for obtaining explainable solutions to the new formulation that scales far better than standard approaches. The theoretical properties of the method are provided. We demonstrate experimentally that order constraints improve explainability on the e-SNLI (Stanford Natural Language Inference) dataset, which includes human-annotated rationales, as well as on several image color transfer examples.
    Measuring Fairness under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v3 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. In more detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
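    For intuition, one textbook quantification method is Adjusted Classify & Count, which corrects a classifier's raw positive-prediction rate using its estimated true- and false-positive rates. The sketch below uses hypothetical numbers and is not necessarily the paper's exact estimator:

        import numpy as np

        # Adjusted Classify & Count: p = (observed positive rate - fpr) / (tpr - fpr).
        tpr, fpr = 0.85, 0.10                    # estimated on held-out labeled data
        rng = np.random.default_rng(0)
        preds = rng.random(10_000) < 0.40        # classifier's hard predictions on target data
        observed_rate = preds.mean()
        prevalence = (observed_rate - fpr) / (tpr - fpr)
        print(float(np.clip(prevalence, 0.0, 1.0)))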
    Bayesian Structure Learning with Generative Flow Networks. (arXiv:2202.13903v2 [cs.LG] UPDATED)
    In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.
    Hidden Parameter Recurrent State Space Models For Changing Dynamics Scenarios. (arXiv:2206.14697v1 [cs.LG])
    Recurrent State-space models (RSSMs) are highly expressive models for learning patterns in time series data and system identification. However, these models assume that the dynamics are fixed and unchanging, which is rarely the case in real-world scenarios. Many control applications exhibit tasks with similar but not identical dynamics, which can be modeled as a latent variable. We introduce Hidden Parameter Recurrent State Space Models (HiP-RSSMs), a framework that parametrizes a family of related dynamical systems with a low-dimensional set of latent factors. We present a simple and effective way of learning and performing inference over this Gaussian graphical model that avoids approximations like variational inference. We show that HiP-RSSMs outperform RSSMs and competing multi-task models on several challenging robotic benchmarks, both on real-world systems and in simulation.
    Physics-informed Guided Disentanglement in Generative Networks. (arXiv:2107.14229v3 [cs.CV] UPDATED)
    Image-to-image translation (i2i) networks suffer from entanglement effects in the presence of physics-related phenomena in the target domain (such as occlusions, fog, etc.), which together lower translation quality, controllability, and variability. In this paper, we build upon a collection of simple physics models and present a comprehensive method for disentangling visual traits in target images: we guide the process with a physical model that renders some of the target traits and learn the remaining ones. Because they produce explicit and interpretable outputs, our physical models (optimally regressed on the target) allow generating unseen scenarios in a controllable manner. We also extend our framework to neural-guided disentanglement, demonstrating its versatility. The results show that our disentanglement strategies dramatically improve performance, both qualitatively and quantitatively, in several challenging image translation scenarios.
    Acoustics-specific Piano Velocity Estimation. (arXiv:2203.16294v2 [cs.SD] UPDATED)
    Motivated by state-of-the-art psychological research, we note that a piano performance transcribed with existing Automatic Music Transcription (AMT) methods cannot be successfully resynthesized without affecting the artistic content of the performance. This is due to 1) the different mappings between MIDI parameters used by different instruments, and 2) the fact that musicians adapt their way of playing to the surrounding acoustic environment. To address this issue, we propose a methodology to build acoustics-specific AMT systems that are able to model the adaptations that musicians apply to convey their interpretation. Specifically, we train models tailored for virtual instruments in a modular architecture that takes as input an audio recording and the relative aligned music score, and outputs the acoustics-specific velocities of each note. We test different model shapes and show that the proposed methodology generally outperforms the usual AMT pipeline, which does not consider the specificities of the instrument and of the acoustic environment. Interestingly, such a methodology is extensible in a straightforward way, since only slight efforts are required to train models for the inference of other piano parameters, such as pedaling.
    On the R\'{e}nyi Cross-Entropy. (arXiv:2206.14329v1 [cs.IT])
    The R\'{e}nyi cross-entropy measure between two distributions, a generalization of the Shannon cross-entropy, was recently used as a loss function for the improved design of deep learning generative adversarial networks. In this work, we examine the properties of this measure and derive closed-form expressions for it when one of the distributions is fixed and when both distributions belong to the exponential family. We also analytically determine a formula for the cross-entropy rate for stationary Gaussian processes and for finite-alphabet Markov sources.  ( 2 min )
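    For reference, one common definition of the R\'{e}nyi cross-entropy of order $\alpha$ for discrete distributions $P$ and $Q$ (the paper's exact conventions may differ) is

        \[
          H_\alpha(P;Q) \;=\; \frac{1}{1-\alpha}\,\log \sum_{x} P(x)\,Q(x)^{\alpha-1},
          \qquad \alpha > 0,\ \alpha \neq 1,
        \]

    which recovers the Shannon cross-entropy $H(P;Q) = -\sum_{x} P(x)\log Q(x)$ in the limit $\alpha \to 1$.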
    Inferring Cyber Threat Intelligence -- A Knowledge Graph-based Approach. (arXiv:2102.05571v4 [cs.CR] UPDATED)
    Security analysts prepare threat analyses upon investigating an attack, an emerging cyber threat, or a recently discovered vulnerability. Threat intelligence on malware attacks and campaigns is shared in blog posts, reports, analyses, and tweets with varying technical details. Other security analysts use this intelligence to inform them of emerging threats, indicators of compromise, attack methods, and preventative measures. Collectively known as threat intelligence, it is typically in an unstructured format and, therefore, challenging to integrate seamlessly into existing intrusion detection and prevention systems (IDPS). In this paper, we propose a framework that aggregates and combines CTI - the openly available cyber threat intelligence information. The information is extracted and stored in a structured format using knowledge graphs such that the semantics of the threat intelligence can be preserved and shared at scale with other security analysts. We propose the first semi-supervised open-source knowledge graph (KG) framework, TINKER, to capture cyber threat information and its context. Following TINKER, we generate a Cyberthreat Intelligence Knowledge Graph (CTI-KG). We demonstrate the efficacy of CTI-KG using different use cases and its application for security analysts.
    Backdoor Detection in Reinforcement Learning. (arXiv:2202.03609v3 [cs.LG] UPDATED)
    As real-world applications of reinforcement learning (RL) become more popular, the safety and robustness of RL systems require more attention. A recent work reveals that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. a trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. We propose the problem of RL Backdoor Detection, aiming to address this safety vulnerability. An interesting observation drawn from extensive empirical studies is a trigger smoothness property: normal actions similar to the backdoor trigger actions can also trigger low performance of the trojan agent. Inspired by this observation, we propose a reinforcement learning solution, TrojanSeeker, to find approximate trigger actions for the trojan agents, and further propose an efficient approach to mitigate the trojan agents based on machine unlearning. Experiments show that our approach can correctly distinguish and mitigate all the trojan agents across various types of agents and environments.
    Supervised Training of Conditional Monge Maps. (arXiv:2206.14262v1 [cs.LG])
    Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another. That theory has been mostly used to estimate, given a pair of source and target probability measures $(\mu,\nu)$, a parameterized map $T_\theta$ that can efficiently map $\mu$ onto $\nu$. In many applications, such as predicting cell responses to treatments, the data measures $\mu,\nu$ (features of untreated/treated cells) that define optimal transport problems do not arise in isolation but are associated with a context $c$ (the treatment). To account for and incorporate that context in OT estimation, we introduce CondOT, an approach to estimate OT maps conditioned on a context variable, using several pairs of measures $(\mu_i, \nu_i)$ tagged with a context label $c_i$. Our goal is to learn, from a dataset of labeled pairs $\{(c_i, (\mu_i, \nu_i))\}$, a global map $\mathcal{T}_{\theta}$ which is not only expected to fit all pairs in the dataset, i.e., $\mathcal{T}_{\theta}(c_i) \sharp\mu_i \approx \nu_i$, but should also generalize to produce meaningful maps $\mathcal{T}_{\theta}(c_{\text{new}})$ conditioned on unseen contexts $c_{\text{new}}$. Our approach harnesses and provides a novel usage for partially input convex neural networks, for which we introduce a robust and efficient initialization strategy inspired by Gaussian approximations. We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells, using only observations of the effects of said perturbations separately.  ( 3 min )
    Some variational recipes for quantum field theories. (arXiv:2109.05547v3 [quant-ph] UPDATED)
    Rapid developments in quantum information technology show promising opportunities for simulating quantum field theory on near-term quantum devices. In this work, we formulate the theory of (time-dependent) variational quantum simulation of the 1+1 dimensional $\lambda \phi^4$ quantum field theory, including encoding, state preparation, and time evolution, with several numerical simulation results. These algorithms can be understood as near-term variational analogs of the Jordan-Lee-Preskill algorithm, the basic algorithm for simulating quantum field theory using universal quantum devices. In addition, we highlight the advantages of encoding with the harmonic oscillator basis based on the LSZ reduction formula, along with several computational efficiencies, such as those gained when implementing a bosonic version of the unitary coupled cluster ansatz to prepare initial states. We also discuss how to circumvent the "spectral crowding" problem in quantum field theory simulation and appraise our algorithm with both state and subspace fidelities.
    Fast learning from label proportions with small bags. (arXiv:2110.03426v4 [cs.LG] UPDATED)
    In learning from label proportions (LLP), the instances are grouped into bags, and the task is to learn an instance classifier given relative class proportions in training bags. LLP is useful when obtaining individual instance labels is impossible or costly. In this work, we focus on the case of small bags, which allows us to design an algorithm that explicitly considers all consistent instance label combinations. In particular, we propose an EM algorithm alternating between optimizing a general neural network instance classifier and incorporating bag-level annotations. Using two different image datasets, we experimentally compare this method with an approach based on normal approximation and two existing LLP methods. The results show that our approach converges faster to a comparable or better solution.
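    A minimal proportion-matching sketch of the LLP setting (a simplification of our own, not the paper's EM algorithm): push the mean predicted class distribution of each small bag toward its known label proportions:

        import torch
        import torch.nn.functional as F

        def bag_proportion_loss(logits, proportions):
            # logits: (bag_size, n_classes); proportions: (n_classes,), summing to 1
            mean_probs = F.softmax(logits, dim=1).mean(dim=0)
            return -(proportions * mean_probs.clamp_min(1e-12).log()).sum()

        model = torch.nn.Linear(16, 3)
        bag = torch.randn(5, 16)                 # a small bag of 5 unlabeled instances
        props = torch.tensor([0.4, 0.4, 0.2])    # known class proportions for this bag
        loss = bag_proportion_loss(model(bag), props)
        loss.backward()
        print(float(loss))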
    Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. (arXiv:2110.02642v5 [cs.LG] UPDATED)
    Unsupervised detection of anomaly points in time series is a challenging problem, which requires the model to derive a distinguishable criterion. Previous methods tackle the problem mainly through learning pointwise representations or pairwise associations; however, neither is sufficient to reason about the intricate dynamics. Recently, Transformers have shown great power in unified modeling of pointwise representation and pairwise association, and we find that the self-attention weight distribution of each time point can embody rich association with the whole series. Our key observation is that, due to the rarity of anomalies, it is extremely difficult to build nontrivial associations from abnormal points to the whole series; thus, the anomalies' associations mainly concentrate on their adjacent time points. This adjacent-concentration bias implies an association-based criterion that is inherently distinguishable between normal and abnormal points, which we highlight through the \emph{Association Discrepancy}. Technically, we propose the \emph{Anomaly Transformer} with a new \emph{Anomaly-Attention} mechanism to compute the association discrepancy. A minimax strategy is devised to amplify the normal-abnormal distinguishability of the association discrepancy. The Anomaly Transformer achieves state-of-the-art results on six unsupervised time series anomaly detection benchmarks covering three applications: service monitoring, space & earth exploration, and water treatment.
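    A simplified sketch of the association-discrepancy idea: compare each time point's attention distribution over the series against a local Gaussian prior concentrated on its neighbourhood, scoring each point by a symmetric KL divergence (the random projections and fixed prior width below are illustrative; the paper learns both):

        import torch
        import torch.nn.functional as F

        T, d = 50, 16
        q, k = torch.randn(T, d), torch.randn(T, d)           # stand-ins for learned projections
        series_assoc = F.softmax(q @ k.T / d**0.5, dim=-1)    # (T, T) self-attention weights

        idx = torch.arange(T, dtype=torch.float)
        dist2 = (idx[:, None] - idx[None, :]) ** 2
        prior_assoc = F.softmax(-dist2 / (2 * 3.0**2), dim=-1)  # Gaussian prior, sigma = 3

        def kl(p, r):                                         # row-wise KL divergence
            return (p * (p.clamp_min(1e-12) / r.clamp_min(1e-12)).log()).sum(-1)

        assoc_disc = kl(prior_assoc, series_assoc) + kl(series_assoc, prior_assoc)
        print(assoc_disc.shape)                               # (T,) criterion per time point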
    DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning. (arXiv:2204.08499v3 [cs.LG] UPDATED)
    Coreset selection, which aims to select a subset of the most informative training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, active learning, etc. However, many existing coreset selection methods are not designed for deep learning, which may have high complexity and poor generalization performance. In addition, the recently proposed methods are evaluated on models, datasets, and settings of different complexities. To advance the research of coreset selection in deep learning, we contribute a comprehensive code library, namely DeepCore, and provide an empirical study on popular coreset selection methods on CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet datasets verify that, although various methods have advantages in certain experiment settings, random selection is still a strong baseline.
    Training OOD Detectors in their Natural Habitats. (arXiv:2202.03299v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is important for machine learning models deployed in the wild. Recent methods use auxiliary outlier data to regularize the model for improved OOD detection. However, these approaches make a strong distributional assumption that the auxiliary outlier data is completely separable from the in-distribution (ID) data. In this paper, we propose a novel framework that leverages wild mixture data, which naturally consists of both ID and OOD samples. Such wild data is abundant and arises freely upon deploying a machine learning classifier in its natural habitat. Our key idea is to formulate a constrained optimization problem and to show how to tractably solve it. Our learning objective maximizes the OOD detection rate, subject to constraints on the classification error of ID data and on the OOD error rate of ID examples. We extensively evaluate our approach on common OOD detection tasks and demonstrate superior performance.
    Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution. (arXiv:2009.14108v2 [cs.LG] UPDATED)
    Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only few episodes with high rewards are available as demonstrations since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and, thereby, drastically improves learning on few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY
    Using cognitive psychology to understand GPT-3. (arXiv:2206.14576v1 [cs.CL])
    We study GPT-3, a recent large language model, using tools from cognitive psychology. More specifically, we assess GPT-3's decision-making, information search, deliberation, and causal reasoning abilities on a battery of canonical experiments from the literature. We find that much of GPT-3's behavior is impressive: it solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multi-armed bandit task, and shows signatures of model-based reinforcement learning. Yet we also find that small perturbations to vignette-based tasks can lead GPT-3 vastly astray, that it shows no signatures of directed exploration, and that it fails miserably in a causal reasoning task. These results enrich our understanding of current large language models and pave the way for future investigations using tools from cognitive psychology to study increasingly capable and opaque artificial agents.  ( 2 min )
    Neural Integro-Differential Equations. (arXiv:2206.14282v1 [cs.LG])
    Modeling continuous dynamical systems from discretely sampled observations is a fundamental problem in data science. Often, such dynamics are the result of non-local processes that present an integral over time. As such, these systems are modeled with Integro-Differential Equations (IDEs); generalizations of differential equations that comprise both an integral and a differential component. For example, brain dynamics are not accurately modeled by differential equations since their behavior is non-Markovian, i.e. dynamics are in part dictated by history. Here, we introduce the Neural IDE (NIDE), a framework that models ordinary and integral components of IDEs using neural networks. We test NIDE on several toy and brain activity datasets and demonstrate that NIDE outperforms other models, including Neural ODE. These tasks include time extrapolation as well as predicting dynamics from unseen initial conditions, which we test on whole-cortex activity recordings in freely behaving mice. Further, we show that NIDE can decompose dynamics into its Markovian and non-Markovian constituents, via the learned integral operator, which we test on fMRI brain activity recordings of people on ketamine. Finally, the integrand of the integral operator provides a latent space that gives insight into the underlying dynamics, which we demonstrate on wide-field brain imaging recordings. Altogether, NIDE is a novel approach that enables modeling of complex non-local dynamics with neural networks.  ( 3 min )
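    Schematically, an IDE pairs a local differential term with a memory integral over the trajectory's history, $dy/dt = f(y(t)) + \int_0^t K(t,s)\, g(y(s))\, ds$. The Euler-scheme sketch below (with networks and step sizes of our own choosing, not the paper's solver) shows one way such a system can be rolled out:

        import torch
        import torch.nn as nn

        f = nn.Linear(4, 4)        # local (Markovian) component
        g = nn.Linear(4, 4)        # integrand of the non-local (memory) component
        K = nn.Linear(2, 1)        # scalar kernel K(t, s)

        dt, steps = 0.05, 40
        y, history = torch.zeros(4), []
        for n in range(steps):
            t = n * dt
            history.append(y)
            # Riemann-sum approximation of the memory integral over the stored history
            mem = sum(K(torch.tensor([t, m * dt])) * g(ym)
                      for m, ym in enumerate(history)) * dt
            y = y + dt * (f(y) + mem)
        print(y)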
    Deep Neural Networks and Tabular Data: A Survey. (arXiv:2110.01889v3 [cs.LG] UPDATED)
    Heterogeneous tabular data are the most commonly used form of data and are essential for numerous critical and computationally demanding applications. On homogeneous data sets, deep neural networks have repeatedly shown excellent performance and have therefore been widely adopted. However, their adaptation to tabular data for inference or data generation tasks remains challenging. To facilitate further progress in the field, this work provides an overview of state-of-the-art deep learning methods for tabular data. We categorize these methods into three groups: data transformations, specialized architectures, and regularization models. For each of these groups, our work offers a comprehensive overview of the main approaches. Moreover, we discuss deep learning approaches for generating tabular data, and we also provide an overview of strategies for explaining deep models on tabular data. Thus, our first contribution is to address the main research streams and existing methodologies in the mentioned areas, while highlighting relevant challenges and open research questions. Our second contribution is to provide an empirical comparison of traditional machine learning methods with eleven deep learning approaches across five popular real-world tabular data sets of different sizes and with different learning objectives. Our results, which we have made publicly available as competitive benchmarks, indicate that algorithms based on gradient-boosted tree ensembles still mostly outperform deep learning models on supervised learning tasks, suggesting that the research progress on competitive deep learning models for tabular data is stagnating. To the best of our knowledge, this is the first in-depth overview of deep learning approaches for tabular data; as such, this work can serve as a valuable starting point to guide researchers and practitioners interested in deep learning with tabular data.
    Locally Interpretable One-Class Anomaly Detection for Credit Card Fraud Detection. (arXiv:2108.02501v3 [cs.LG] UPDATED)
    For the highly imbalanced credit card fraud detection problem, most existing methods either use data augmentation methods or conventional machine learning models, while neural network-based anomaly detection approaches are lacking. Furthermore, few studies have employed AI interpretability tools to investigate the feature importance of transaction data, which is crucial for the black-box fraud detection module. Considering these two points together, we propose a novel anomaly detection framework for credit card fraud detection as well as a model-explaining module responsible for prediction explanations. The fraud detection model is composed of two deep neural networks, which are trained in an unsupervised and adversarial manner. Precisely, the generator is an AutoEncoder aiming to reconstruct genuine transaction data, while the discriminator is a fully-connected network for fraud detection. The explanation module has three white-box explainers in charge of interpretations of the AutoEncoder, discriminator, and the whole detection model, respectively. Experimental results show the state-of-the-art performances of our fraud detection model on the benchmark dataset compared with baselines. In addition, prediction analyses by three explainers are presented, offering a clear perspective on how each feature of an instance of interest contributes to the final model output.
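    The reconstruction-error principle behind the AutoEncoder component can be sketched as follows (layer sizes and data are illustrative, and this omits the paper's adversarial discriminator): train on genuine transactions only, then score inputs by how poorly the model reconstructs them:

        import torch
        import torch.nn as nn

        ae = nn.Sequential(
            nn.Linear(30, 8), nn.ReLU(),   # encoder (30 transaction features)
            nn.Linear(8, 30),              # decoder
        )
        opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

        genuine = torch.randn(256, 30)     # stand-in for genuine transactions
        for _ in range(100):
            loss = (ae(genuine) - genuine).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

        def anomaly_score(x):              # higher = more anomalous
            with torch.no_grad():
                return (ae(x) - x).pow(2).mean(dim=1)

        print(anomaly_score(torch.randn(4, 30)))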
    MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment. (arXiv:2204.01345v2 [eess.AS] UPDATED)
    The acoustic environment can degrade speech quality during communication (e.g., video call, remote presentation, outside voice recording), and its impact is often unknown. Objective metrics for speech quality have proven challenging to develop given the multi-dimensionality of factors that affect speech quality and the difficulty of collecting labeled data. Hypothesizing the impact of acoustics on speech quality, this paper presents MOSRA: a non-intrusive multi-dimensional speech quality metric that can predict room acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean opinion score (MOS) for speech quality. By explicitly optimizing the model to learn these room acoustics parameters, we can extract more informative features and improve the generalization for the MOS task when the training data is limited. Furthermore, we also show that this joint training method enhances the blind estimation of room acoustics, improving the performance of current state-of-the-art models. An additional side-effect of this joint prediction is the improvement in the explainability of the predictions, which is a valuable feature for many applications.  ( 2 min )
    Competence-based Multimodal Curriculum Learning for Medical Report Generation. (arXiv:2206.14579v1 [cs.CL])
    Medical report generation, which aims to produce long and coherent descriptions of medical images, has attracted growing research interest recently. Unlike general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) serious data bias and 2) limited medical data. To alleviate the data bias and make the best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner. First, CMCL estimates the difficulty of each training instance and evaluates the competence of the current model; second, CMCL selects the most suitable batch of training instances considering the current model's competence. By iterating the above two steps, CMCL can gradually improve the model's performance. Experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.  ( 2 min )
    QuantumFed: A Federated Learning Framework for Collaborative Quantum Training. (arXiv:2106.09109v4 [cs.LG] UPDATED)
    With the fast development of quantum computing and deep learning, quantum neural networks have attracted great attention recently. By leveraging the power of quantum computing, deep neural networks can potentially overcome the computational power limitations of classical machine learning. However, when multiple quantum machines wish to train a global model using the local data on each machine, it may be very difficult to copy the data onto one machine and train the model there. Therefore, a collaborative quantum neural network framework is necessary. In this article, we borrow the core idea of federated learning to propose QuantumFed, a quantum federated learning framework in which multiple quantum nodes with local quantum data train a model together. Our experiments show the feasibility and robustness of our framework.
    Depth-2 Neural Networks Under a Data-Poisoning Attack. (arXiv:2005.01699v3 [cs.LG] UPDATED)
    In this work, we study the possibility of defending against data-poisoning attacks while training a shallow neural network in a regression setup. We focus on doing supervised learning for a class of depth-2 finite-width neural networks, which includes single-filter convolutional networks. In this class of networks, we attempt to learn the network weights in the presence of a malicious oracle doing stochastic, bounded and additive adversarial distortions on the true output during training. For the non-gradient stochastic algorithm that we construct, we prove worst-case near-optimal trade-offs among the magnitude of the adversarial attack, the weight approximation accuracy, and the confidence achieved by the proposed algorithm. As our algorithm uses mini-batching, we analyze how the mini-batch size affects convergence. We also show how to utilize the scaling of the outer layer weights to counter output-poisoning attacks depending on the probability of attack. Lastly, we give experimental evidence demonstrating how our algorithm outperforms stochastic gradient descent under different input data distributions, including instances of heavy-tailed distributions.
    Deep Policies for Online Bipartite Matching: A Reinforcement Learning Approach. (arXiv:2109.10380v2 [cs.LG] UPDATED)
    The challenge in the widely applicable online matching problem lies in making irrevocable assignments while there is uncertainty about future inputs. Most theoretically-grounded policies are myopic or greedy in nature. In real-world applications where the matching process is repeated on a regular basis, the underlying data distribution can be leveraged for better decision-making. We present an end-to-end Reinforcement Learning framework for deriving better matching policies based on trial-and-error on historical data. We devise a set of neural network architectures, design feature representations, and empirically evaluate them across two online matching problems: Edge-Weighted Online Bipartite Matching and Online Submodular Bipartite Matching. We show that most of the learning approaches perform consistently better than classical baseline algorithms on four synthetic and real-world datasets. On average, our proposed models improve the matching quality by 3-10% on a variety of synthetic and real-world datasets. Our code is publicly available at https://github.com/lyeskhalil/CORL.
    When Do Extended Physics-Informed Neural Networks (XPINNs) Improve Generalization?. (arXiv:2109.09444v5 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have become a popular choice for solving high-dimensional partial differential equations (PDEs) due to their excellent approximation power and generalization ability. Recently, Extended PINNs (XPINNs) based on domain decomposition methods have attracted considerable attention due to their effectiveness in modeling multiscale and multiphysics problems and their parallelization. However, theoretical understanding of their convergence and generalization properties remains unexplored. In this study, we take an initial step towards understanding how and when XPINNs outperform PINNs. Specifically, for general multi-layer PINNs and XPINNs, we first provide a prior generalization bound via the complexity of the target functions in the PDE problem, and a posterior generalization bound via the posterior matrix norms of the networks after optimization. Moreover, based on our bounds, we analyze the conditions under which XPINNs improve generalization. Concretely, our theory shows that the key building block of XPINN, namely the domain decomposition, introduces a tradeoff for generalization. On the one hand, XPINNs decompose the complex PDE solution into several simple parts, which decreases the complexity needed to learn each part and boosts generalization. On the other hand, decomposition leads to less training data being available in each subdomain, and hence such models are typically prone to overfitting and may become less generalizable. Empirically, we choose five PDEs to show when XPINNs perform better than, similar to, or worse than PINNs, thereby demonstrating and justifying our new theory.
    Reinforcement Learning for Datacenter Congestion Control. (arXiv:2102.09337v2 [cs.LG] UPDATED)
    We approach the task of network congestion control in datacenters using Reinforcement Learning (RL). Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. Evidently, the most popular recent deployments rely on rule-based heuristics that are tested on a predetermined set of benchmarks. Consequently, these heuristics do not generalize well to newly-seen scenarios. Contrarily, we devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We overcome challenges such as partial-observability, non-stationarity, and multi-objectiveness. We further propose a policy gradient algorithm that leverages the analytical structure of the reward function to approximate its derivative and improve stability. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training. Our experiments, conducted on a realistic simulator that emulates communication networks' behavior, exhibit improved performance concurrently on the multiple considered metrics compared to the popular algorithms deployed today in real datacenters. Our algorithm is being productized to replace heuristics in some of the largest datacenters in the world.
    CoMoGAN: continuous model-guided image-to-image translation. (arXiv:2103.06879v3 [cs.CV] UPDATED)
    CoMoGAN is a continuous GAN relying on the unsupervised reorganization of the target data on a functional manifold. To that end, we introduce a new Functional Instance Normalization layer and residual mechanism, which together disentangle image content from position on the target manifold. We rely on naive physics-inspired models to guide the training while allowing private model/translation features. CoMoGAN can be used with any GAN backbone and allows new types of image translation, such as cyclic image translation (e.g., timelapse generation) or detached linear translation. On all datasets, it outperforms existing methods. Our code is available at this http URL .
    Exploring the Latent Space of Autoencoders with Interventional Assays. (arXiv:2106.16091v2 [cs.LG] UPDATED)
    Autoencoders exhibit impressive abilities to embed the data manifold into a low-dimensional latent space, making them a staple of representation learning methods. However, without explicit supervision, which is often unavailable, the representation is usually uninterpretable, making analysis and principled progress challenging. We propose a framework, called latent responses, which exploits the locally contractive behavior exhibited by variational autoencoders to explore the learned manifold. More specifically, we develop tools to probe the representation using interventions in the latent space to quantify the relationships between latent variables. We extend the notion of disentanglement to take the learned generative process into account and consequently avoid the limitations of existing metrics that may rely on spurious correlations. Our analyses underscore the importance of studying the causal structure of the representation to improve performance on downstream tasks such as generation, interpolation, and inference of the factors of variation.
    Matching Learned Causal Effects of Neural Networks with Domain Priors. (arXiv:2111.12490v4 [cs.LG] UPDATED)
    A trained neural network can be interpreted as a structural causal model (SCM) that provides the effect of changing input variables on the model's output. However, if training data contains both causal and correlational relationships, a model that optimizes prediction accuracy may not necessarily learn the true causal relationships between input and output variables. On the other hand, expert users often have prior knowledge of the causal relationship between certain input variables and output from domain knowledge. Therefore, we propose a regularization method that aligns the learned causal effects of a neural network with domain priors, including both direct and total causal effects. We show that this approach can generalize to different kinds of domain priors, including monotonicity of causal effect of an input variable on output or zero causal effect of a variable on output for purposes of fairness. Our experiments on twelve benchmark datasets show its utility in regularizing a neural network model to maintain desired causal effects, without compromising on accuracy. Importantly, we also show that a model thus trained is robust and gets improved accuracy on noisy inputs.
    BiometryNet: Landmark-based Fetal Biometry Estimation from Standard Ultrasound Planes. (arXiv:2206.14678v1 [eess.IV])
    Fetal growth assessment from ultrasound is based on a few biometric measurements that are performed manually and assessed relative to the expected gestational age. Reliable biometry estimation depends on the precise detection of landmarks in standard ultrasound planes. Manual annotation can be a time-consuming and operator-dependent task, and may result in high measurement variability. Existing methods for automatic fetal biometry rely on initial automatic fetal structure segmentation followed by geometric landmark detection. However, segmentation annotations are time-consuming and may be inaccurate, and landmark detection requires developing measurement-specific geometric methods. This paper describes BiometryNet, an end-to-end landmark regression framework for fetal biometry estimation that overcomes these limitations. It includes a novel Dynamic Orientation Determination (DOD) method for enforcing measurement-specific orientation consistency during network training. DOD reduces variability in network training and increases landmark localization accuracy, thus yielding accurate and robust biometric measurements. To validate our method, we assembled a dataset of 3,398 ultrasound images from 1,829 subjects acquired at three clinical sites with seven different ultrasound devices. Comparison and cross-validation of three different biometric measurements on two independent datasets show that BiometryNet is robust and yields accurate measurements whose errors are lower than the clinically permissible errors, outperforming other existing automated biometry estimation methods. Code is available at https://github.com/netanellavisdris/fetalbiometry.  ( 3 min )
    An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates. (arXiv:2206.14707v1 [cs.LG])
    We formalize and study the natural approach of designing convex surrogate loss functions via embeddings, for problems such as classification, ranking, or structured prediction. In this approach, one embeds each of the finitely many predictions (e.g. rankings) as a point in $R^d$, assigns the original loss values to these points, and "convexifies" the loss in some way to obtain a surrogate. We establish a strong connection between this approach and polyhedral (piecewise-linear convex) surrogate losses: every discrete loss is embedded by some polyhedral loss, and every polyhedral loss embeds some discrete loss. Moreover, an embedding gives rise to a consistent link function as well as linear surrogate regret bounds. Our results are constructive, as we illustrate with several examples. In particular, our framework gives succinct proofs of consistency or inconsistency for various polyhedral surrogates in the literature, and for inconsistent surrogates, it further reveals the discrete losses for which these surrogates are consistent. We go on to show additional structure of embeddings, such as the equivalence of embedding and matching Bayes risks, and the equivalence of various notions of non-redundancy. Using these results, we establish that indirect elicitation, a necessary condition for consistency, is also sufficient when working with polyhedral surrogates.
    Representation Topology Divergence: A Method for Comparing Neural Network Representations. (arXiv:2201.00058v2 [cs.LG] UPDATED)
    Comparison of data representations is a complex multi-aspect problem that has not enjoyed a complete solution yet. We propose a method for comparing two data representations. We introduce the Representation Topology Divergence (RTD), measuring the dissimilarity in multi-scale topology between two point clouds of equal size with a one-to-one correspondence between points. The data point clouds are allowed to lie in different ambient spaces. The RTD is one of the few TDA-based practical methods applicable to real machine learning datasets. Experiments show that the proposed RTD agrees with the intuitive assessment of data representation similarity and is sensitive to its topological structure. We apply RTD to gain insights on neural networks representations in computer vision and NLP domains for various problems: training dynamics analysis, data distribution shift, transfer learning, ensemble learning, disentanglement assessment.
    Visual Foresight With a Local Dynamics Model. (arXiv:2206.14802v1 [cs.RO])
    Model-free policy learning has been shown to be capable of learning manipulation policies which can solve long-time horizon tasks using single-step manipulation primitives. However, training these policies is a time-consuming process requiring large amounts of data. We propose the Local Dynamics Model (LDM) which efficiently learns the state-transition function for these manipulation primitives. By combining the LDM with model-free policy learning, we can learn policies which can solve complex manipulation tasks using one-step lookahead planning. We show that the LDM is both more sample-efficient and outperforms other model architectures. When combined with planning, we can outperform other model-based and model-free policies on several challenging manipulation tasks in simulation.
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v2 [cs.LG] UPDATED)
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.
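    Point (1) above is easy to see in a one-state toy MDP: adding a constant offset c to the value estimate changes the Bellman error by only (1 - gamma) * c, so a value function far from the truth can still look nearly self-consistent:

        gamma, r = 0.99, 1.0
        v_true = r / (1 - gamma)                       # true value: 100.0

        v_hat = v_true + 50.0                          # estimate with a large constant offset
        bellman_error = v_hat - (r + gamma * v_hat)    # one-step self-consistency residual
        value_error = v_hat - v_true

        print(bellman_error)   # (1 - gamma) * 50 ~ 0.5 -> small
        print(value_error)     # 50.0                   -> large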
    Deep Multiple Instance Learning For Forecasting Stock Trends Using Financial News. (arXiv:2206.14452v1 [cs.LG])
    Financial news articles are a major source of information that correlates with fluctuations in stock trends. In this paper, we investigate the influence of financial news on stock trends from a multi-instance view. The intuition behind this is based on the uncertainty of varying intervals of news occurrences and the lack of annotation for every single piece of financial news. Under the scenario of Multiple Instance Learning (MIL), where training instances are arranged in bags and a label is assigned to the entire bag instead of to individual instances, we develop a flexible and adaptive multi-instance learning model and evaluate its ability in directional movement forecasting of the Standard & Poor's 500 index on a financial news dataset. Specifically, we treat each trading day as one bag, with the news items occurring on that trading day as the instances in the bag. Experiment results demonstrate that our proposed multi-instance-based framework achieves outstanding trend prediction accuracy compared with other state-of-the-art approaches and baselines.  ( 2 min )
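    A minimal sketch of the bag-level setup described above (the attention-pooling architecture here is our own illustrative choice, not necessarily the paper's model): each trading day is a bag of news embeddings pooled into one vector and classified with a single day-level label:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MILDayClassifier(nn.Module):
            def __init__(self, dim=64):
                super().__init__()
                self.attn = nn.Linear(dim, 1)          # instance attention scores
                self.head = nn.Linear(dim, 2)          # index up / down

            def forward(self, news):                   # news: (n_articles, dim)
                w = torch.softmax(self.attn(news), dim=0)
                bag = (w * news).sum(dim=0)            # weighted bag representation
                return self.head(bag)

        model = MILDayClassifier()
        day_bag = torch.randn(7, 64)                   # 7 news embeddings for one trading day
        logits = model(day_bag)
        loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
        loss.backward()
        print(logits)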
    Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning. (arXiv:2206.14666v1 [cs.LG])
    We propose a novel framework to solve risk-sensitive reinforcement learning (RL) problems where the agent optimises time-consistent dynamic spectral risk measures. Based on the notion of conditional elicitability, our methodology constructs (strictly consistent) scoring functions that are used as penalizers in the estimation procedure. Our contribution is threefold: we (i) devise an efficient approach to estimate a class of dynamic spectral risk measures with deep neural networks, (ii) prove that these dynamic spectral risk measures may be approximated to any arbitrary accuracy using deep neural networks, and (iii) develop a risk-sensitive actor-critic algorithm that uses full episodes and does not require any additional nested transitions. We compare our conceptually improved reinforcement learning algorithm with the nested simulation approach and illustrate its performance in two settings: statistical arbitrage and portfolio allocation on both simulated and real data.
    3D-Aware Video Generation. (arXiv:2206.14797v1 [cs.CV])
    Generative models have emerged as an essential building block for many image synthesis and editing tasks. Recent advances in this field have also enabled high-quality 3D or video content to be generated that exhibits either multi-view or temporal consistency. With our work, we explore 4D generative adversarial networks (GANs) that learn unconditional generation of 3D-aware videos. By combining neural implicit representations with a time-aware discriminator, we develop a GAN framework that synthesizes 3D video supervised only with monocular videos. We show that our method learns a rich embedding of decomposable 3D structures and motions that enables new visual effects of spatio-temporal renderings while producing imagery with quality comparable to that of existing 3D or video GANs.
    Online vs. Offline Adaptive Domain Randomization Benchmark. (arXiv:2206.14661v1 [cs.RO])
    Physics simulators have shown great promise for conveniently learning reinforcement learning policies in safe, unconstrained environments. However, transferring the acquired knowledge to the real world can be challenging due to the reality gap. To this end, several methods have been recently proposed to automatically tune simulator parameters with posterior distributions given real data, for use with domain randomization at training time. These approaches have been shown to work for various robotic tasks under different settings and assumptions. Nevertheless, existing literature lacks a thorough comparison of existing adaptive domain randomization methods with respect to transfer performance and real-data efficiency. In this work, we present an open benchmark for both offline and online methods (SimOpt, BayRn, DROID, DROPO), to shed light on which are most suitable for each setting and task at hand. We found that online methods are limited by the quality of the currently learned policy for the next iteration, while offline methods may sometimes fail when replaying trajectories in simulation with open-loop commands. The code used will be released at https://github.com/gabrieletiboni/adr-benchmark.
    SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning. (arXiv:2102.11075v3 [cs.LG] UPDATED)
    In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works considered either aleatory or epistemic risk individually, or as an additive combination. We prove that the additive formulation is a particular case of the composite risk when the epistemic risk measure is replaced with expectation. Thus, the composite risk is more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, based on ensemble bootstrapping and distributional RL for representing epistemic and aleatory uncertainty respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better and, when used with composite risk estimates, demonstrates higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v2 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors: $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the overparametrized region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra. (arXiv:2011.04125v7 [cs.DS] UPDATED)
    We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression that are comparable to their quantum analogues. De-quantizing such algorithms has received a flurry of attention in recent years; we obtain sharper bounds for these problems. More significantly, we achieve these improvements by arguing that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise; these are powerful and standard techniques in randomized numerical linear algebra. With this recognition, we are able to employ the large body of work in numerical linear algebra to obtain algorithms for these problems that are simpler or faster (or both) than existing approaches. Our experiments demonstrate that the proposed data structures also work well on real-world datasets.
    Manifold Topology Divergence: a Framework for Comparing Data Manifolds. (arXiv:2106.04024v2 [cs.LG] CROSS LISTED)
    We develop a framework for comparing data manifolds, aimed, in particular, towards the evaluation of deep generative models. We describe a novel tool, Cross-Barcode(P,Q), that, given a pair of distributions in a high-dimensional space, tracks multiscale topological discrepancies between the manifolds on which the distributions are concentrated. Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence) and apply it to assess the performance of deep generative models in various domains: images, 3D-shapes, time-series, and on different datasets: MNIST, Fashion MNIST, SVHN, CIFAR10, FFHQ, chest X-ray images, market stock data, ShapeNet. We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance. Our algorithm scales well (essentially linearly) with the increase of the dimension of the ambient high-dimensional space. It is one of the first TDA-based practical methodologies that can be applied universally to datasets of different sizes and dimensions, including the ones on which the most recent GANs in the visual domain are trained. The proposed method is domain agnostic and does not rely on pre-trained networks.
    Traffic Management of Autonomous Vehicles using Policy Based Deep Reinforcement Learning and Intelligent Routing. (arXiv:2206.14608v1 [cs.LG])
    Deep Reinforcement Learning (DRL) uses diverse, unstructured data and makes RL capable of learning complex policies in high-dimensional environments. Intelligent Transportation Systems (ITS) based on Autonomous Vehicles (AVs) offer an excellent playground for policy-based DRL. Deep learning architectures solve the computational challenges of traditional algorithms while helping the real-world adoption and deployment of AVs. One of the main challenges in AV implementation is that it can worsen traffic congestion on roads if not reliably and efficiently managed. Considering each vehicle's holistic effect and using efficient and reliable techniques can genuinely help optimise traffic flow management and congestion reduction. For this purpose, we propose an intelligent traffic control system that deals with complex traffic congestion scenarios at and behind intersections. We propose a DRL-based signal control system that dynamically adjusts traffic signals according to the current congestion situation at intersections. To deal with congestion on the roads behind the intersection, we use a re-routing technique to load-balance the vehicles on the road network. To achieve the full benefits of the proposed approach, we break down the data silos and use all the data coming from sensors, detectors, vehicles and roads in combination to achieve sustainable results. We used the SUMO micro-simulator for our simulations. The significance of our proposed approach is evident in the results.
    Quantification of Deep Neural Network Prediction Uncertainties for VVUQ of Machine Learning Models. (arXiv:2206.14615v1 [cs.LG])
    Recent performance breakthroughs in Artificial Intelligence (AI) and Machine Learning (ML), especially advances in Deep Learning (DL), the availability of powerful, easy-to-use ML libraries (e.g., scikit-learn, TensorFlow, PyTorch), and increasing computational power have led to unprecedented interest in AI/ML among nuclear engineers. For physics-based computational models, Verification, Validation and Uncertainty Quantification (VVUQ) has been widely investigated and many methodologies have been developed. However, VVUQ of ML models has been studied relatively little, especially in nuclear engineering. In this work, we focus on UQ of ML models as a preliminary step of ML VVUQ, more specifically on Deep Neural Networks (DNNs), because they are the most widely used supervised ML algorithm for both regression and classification tasks. This work aims at quantifying the prediction (approximation) uncertainties of DNNs when they are used as surrogate models for expensive physical models. Three techniques for UQ of DNNs are compared, namely Monte Carlo Dropout (MCD), Deep Ensembles (DE) and Bayesian Neural Networks (BNNs). Two nuclear engineering examples are used to benchmark these methods: (1) time-dependent fission gas release data using the Bison code, and (2) void fraction simulation based on the BFBT benchmark using the TRACE code. It was found that the three methods typically require different DNN architectures and hyperparameters to optimize their performance. The UQ results also depend on the amount of training data available and the nature of the data. Overall, all three methods can provide reasonable estimations of the approximation uncertainties. The uncertainties are generally smaller when the mean predictions are close to the test data, while the BNN methods usually produce larger uncertainties than MCD and DE.
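    Of the three techniques compared, Monte Carlo Dropout is the simplest to sketch. The following is a minimal, hedged PyTorch illustration (the network is a stand-in, not the paper's surrogate model): dropout is kept active at inference, and the spread of repeated stochastic forward passes serves as the approximation uncertainty.

        import torch
        import torch.nn as nn

        # Stand-in regressor with dropout layers; not the paper's actual surrogate.
        surrogate = nn.Sequential(
            nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(64, 1),
        )

        def mc_dropout_predict(model, x, n_samples=100):
            # Return predictive mean and std from stochastic forward passes.
            model.train()  # keep dropout active at inference time
            with torch.no_grad():
                draws = torch.stack([model(x) for _ in range(n_samples)])
            return draws.mean(dim=0), draws.std(dim=0)

        x = torch.randn(5, 8)                 # five hypothetical input points
        mean, std = mc_dropout_predict(surrogate, x)
        print(mean.squeeze(), std.squeeze())  # std is the approximation uncertainty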
    Latent Combinational Game Design. (arXiv:2206.14203v1 [cs.LG])
    We present an approach for generating playable games that blend a given set of games in a desired combination using deep generative latent variable models. We refer to this approach as latent combinational game design -- latent since we use learned latent representations to perform blending, combinational since game blending is a combinational creativity process and game design since the approach generates novel, playable games. We use Gaussian Mixture Variational Autoencoders (GMVAEs), which use a mixture of Gaussians to model the VAE latent space. Through supervised training, each component learns to encode levels from one game and lets us define new, blended games as linear combinations of these learned components. This enables generating new games that blend the input games as well as control the relative proportions of each game in the blend. We also extend prior work using conditional VAEs to perform blending and compare against the GMVAE. Our results show that both models can generate playable blended games that blend the input games in the desired proportions.
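    A hedged sketch of the blending step follows; the decoder and component means are stand-ins, not the paper's trained GMVAE. The point is only the mechanism: a blended game is decoded from a convex combination of the learned per-game component means, with the weights controlling the relative proportion of each game.

        import torch

        def blend_games(decoder, mu, weights):
            # mu:      (n_games, latent_dim) learned Gaussian component means
            # weights: (n_games,) blend proportions, e.g. 0.7 game A / 0.3 game B
            weights = torch.as_tensor(weights, dtype=mu.dtype)
            weights = weights / weights.sum()           # normalise proportions
            z = (weights.unsqueeze(1) * mu).sum(dim=0)  # linear combination in latent space
            return decoder(z.unsqueeze(0))              # one blended level

        latent_dim = 32
        mu = torch.randn(2, latent_dim)                 # stand-in component means
        decoder = torch.nn.Linear(latent_dim, 10 * 10)  # stand-in level decoder
        blended = blend_games(decoder, mu, [0.7, 0.3])
        print(blended.shape)                            # (1, 100): one 10x10 level, flattened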
    Distilling Model Failures as Directions in Latent Space. (arXiv:2206.14754v1 [cs.LG])
    Existing methods for isolating hard subpopulations and spurious correlations in datasets often require human intervention. This can make these methods labor-intensive and dataset-specific. To address these shortcomings, we present a scalable method for automatically distilling a model's failure modes. Specifically, we harness linear classifiers to identify consistent error patterns, and, in turn, induce a natural representation of these failure modes as directions within the feature space. We demonstrate that this framework allows us to discover and automatically caption challenging subpopulations within the training dataset, and intervene to improve the model's performance on these subpopulations. Code available at https://github.com/MadryLab/failure-directions
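    The mechanism can be sketched in a few lines of scikit-learn (a hedged stand-in, not the authors' released code): fit a linear classifier in the model's feature space to separate errors from correct predictions; its weight vector is the failure direction, and the training examples most aligned with it form the candidate hard subpopulation.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Features here are random stand-ins for penultimate-layer embeddings.
        rng = np.random.default_rng(0)
        features = rng.normal(size=(1000, 128))   # per-example embeddings
        is_error = rng.random(1000) < 0.2         # stand-in error indicator

        clf = LogisticRegression(max_iter=1000).fit(features, is_error)
        failure_direction = clf.coef_[0]          # a direction in feature space

        # Examples most aligned with the failure direction form a candidate
        # hard subpopulation to inspect, caption, or intervene on.
        alignment = features @ failure_direction
        hard_subpopulation = np.argsort(alignment)[-50:]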
    Imaging the time series of one single referenced EEG electrode for Epileptic Seizures Risk Analysis. (arXiv:2206.14520v1 [cs.LG])
    The time series captured by a single scalp electrode (plus the reference electrode) of refractory epileptic patients is used to forecast seizure susceptibility. The time series is preprocessed, segmented, and each segment transformed into an image using three different known methods: Recurrence Plot, Gramian Angular Field, and Markov Transition Field. The likelihood of the occurrence of a seizure in a future predefined time window is computed by averaging the output of the softmax layer of a CNN, rather than taking the usual output of the classification layer. By thresholding this likelihood, seizure forecasting achieves better performance. Interestingly, for almost every patient, the best threshold was different from 50%. The results show that this technique can predict well for some seizures and patients. However, more tests, namely with more patients and more seizures, are needed to better understand the real potential of this technique.
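    A hedged sketch of the forecasting rule described above (the threshold value and toy numbers are ours): average the CNN's softmax probability of the pre-seizure class over the segments in the prediction window, then raise an alarm when the average crosses a patient-specific threshold.

        import numpy as np

        def seizure_likelihood(softmax_probs, threshold=0.37):
            # softmax_probs: (n_segments,) pre-seizure-class probability per segment.
            # threshold: patient-specific; the paper observes the best value is
            # usually not 50% (0.37 here is an arbitrary stand-in).
            likelihood = float(np.mean(softmax_probs))
            return likelihood, likelihood >= threshold

        probs = np.array([0.31, 0.52, 0.44, 0.61, 0.38])  # toy segment outputs
        likelihood, alarm = seizure_likelihood(probs)
        print(likelihood, alarm)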
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v2 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.
    EBMs vs. CL: Exploring Self-Supervised Visual Pretraining for Visual Question Answering. (arXiv:2206.14355v1 [cs.CV])
    The availability of clean and diverse labeled data is a major roadblock for training models on complex tasks such as visual question answering (VQA). The extensive work on large vision-and-language models has shown that self-supervised learning is effective for pretraining multimodal interactions. In this technical report, we focus on visual representations. We review and evaluate self-supervised methods to leverage unlabeled images and pretrain a model, which we then fine-tune on a custom VQA task that allows controlled evaluation and diagnosis. We compare energy-based models (EBMs) with contrastive learning (CL). While EBMs are growing in popularity, they lack an evaluation on downstream tasks. We find that both EBMs and CL can learn representations from unlabeled images that enable training a VQA model on very little annotated data. In a simple setting similar to CLEVR, we find that CL representations also improve systematic generalization, and even match the performance of representations from a larger, supervised, ImageNet-pretrained model. However, we find EBMs to be difficult to train because of instabilities and high variability in their results. Although EBMs prove useful for OOD detection, other results on supervised energy-based training and uncertainty calibration are largely negative. Overall, CL currently seems a preferable option over EBMs.
    An Auto-Regressive Formulation for Smoothing and Moving Mean with Exponentially Tapered Windows. (arXiv:2206.14749v1 [cs.LG])
    We investigate an auto-regressive formulation for the problem of smoothing time series by manipulating the inherent objective function of the traditional moving-mean smoothers. Not only do the auto-regressive smoothers enforce a higher degree of smoothing, but they are also just as efficient as the traditional moving means and can be optimized accordingly with respect to the input dataset. Interestingly, the auto-regressive models result in moving means with exponentially tapered windows.
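    As a hedged illustration of the connection (our toy code, not the paper's), the first-order auto-regressive smoother s[t] = (1 - alpha) * s[t-1] + alpha * x[t] unrolls into a moving mean whose weights decay as alpha * (1 - alpha)^k, i.e. an exponentially tapered window.

        import numpy as np

        def ar_smooth(x, alpha=0.2):
            # First-order auto-regressive smoother; unrolling the recursion gives
            # a moving mean with exponentially decaying window weights.
            s = np.empty_like(x, dtype=float)
            s[0] = x[0]
            for t in range(1, len(x)):
                s[t] = (1.0 - alpha) * s[t - 1] + alpha * x[t]
            return s

        x = np.sin(np.linspace(0, 6, 200)) + 0.3 * np.random.default_rng(0).normal(size=200)
        print(ar_smooth(x)[:5])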
    Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision. (arXiv:2206.14719v1 [cs.CL])
    Clinical trials are essential for drug development but are extremely expensive and time-consuming to conduct. It is beneficial to study similar historical trials when designing a clinical trial. However, lengthy trial documents and the lack of labeled data make trial similarity search difficult. We propose a zero-shot clinical trial retrieval method, Trial2Vec, which learns through self-supervision without annotating similar clinical trials. Specifically, the meta-structure of trial documents (e.g., title, eligibility criteria, target disease) along with clinical knowledge (e.g., the UMLS knowledge base https://www.nlm.nih.gov/research/umls/index.html) are leveraged to automatically generate contrastive samples. Besides, Trial2Vec encodes trial documents considering meta-structure, thus producing compact embeddings that aggregate multi-aspect information from the whole document. We show that our method yields medically interpretable embeddings by visualization, and it achieves a 15% average improvement over the best baselines on precision/recall for trial retrieval, evaluated on our 1,600 labeled trial pairs. In addition, we show that the pre-trained embeddings benefit the downstream trial outcome prediction task over 240k trials.
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v1 [cs.LG])
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantee for these methods on learning invariant mechanisms is lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria on judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests, and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.
    A Multilingual Dataset of COVID-19 Vaccination Attitudes on Twitter. (arXiv:2206.14619v1 [cs.CL])
    Vaccine hesitancy is considered one of the main causes of the stagnant uptake of COVID-19 vaccines in Europe and the US, where vaccines are sufficiently supplied. A fast and accurate grasp of public attitudes toward vaccination is critical to addressing vaccine hesitancy, and social media platforms have proved to be an effective source of public opinion. In this paper, we describe the collection and release of a dataset of tweets related to COVID-19 vaccines. This dataset consists of the IDs of 2,198,090 tweets collected from Western Europe, 17,934 of which are annotated with the originators' vaccination stances. Our annotation will facilitate using and developing data-driven models to extract vaccination attitudes from social media posts and thus further confirm the power of social media in public health surveillance. To lay the groundwork for future research, we not only perform statistical analysis and visualisation of our dataset, but also evaluate and compare the performance of established text-based benchmarks in vaccination stance extraction. We demonstrate one potential use of our data in practice in tracking the temporal changes of public COVID-19 vaccination attitudes.
    Probabilistic Models for Manufacturing Lead Times. (arXiv:2204.13792v2 [cs.LG] UPDATED)
    In this study, we utilize Gaussian processes, probabilistic neural networks, natural gradient boosting, and quantile-regression-augmented gradient boosting to model the lead times of laser manufacturing processes. We introduce probabilistic modelling in this domain and compare the models in terms of their different abilities. Beyond comparing the models on real-life data, our work has many use cases and substantial business value. Our results indicate that all of the models beat the company's estimation benchmark, which relies on domain experience, and show good calibration with the empirical frequencies.
    What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0. (arXiv:2206.14348v1 [cs.CL])
    Performance in natural language processing, and specifically for the question-answer task, is typically measured by comparing a model's most confident (primary) prediction to golden answers (the ground truth). We are making the case that it is also useful to quantify how close a model came to predicting a correct answer even for examples that failed. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank. We refer to secondary predictions as those ranking above 0 in descending confidence probability order. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent near successes to persistent extreme failures. We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM), that quantifies the proximity of failed predictions to the top choice made by the model. To develop some intuition and explore the applicability of these metrics we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Hugging Face hub. We first demonstrate that the GRIM is not directly correlated with the F1 and exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis by clustering failed predictions, and compare how they relate to other training diagnostics such as the EM and F1 scores. We finally suggest various research goals, such as broadening data collection for these metrics and their possible use in adversarial training.
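    Under simplified assumptions (a plain median instead of the paper's interpolated median, toy data), the two statistics can be sketched as follows.

        import numpy as np

        def golden_rank(ranked_predictions, gold_answers):
            # Rank (0 = most confident) of the first prediction matching a gold answer.
            for rank, pred in enumerate(ranked_predictions):
                if pred in gold_answers:
                    return rank
            return None  # should not happen if the answer space is exhaustive

        # Toy example: two questions, predictions sorted by descending confidence.
        examples = [
            (["in 1912", "1912", "April 1912"], {"1912"}),  # GR = 1 (failed primary)
            (["Paris", "in Paris", "France"],   {"Paris"}), # GR = 0 (primary correct)
        ]
        ranks = [golden_rank(preds, gold) for preds, gold in examples]

        # GRIM over *failed* examples (GR > 0); np.median is a simplified
        # stand-in for the paper's interpolated median.
        failed = [r for r in ranks if r is not None and r > 0]
        grim = float(np.median(failed)) if failed else 0.0
        print(ranks, grim)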
    Auto-Encoder-Extreme Learning Machine Model for Boiler NOx Emission Concentration Prediction. (arXiv:2206.14496v1 [cs.LG])
    An auto-encoder (AE) extreme learning machine (ELM) model, AE-ELM, is proposed to predict the NOx emission concentration, based on the combination of the mutual information (MI) algorithm, AE, and ELM. First, the importance of the practical variables is computed by the MI algorithm, and the mechanism is analyzed to determine the variables related to the NOx emission concentration. Then, the time-delay correlations between the selected variables and the NOx emission concentration are further analyzed to reconstruct the modeling data. Subsequently, the AE is applied to extract hidden features within the input variables. Finally, an ELM algorithm establishes the relationship between the NOx emission concentration and the deep features. The experimental results on practical data indicate that the proposed model shows promising performance compared to state-of-the-art models.
    PyEPO: A PyTorch-based End-to-End Predict-then-Optimize Library for Linear and Integer Programming. (arXiv:2206.14234v1 [math.OC])
    In deterministic optimization, it is typically assumed that all parameters of the problem are fixed and known. In practice, however, some parameters may be a priori unknown but can be estimated from historical data. A typical predict-then-optimize approach separates predictions and optimization into two stages. Recently, end-to-end predict-then-optimize has become an attractive alternative. In this work, we present the PyEPO package, a PyTorch-based end-to-end predict-then-optimize library in Python. To the best of our knowledge, PyEPO (pronounced like "pineapple" with a silent "n") is the first such generic tool for linear and integer programming with predicted objective function coefficients. It provides two base algorithms: the first is based on the convex surrogate loss function from the seminal work of Elmachtoub & Grigas (2021), and the second is based on the differentiable black-box solver approach of Vlastelica et al. (2019). PyEPO provides a simple interface for the definition of new optimization problems, the implementation of state-of-the-art predict-then-optimize training algorithms, the use of custom neural network architectures, and the comparison of end-to-end approaches with the two-stage approach. PyEPO enables us to conduct a comprehensive set of experiments comparing a number of end-to-end and two-stage approaches along axes such as prediction accuracy, decision quality, and running time on problems such as Shortest Path, Multiple Knapsack, and the Traveling Salesperson Problem. We discuss some empirical insights from these experiments which could guide future research. PyEPO and its documentation are available at https://github.com/khalil-research/PyEPO.
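    As a hedged, generic sketch of the first of the two base algorithms, here is the SPO+ surrogate loss of Elmachtoub & Grigas written against a stand-in solve oracle rather than PyEPO's actual interface; the backward pass uses the known subgradient 2(w*(c) - w*(2c_hat - c)).

        import torch

        def solve(costs):
            # Stand-in oracle: argmin_w c^T w over a toy feasible set
            # (pick exactly one of n items). Replace with a real solver.
            n = costs.shape[-1]
            w = torch.zeros(n)
            w[torch.argmin(costs)] = 1.0
            return w

        class SPOPlusLoss(torch.autograd.Function):
            # SPO+ surrogate loss of Elmachtoub & Grigas (per-sample sketch).

            @staticmethod
            def forward(ctx, pred_cost, true_cost):
                w_true = solve(true_cost)                   # w*(c)
                w_spo = solve(2.0 * pred_cost - true_cost)  # w*(2c_hat - c)
                ctx.save_for_backward(w_true, w_spo)
                return (-(2.0 * pred_cost - true_cost) @ w_spo
                        + 2.0 * pred_cost @ w_true - true_cost @ w_true)

            @staticmethod
            def backward(ctx, grad_output):
                w_true, w_spo = ctx.saved_tensors
                return grad_output * 2.0 * (w_true - w_spo), None

        pred = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
        true = torch.tensor([3.0, 1.0, 2.0])
        loss = SPOPlusLoss.apply(pred, true)
        loss.backward()
        print(loss.item(), pred.grad)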
    Benchmarking Bayesian Improved Surname Geocoding Against Machine Learning Methods. (arXiv:2206.14583v1 [cs.LG])
    Bayesian Improved Surname Geocoding (BISG) is the most popular method for proxying race/ethnicity in voter registration files that do not contain it. This paper benchmarks BISG against a range of previously untested machine learning alternatives, using voter files with self-reported race/ethnicity from California, Florida, North Carolina, and Georgia. This analysis yields three key findings. First, when given the exact same inputs, BISG and machine learning perform similarly for estimating aggregate racial/ethnic composition. Second, machine learning outperforms BISG at individual classification of race/ethnicity. Third, the performance of all methods varies substantially across states. These results suggest that pre-trained machine learning models are preferable to BISG for individual classification. Furthermore, mixed results at the precinct level and across states underscore the need for researchers to empirically validate their chosen race/ethnicity proxy in their populations of interest.
    Bottleneck Low-rank Transformers for Low-resource Spoken Language Understanding. (arXiv:2206.14318v1 [cs.CL])
    End-to-end spoken language understanding (SLU) systems benefit from pretraining on large corpora, followed by fine-tuning on application-specific data. The resulting models are too large for on-edge applications. For instance, BERT-based systems contain over 110M parameters. Observing that such models are overparameterized, we propose a lean transformer structure where the dimension of the attention mechanism is automatically reduced using group sparsity. We propose a variant where the learned attention subspace is transferred to an attention bottleneck layer. In a low-resource setting and without pre-training, the resulting compact SLU model achieves accuracies competitive with pre-trained large models.
    Evaluating Generative Patent Language Models. (arXiv:2206.14578v1 [cs.CL])
    This research aims to build generative language models in the patent domain and to evaluate the models from a human-centric perspective. The evaluation metric is the ratio of keystrokes that can be saved for a user in an autocomplete context, based on the predictions of the generative models. The performance of models of different sizes can also be evaluated with this metric by measuring it on a number of newly granted patents. On the basis of the metric, it is found that the largest model is not necessarily the best. Several models are pre-trained from scratch with a patent corpus and are released. The experiments in this manuscript focus on patent claims, but the ideas and implementation can be applied to other parts of a patent document. Furthermore, this research is motivated by measuring how closely the pre-trained language model can generate a newly granted patent claim. Conversely, the task is to measure the probabilities for the model to generate each token of text given the newly granted patent claim. In addition, this manuscript raises several legal implications for patent law for potential interdisciplinary research in the future. In particular, can a metric based on model prediction be a metric to measure the nonobviousness requirement in patent law?
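    A toy, hedged simulation of the keystroke-savings metric (the token-level counting rules here are ours and may differ from the paper's): the user accepts an autocomplete suggestion with one keystroke whenever it matches the next token, and otherwise types the token in full.

        def keystrokes_saved(claim_tokens, suggest, accept_cost=1):
            # Fraction of keystrokes saved when accepting correct suggestions.
            baseline = sum(len(tok) + 1 for tok in claim_tokens)  # typing everything
            typed = 0
            for i, tok in enumerate(claim_tokens):
                if suggest(claim_tokens[:i]) == tok:
                    typed += accept_cost       # one keystroke to accept the suggestion
                else:
                    typed += len(tok) + 1      # type the token plus a space
            return 1.0 - typed / baseline

        # Stand-in "model": a trivial suggester, in place of a generative LM.
        def dummy_suggest(prefix):
            return "the"

        claim = "a device comprising the housing and the sensor".split()
        print(keystrokes_saved(claim, dummy_suggest))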
    Spherical Channels for Modeling Atomic Interactions. (arXiv:2206.14331v1 [physics.chem-ph])
    Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges their neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate that, by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst 2020 dataset in both energy and force prediction for numerous tasks and metrics.
    TE2Rules: Extracting Rule Lists from Tree Ensembles. (arXiv:2206.14359v1 [cs.LG])
    Tree Ensemble (TE) models (e.g. Gradient Boosted Trees and Random Forests) often provide higher prediction performance compared to single decision trees. However, TE models generally lack transparency and interpretability, as humans have difficulty understanding their decision logic. This paper presents a novel approach to convert a TE trained for a binary classification task, to a rule list (RL) that is a global equivalent to the TE and is comprehensible for a human. This RL captures all necessary and sufficient conditions for decision making by the TE. Experiments on benchmark datasets demonstrate that, compared to state-of-the-art methods, (i) predictions from the RL generated by TE2Rules have high fidelity with respect to the original TE, (ii) the RL from TE2Rules has high interpretability measured by the number and the length of the decision rules, (iii) the run-time of TE2Rules algorithm can be reduced significantly at the cost of a slightly lower fidelity, and (iv) the RL is a fast alternative to the state-of-the-art rule-based instance-level outcome explanation techniques.
    Online Anomaly Detection Based On Reservoir Sampling and LOF for IoT devices. (arXiv:2206.14265v1 [cs.LG])
    The growing number of IoT devices and their use to monitor the operation of machines and equipment increase interest in anomaly detection algorithms running on the devices themselves. The difficulty, however, lies in the limitations of the computational and memory resources available on the devices. In the case of microcontrollers (MCUs), these are a few megabytes of program memory and several hundred kilobytes of working memory. Consequently, algorithms must be appropriately matched to the capabilities of the devices. In the paper, we analyse the processing pipeline for anomaly detection and the implementation of the Local Outlier Factor (LOF) algorithm on an MCU. We also show that it is possible to train such an algorithm directly on the device, which gives the solution great potential for use in real devices.
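    The reservoir-sampling half of the pipeline is easy to sketch (a generic Algorithm R, not the paper's firmware): it maintains a fixed-size uniform sample of the stream, which keeps the LOF reference set within an MCU's memory budget.

        import random

        class Reservoir:
            # Fixed-size uniform sample over an unbounded stream (Algorithm R).
            # Memory stays constant regardless of how many readings are seen.

            def __init__(self, capacity):
                self.capacity = capacity
                self.items = []
                self.seen = 0

            def add(self, x):
                self.seen += 1
                if len(self.items) < self.capacity:
                    self.items.append(x)
                else:
                    j = random.randrange(self.seen)  # keep x with prob capacity/seen
                    if j < self.capacity:
                        self.items[j] = x

        res = Reservoir(capacity=100)
        for reading in range(10_000):      # stand-in sensor stream
            res.add(reading)
        print(len(res.items), res.seen)    # 100 samples kept, 10000 seen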
    Why patient data cannot be easily forgotten?. (arXiv:2206.14541v1 [cs.LG])
    Rights provisioned within data protection regulations permit patients to request that knowledge about their information be eliminated by data holders. With the advent of AI learned on data, one can imagine that such rights can extend to requests for forgetting knowledge of patients' data within AI models. However, forgetting patients' imaging data from AI models is still an under-explored problem. In this paper, we study the influence of patient data on model performance and formulate two hypotheses for a patient's data: either they are common and similar to other patients, or they form edge cases, i.e. unique and rare cases. We show that it is not possible to easily forget patient data. We propose a targeted forgetting approach to perform patient-wise forgetting. Extensive experiments on the benchmark Automated Cardiac Diagnosis Challenge dataset showcase the improved performance of the proposed targeted forgetting approach as opposed to a state-of-the-art method.
    Overview of Deep Learning-based CSI Feedback in Massive MIMO Systems. (arXiv:2206.14383v1 [eess.SP])
    Many performance gains achieved by massive multiple-input and multiple-output depend on the accuracy of the downlink channel state information (CSI) at the transmitter (base station), which is usually obtained by estimating at the receiver (user terminal) and feeding back to the transmitter. The overhead of CSI feedback occupies substantial uplink bandwidth resources, especially when the number of the transmit antennas is large. Deep learning (DL)-based CSI feedback refers to CSI compression and reconstruction by a DL-based autoencoder and can greatly reduce feedback overhead. In this paper, a comprehensive overview of state-of-the-art research on this topic is provided, beginning with basic DL concepts widely used in CSI feedback and then categorizing and describing some existing DL-based feedback works. The focus is on novel neural network architectures and utilization of communication expert knowledge to improve CSI feedback accuracy. Works on bit-level CSI feedback and joint design of CSI feedback with other communication modules are also introduced, and some practical issues, including training dataset collection, online training, complexity, generalization, and standardization effect, are discussed. At the end of the paper, some challenges and potential research directions associated with DL-based CSI feedback in future wireless communication systems are identified.
    Reinforcement Learning in Medical Image Analysis: Concepts, Applications, Challenges, and Future Directions. (arXiv:2206.14302v1 [cs.CV])
    Motivation: Medical image analysis involves tasks to assist physicians in qualitative and quantitative analysis of lesions or anatomical structures, significantly improving the accuracy and reliability of diagnosis and prognosis. Traditionally, these tasks are finished by physicians or medical physicists and lead to two major problems: (i) low efficiency; (ii) bias from personal experience. In the past decade, many machine learning methods have been applied to accelerate and automate the image analysis process. Compared to the enormous deployments of supervised and unsupervised learning models, attempts to use reinforcement learning in medical image analysis are scarce. This review article could serve as the stepping-stone for related research. Significance: From our observation, though reinforcement learning has gradually gained momentum in recent years, many researchers in the medical analysis field find it hard to understand and deploy in clinics. One cause is the lack of well-organized review articles targeting readers without professional computer science backgrounds. Rather than providing a comprehensive list of all reinforcement learning models in medical image analysis, this paper may help readers learn how to formulate and solve their medical image analysis research as reinforcement learning problems. Approach & Results: We selected published articles from Google Scholar and PubMed. Considering the scarcity of related articles, we also included some outstanding recent preprints. The papers are carefully reviewed and categorized according to the type of image analysis task. We first review the basic concepts and popular models of reinforcement learning. Then we explore the applications of reinforcement learning models in landmark detection. Finally, we conclude the article by discussing the limitations of the reviewed reinforcement learning approaches and possible improvements.
    On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method. (arXiv:2206.14796v1 [cs.CL])
    Most works on modeling the conversation history in Conversational Question Answering (CQA) report a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g. from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach, and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy-to-plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness to the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
    Applications of Reinforcement Learning in Finance -- Trading with a Double Deep Q-Network. (arXiv:2206.14267v1 [cs.LG])
    This paper presents a Double Deep Q-Network algorithm for trading single assets, namely the E-mini S&P 500 continuous futures contract. We use a proven setup as the foundation for our environment with multiple extensions. The features of our trading agent are constantly being expanded to include additional assets such as commodities, resulting in four models. We also respond to environmental conditions, including costs and crises. Our trading agent is first trained for a specific time period and tested on new data and compared with the long-and-hold strategy as a benchmark (market). We analyze the differences between the various models and the in-sample/out-of-sample performance with respect to the environment. The experimental results show that the trading agent follows an appropriate behavior. It can adjust its policy to different circumstances, such as more extensive use of the neutral position when trading costs are present. Furthermore, the net asset value exceeded that of the benchmark, and the agent outperformed the market in the test set. We provide initial insights into the behavior of an agent in a financial domain using a DDQN algorithm. The results of this study can be used for further development.
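    A hedged sketch of the Double DQN update at the core of such an agent (the network sizes and the three-action short/neutral/long layout are illustrative assumptions): the online network selects the next action and the target network evaluates it, which reduces the overestimation bias of vanilla Q-learning.

        import torch
        import torch.nn as nn

        n_features, n_actions = 16, 3   # e.g. state features; short / neutral / long
        online = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
        target = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_actions))
        target.load_state_dict(online.state_dict())

        def double_dqn_loss(batch, gamma=0.99):
            # Online net selects the next action; target net evaluates it.
            s, a, r, s_next, done = batch
            q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                a_next = online(s_next).argmax(dim=1, keepdim=True)   # selection
                q_next = target(s_next).gather(1, a_next).squeeze(1)  # evaluation
                y = r + gamma * (1.0 - done) * q_next
            return nn.functional.mse_loss(q, y)

        # Toy batch: (states, actions, rewards, next states, done flags).
        batch = (torch.randn(32, n_features), torch.randint(0, n_actions, (32,)),
                 torch.randn(32), torch.randn(32, n_features), torch.zeros(32))
        print(double_dqn_loss(batch).item())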
    Fair Machine Learning in Healthcare: A Review. (arXiv:2206.14397v1 [cs.LG])
    Benefiting from the digitization of healthcare data and the development of computing power, machine learning methods are increasingly used in the healthcare domain. Fairness problems have been identified in machine learning for healthcare, resulting in an unfair allocation of limited healthcare resources or excessive health risks for certain groups. Therefore, addressing the fairness problems has recently attracted increasing attention from the healthcare community. However, the intersection of machine learning for healthcare and fairness in machine learning remains understudied. In this review, we build the bridge by exposing fairness problems, summarizing possible biases, sorting out mitigation methods and pointing out challenges along with opportunities for the future.
    Predicting the Need for Blood Transfusion in Intensive Care Units with Reinforcement Learning. (arXiv:2206.14198v1 [cs.LG])
    As critically ill patients frequently develop anemia or coagulopathy, transfusion of blood products is a frequent intervention in Intensive Care Units (ICUs). However, inappropriate transfusion decisions made by physicians are often associated with an increased risk of complications and higher hospital costs. In this work, we aim to develop a decision support tool that uses available patient information for transfusion decision-making on three common blood products (red blood cells, platelets, and fresh frozen plasma). To this end, we adopt an off-policy batch reinforcement learning (RL) algorithm, namely discretized Batch Constrained Q-learning, to determine the best action (transfusion or not) given observed patient trajectories. Simultaneously, we consider different state representation approaches and reward design mechanisms to evaluate their impacts on policy learning. Experiments are conducted on two real-world critical care datasets: the MIMIC-III and the UCSF. Results demonstrate that policy recommendations on transfusion achieved comparable matching against true hospital policies via accuracy and weighted importance sampling evaluations on the MIMIC-III dataset. Furthermore, a combination of transfer learning (TL) and RL on the data-scarce UCSF dataset can provide up to a 17.02% improvement in terms of accuracy, and up to 18.94% and 21.63% improvements in jump-start and asymptotic performance in terms of weighted importance sampling, averaged over three transfusion tasks. Finally, simulations on transfusion decisions suggest that the transferred RL policy could reduce patients' estimated 28-day mortality rate by 2.74% and the estimated acuity rate by 1.18% on the UCSF dataset.
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v1 [cs.LG])
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
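    A hedged sketch in the spirit of the proposed self-supervised metric (stand-in embeddings here; the paper clusters self-supervised image embeddings): score each example by its distance to the nearest k-means centroid and, when data is plentiful, keep the hardest examples.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        embeddings = rng.normal(size=(5000, 64))   # stand-in for SSL image embeddings

        kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)
        dist_to_centroid = np.min(kmeans.transform(embeddings), axis=1)

        keep_fraction = 0.7                        # target pruned-dataset size
        n_keep = int(keep_fraction * len(embeddings))
        kept = np.argsort(dist_to_centroid)[-n_keep:]   # hardest examples survive

        # The paper's key caveat: with scarce data, keeping *easy* examples
        # (np.argsort(...)[:n_keep]) is the better policy.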
    ECG Heartbeat classification using deep transfer learning with Convolutional Neural Network and STFT technique. (arXiv:2206.14200v1 [cs.LG])
    Electrocardiogram (ECG) is a simple non-invasive measure to identify heart-related issues such as irregular heartbeats, known as arrhythmias. While artificial intelligence and machine learning are being utilized in a wide range of healthcare-related applications and datasets, many arrhythmia classifiers using deep learning methods have been proposed in recent years. However, the sizes of the available datasets from which to build and assess machine learning models are often very small, and the lack of well-annotated public ECG datasets is evident. In this paper, we propose a deep transfer learning framework that is aimed at performing classification on a small training dataset. The proposed method is to fine-tune a general-purpose image classifier, ResNet-18, on the MIT-BIH arrhythmia dataset in accordance with the AAMI EC57 standard. This paper further investigates many existing deep learning models that have failed to avoid data leakage against AAMI recommendations. We compare how different data split methods impact model performance. This comparison study implies that future work in arrhythmia classification should follow the AAMI EC57 standard when using any dataset, including the MIT-BIH arrhythmia dataset.
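    A hedged sketch of the pipeline (segment length, STFT parameters, and preprocessing are illustrative assumptions): a heartbeat segment becomes a time-frequency image via the STFT, and an ImageNet-pretrained ResNet-18 is fine-tuned after replacing its head for the five AAMI EC57 classes.

        import torch
        import torch.nn as nn
        from scipy.signal import stft
        from torchvision.models import resnet18

        def heartbeat_to_image(signal, fs=360):          # MIT-BIH sampling rate
            _, _, z = stft(signal, fs=fs, nperseg=64)
            img = torch.tensor(abs(z), dtype=torch.float32)
            return img.unsqueeze(0).repeat(3, 1, 1)      # grey -> 3-channel input

        model = resnet18(weights="IMAGENET1K_V1")        # general-purpose pretrain
        model.fc = nn.Linear(model.fc.in_features, 5)    # AAMI EC57: 5 classes

        x = heartbeat_to_image(torch.randn(720).numpy()) # toy 2-second segment
        logits = model(x.unsqueeze(0))
        print(logits.shape)                              # (1, 5)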
    Knowledge Graph Fusion for Language Model Fine-tuning. (arXiv:2206.14574v1 [cs.CL])
    Language Models such as BERT have grown in popularity due to their ability to be pre-trained and perform robustly on a wide range of Natural Language Processing tasks. Often seen as an evolution over traditional word embedding techniques, they can produce semantic representations of text, useful for tasks such as semantic similarity. However, state-of-the-art models often have high computational requirements and lack global context or domain knowledge which is required for complete language understanding. To address these limitations, we investigate the benefits of knowledge incorporation into the fine-tuning stages of BERT. An existing K-BERT model, which enriches sentences with triplets from a Knowledge Graph, is adapted for the English language and extended to inject contextually relevant information into sentences. As a side-effect, changes made to K-BERT for accommodating the English language also extend to other word-based languages. Experiments conducted indicate that injected knowledge introduces noise. We see statistically significant improvements for knowledge-driven tasks when this noise is minimised. We show evidence that, given the appropriate task, modest injection with relevant, high-quality knowledge is most performant.
    Generative Anomaly Detection for Time Series Datasets. (arXiv:2206.14597v1 [cs.LG])
    Traffic congestion anomaly detection is of paramount importance in intelligent traffic systems. The goals of transportation agencies are two-fold: to monitor the general traffic conditions in the area of interest and to locate road segments under abnormal congestion states. Modeling congestion patterns can achieve these goals for citywide roadways, which amounts to learning the distribution of multivariate time series (MTS). However, existing works are either not scalable or unable to capture the spatial-temporal information in MTS simultaneously. To this end, we propose a principled and comprehensive framework consisting of a data-driven generative approach that can perform tractable density estimation for detecting traffic anomalies. Our approach first clusters segments in the feature space and then uses conditional normalizing flow to identify anomalous temporal snapshots at the cluster level in an unsupervised setting. Then, we identify anomalies at the segment level by using a kernel density estimator on the anomalous cluster. Extensive experiments on synthetic datasets show that our approach significantly outperforms several state-of-the-art congestion anomaly detection and diagnosis methods in terms of Recall and F1-Score. We also use the generative model to sample labeled data, which can train classifiers in a supervised setting, alleviating the lack of labeled data for anomaly detection in sparse settings.
    Cooperative Retriever and Ranker in Deep Recommenders. (arXiv:2206.14649v1 [cs.IR])
    Deep recommender systems jointly leverage the retrieval and ranking operations to generate the recommendation result. The retriever aims to select a small set of relevant candidates from the entire item pool with high efficiency; while the ranker, usually more precise but time-consuming, is supposed to identify the best items out of the retrieved candidates with high precision. However, the retriever and ranker are usually trained in poorly-cooperative ways, leading to limited recommendation performance when working as an entirety. In this work, we propose a novel DRS training framework CoRR (short for Cooperative Retriever and Ranker), where the retriever and ranker can be mutually reinforced. On one hand, the retriever is learned from recommendation data and the ranker via knowledge distillation; knowing that the ranker is more precise, the knowledge distillation may provide extra weak-supervision signals for the improvement of retrieval quality. On the other hand, the ranker is trained by learning to discriminate the true positive items from hard negative candidates sampled from the retriever. As the iteration goes on, the ranker may become more precise, which in return gives rise to informative training signals for the retriever; meanwhile, with the improvement of the retriever, harder negative candidates can be sampled, which contributes to a higher discriminative capability of the ranker. To facilitate the effective conduct of CoRR, an asymptotically unbiased approximation of KL divergence is introduced for the knowledge distillation over sampled items; besides, a scalable and adaptive strategy is developed to efficiently sample from the retriever. Comprehensive experimental studies are performed over four large-scale benchmark datasets, where CoRR improves the overall recommendation quality thanks to the cooperation between retriever and ranker.
    Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). (arXiv:2206.14261v1 [cs.LG])
    Semi-supervised learning is the problem of training an accurate predictive model by combining a small labeled dataset with a presumably much larger unlabeled dataset. Many methods for semi-supervised deep learning have been developed, including pseudolabeling, consistency regularization, and contrastive learning techniques. Pseudolabeling methods however are highly susceptible to confounding, in which erroneous pseudolabels are assumed to be true labels in early iterations, thereby causing the model to reinforce its prior biases and thereby fail to generalize to strong predictive performance. We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). Like basic pseudolabeling, SCOPE is related to Expectation Maximization (EM), a latent variable framework which can be extended toward understanding cluster-assumption deep semi-supervised algorithms. However, unlike basic pseudolabeling which fails to adequately take into account the probability of the unlabeled samples given the model, SCOPE introduces an outlier suppression term designed to improve the behavior of EM iteration given a discrimination DNN backbone in the presence of outliers. Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline, and furthermore when combined with consistency regularization achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples. Moreover, we show that SCOPE reduces the prevalence of confounding errors during pseudolabeling iterations by pruning erroneous high-confidence pseudolabeled samples that would otherwise contaminate the labeled set in subsequent retraining iterations.
    SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences. (arXiv:2206.14550v1 [cs.AR])
    The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.
    Computer-aided diagnosis and prediction in brain disorders. (arXiv:2206.14683v1 [cs.LG])
    Computer-aided methods have shown added value for diagnosing and predicting brain disorders and can thus support decision making in clinical care and treatment planning. This chapter will provide insight into the type of methods, their working, their input data - such as cognitive tests, imaging and genetic data - and the types of output they provide. We will focus on specific use cases for diagnosis, i.e. estimating the current 'condition' of the patient, such as early detection and diagnosis of dementia, differential diagnosis of brain tumours, and decision making in stroke. Regarding prediction, i.e. estimation of the future 'condition' of the patient, we will zoom in on use cases such as predicting the disease course in multiple sclerosis and predicting patient outcomes after treatment in brain cancer. Furthermore, based on these use cases, we will assess the current state-of-the-art methodology and highlight current efforts on benchmarking of these methods and the importance of open science therein. Finally, we assess the current clinical impact of computer-aided methods and discuss the required next steps to increase clinical impact.
    Open Problem: Properly learning decision trees in polynomial time?. (arXiv:2206.14431v1 [cs.DS])
    The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.
    Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?. (arXiv:2206.14532v1 [cs.LG])
    This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning across multiple datasets and teacher-student architectures. Based on our analysis, we suggest practitioners to use an LS-trained teacher with a low-temperature transfer to achieve high performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
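    For concreteness, a standard temperature-scaled KD objective is sketched below (a generic formulation, not the paper's code); the paper's practical advice corresponds to training the teacher with label smoothing but keeping the transfer temperature T low, since systematic diffusion erodes the distillation signal as T grows.

        import torch
        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5):
            # Temperature-scaled knowledge distillation plus hard-label loss.
            soft = F.kl_div(
                F.log_softmax(student_logits / T, dim=1),
                F.softmax(teacher_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)                   # gradient rescaling for temperature
            hard = F.cross_entropy(student_logits, labels)
            return alpha * soft + (1.0 - alpha) * hard

        s, t = torch.randn(8, 10), torch.randn(8, 10)
        y = torch.randint(0, 10, (8,))
        print(kd_loss(s, t, y, T=1.0).item())   # low T, per the paper's recommendation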
    Can Push-forward Generative Models Fit Multimodal Distributions?. (arXiv:2206.14476v1 [stat.ML])
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are the Variational Autoencoders and the Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.
    Target alignment in truncated kernel ridge regression. (arXiv:2206.14255v1 [cs.LG])
Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of KRR. We focus on the truncated KRR (TKRR), which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an \emph{over-aligned} regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the \emph{alignment spectrum} and the generalization performance of kernel methods, both in terms of rates and in finite samples.
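To make the truncation step concrete, here is a minimal NumPy sketch of spectrally truncated KRR: the kernel matrix is eigendecomposed and only the top-k eigenpairs are kept before solving the ridge system. The RBF kernel, toy data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of truncated kernel ridge regression (TKRR).
# Kernel choice, k, lam, and gamma are illustrative assumptions.
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared distances -> Gaussian kernel matrix.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def tkrr_fit_predict(X, y, X_test, k, lam=1e-3, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    # Spectral truncation: keep only the top-k eigenpairs of K.
    w, V = np.linalg.eigh(K)          # eigenvalues in ascending order
    w, V = w[-k:], V[:, -k:]          # top-k eigenpairs
    K_k = (V * w) @ V.T               # rank-k approximation of K
    alpha = np.linalg.solve(K_k + lam * np.eye(len(X)), y)
    return rbf_kernel(X_test, X, gamma) @ alpha

# Toy usage: fit a 1-D target and vary k to probe the truncation effect.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)
X_test = np.linspace(-1, 1, 50)[:, None]
pred = tkrr_fit_predict(X, y, X_test, k=20)
```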
    Adversarial Ensemble Training by Jointly Learning Label Dependencies and Member Models. (arXiv:2206.14477v1 [cs.LG])
Training an ensemble of different sub-models has empirically proven to be an effective strategy for improving deep neural networks' adversarial robustness. Current ensemble training methods for image recognition usually encode image labels as one-hot vectors, which neglects dependency relationships between the labels. Here we propose a novel adversarial training approach that jointly learns the conditional dependencies between labels and the model ensemble. We test our approach on the widely used MNIST, FashionMNIST and CIFAR-10 datasets. Results show that our approach is more robust against black-box attacks than state-of-the-art methods. Our code is available at https://github.com/ZJLAB-AMMI/LSD.
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v1 [stat.ML])
In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of work has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this area, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results on these topics, we highlight several ongoing challenges and open problems.
    Towards Traffic Scene Description: The Semantic Scene Graph. (arXiv:2111.10196v2 [cs.LG] UPDATED)
For the classification of traffic scenes, a description model is necessary that can describe the scene in a uniform way, independent of its domain. This paper presents a model that describes a traffic scene semantically, independently of the road geometry and road topology. Traffic participants are projected onto the road network and represented as nodes in a graph. Depending on the relative location of two traffic participants with respect to the road topology, semantically classified edges are created between the corresponding nodes. For concretization, the edge attributes are extended with the relative distances and velocities between the two traffic participants along the course of the lane. An important aspect of the description is that it can easily be converted into a machine-readable format. The current description focuses on the dynamic objects of a traffic scene and considers traffic participants such as pedestrians and vehicles.
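A small sketch of the kind of graph this description implies, using networkx; the node and edge attribute names (kind, relation, rel_distance_m, rel_velocity_mps) are illustrative assumptions, not the paper's schema.

```python
# Hypothetical semantic scene graph: participants as nodes, semantically
# classified relations as edges with lane-relative distance/velocity.
import json
import networkx as nx

G = nx.DiGraph()
# Traffic participants projected onto the road network become nodes.
G.add_node("ego", kind="vehicle", lane=1)
G.add_node("ped_1", kind="pedestrian", lane=0)
G.add_node("car_2", kind="vehicle", lane=1)

# Edges carry a semantic class plus relative distance/velocity measured
# along the course of the lane.
G.add_edge("ego", "car_2", relation="following",
           rel_distance_m=12.4, rel_velocity_mps=-1.3)
G.add_edge("ego", "ped_1", relation="lateral_adjacent",
           rel_distance_m=3.1, rel_velocity_mps=0.0)

# Easily serialized to a machine-readable format, e.g. node-link JSON.
print(json.dumps(nx.node_link_data(G), indent=2))
```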
    Multiresolution Equivariant Graph Variational Autoencoder. (arXiv:2106.00967v3 [cs.LG] UPDATED)
    In this paper, we propose Multiresolution Equivariant Graph Variational Autoencoders (MGVAE), the first hierarchical generative model to learn and generate graphs in a multiresolution and equivariant manner. At each resolution level, MGVAE employs higher order message passing to encode the graph while learning to partition it into mutually exclusive clusters and coarsening into a lower resolution that eventually creates a hierarchy of latent distributions. MGVAE then constructs a hierarchical generative model to variationally decode into a hierarchy of coarsened graphs. Importantly, our proposed framework is end-to-end permutation equivariant with respect to node ordering. MGVAE achieves competitive results with several generative tasks including general graph generation, molecular generation, unsupervised molecular representation learning to predict molecular properties, link prediction on citation graphs, and graph-based image generation.
    Building Matters: Spatial Variability in Machine Learning Based Thermal Comfort Prediction in Winters. (arXiv:2206.14202v1 [cs.LG])
    Thermal comfort in indoor environments has an enormous impact on the health, well-being, and performance of occupants. Given the focus on energy efficiency and Internet-of-Things enabled smart buildings, machine learning (ML) is being increasingly used for data-driven thermal comfort (TC) prediction. Generally, ML-based solutions are proposed for air-conditioned or HVAC ventilated buildings and the models are primarily designed for adults. On the other hand, naturally ventilated (NV) buildings are the norm in most countries. They are also ideal for energy conservation and long-term sustainability goals. However, the indoor environment of NV buildings lacks thermal regulation and varies significantly across spatial contexts. These factors make TC prediction extremely challenging. Thus, determining the impact of the building environment on the performance of TC models is important. Further, the generalization capability of TC prediction models across different NV indoor spaces needs to be studied. This work addresses these problems. Data is gathered through month-long field experiments conducted in 5 naturally ventilated school buildings, involving 512 primary school students. The impact of spatial variability on student comfort is demonstrated through variation in prediction accuracy (by as much as 71%). The influence of building environment on TC prediction is also demonstrated through variation in feature importance. Further, a comparative analysis of spatial variability in model performance is done for children (our dataset) and adults (ASHRAE-II database). Finally, the generalization capability of thermal comfort models in NV classrooms is assessed and major challenges are highlighted.
Massively Increasing the Number of Antibody-Virus Interactions Across Studies. (arXiv:2206.14566v1 [q-bio.QM])
    A central challenge in every field of biology is to use existing measurements to predict the outcomes of future experiments. In this work, we consider the wealth of antibody inhibition data against variants of the influenza virus. Due to this virus's genetic diversity and evolvability, the variants examined in one study will often have little-to-no overlap with other studies, making it difficult to discern common patterns or unify datasets for further analysis. To that end, we develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We use this framework to greatly expand 7 influenza datasets utilizing hemagglutination inhibition, validating our method upon 200,000 existing measurements and predicting more than 2,000,000 new values along with their prediction uncertainties. This data-driven approach does not require any information beyond each virus's name and measurements, and even datasets with as few as 5 viruses can be expanded, making this approach widely applicable. Future influenza studies using hemagglutination inhibition can directly utilize our curated datasets to predict newly measured antibody responses against ~80 H3N2 influenza viruses from 1968-2011, whereas immunological studies utilizing other viruses or a different assay only need to find a single partially-overlapping dataset to extend their work. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" into "what anyone sees is what everyone gets."
    Learning Time Delay Systems with Neural Ordinary Differential Equations. (arXiv:2206.14288v1 [cs.LG])
A novel way of using neural networks to learn the dynamics of time delay systems from sequential data is proposed. A neural network with trainable delays is used to approximate the right-hand side of a delay differential equation. We relate the delay differential equation to an ordinary differential equation by discretizing the time history, and train the corresponding neural ordinary differential equation (NODE) to learn the dynamics. We give an example of learning the dynamics of the Mackey-Glass equation from data in its chaotic regime. After learning both the nonlinearity and the time delay, we demonstrate that the bifurcation diagram of the neural network matches that of the original system.
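A minimal PyTorch sketch of the underlying idea: approximate the right-hand side of x'(t) = f(x(t), x(t - tau)) with a small MLP, discretize the history, and integrate with explicit Euler. The network shape and the crude integer rounding of the delay are assumptions for illustration; the paper's treatment of the trainable delay is more careful (the rounding here breaks gradients through tau).

```python
# Sketch: learning a delay differential equation with a neural RHS.
import torch
import torch.nn as nn

class DelayRHS(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        # f_net approximates f(x(t), x(t - tau)).
        self.f_net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.log_tau = nn.Parameter(torch.tensor(0.0))  # trainable delay

    def forward(self, x_now, x_delayed):
        return self.f_net(torch.cat([x_now, x_delayed], dim=-1))

def rollout(model, history, dt, steps):
    # history: (L, 1) buffer of past states; the delay is converted to a
    # whole number of dt steps (a simplification that detaches tau).
    tau_steps = int(torch.exp(model.log_tau).item() / dt) + 1
    buf = list(history)
    for _ in range(steps):
        x_now = buf[-1]
        x_del = buf[-min(tau_steps, len(buf))]
        buf.append(x_now + dt * model(x_now, x_del))  # explicit Euler step
    return torch.stack(buf[len(history):])
```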
    Collecting high-quality adversarial data for machine reading comprehension tasks with humans and models in the loop. (arXiv:2206.14272v1 [cs.CL])
    We present our experience as annotators in the creation of high-quality, adversarial machine-reading-comprehension data for extractive QA for Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC). DADC is an emergent data collection paradigm with both models and humans in the loop. We set up a quasi-experimental annotation design and perform quantitative analyses across groups with different numbers of annotators focusing on successful adversarial attacks, cost analysis, and annotator confidence correlation. We further perform a qualitative analysis of our perceived difficulty of the task given the different topics of the passages in our dataset and conclude with recommendations and suggestions that might be of value to people working on future DADC tasks and related annotation interfaces.
    A Perturbation Bound on the Subspace Estimator from Canonical Projections. (arXiv:2206.14278v1 [stat.ML])
    This paper derives a perturbation bound on the optimal subspace estimator obtained from a subset of its canonical projections contaminated by noise. This fundamental result has important implications in matrix completion, subspace clustering, and related problems.
    Framing Algorithmic Recourse for Anomaly Detection. (arXiv:2206.14384v1 [cs.LG])
The problem of algorithmic recourse has been explored for supervised machine learning models, to provide more interpretable, transparent and robust outcomes from decision support systems. An unexplored area is algorithmic recourse for anomaly detection, specifically for tabular data with only discrete feature values. Here the problem is to present a set of counterfactuals that are deemed normal by the underlying anomaly detection model, so that applications can use this information for explanation purposes or to recommend countermeasures. We present an approach -- Context preserving Algorithmic Recourse for Anomalies in Tabular data (CARAT) -- that is effective, scalable, and agnostic to the underlying anomaly detection model. CARAT uses a transformer-based encoder-decoder model to explain an anomaly by finding features with low likelihood. Subsequently, semantically coherent counterfactuals are generated by modifying the highlighted features, using the overall context of the features in the anomalous instance(s). Extensive experiments demonstrate the efficacy of CARAT.
    Extracting Weighted Finite Automata from Recurrent Neural Networks for Natural Languages. (arXiv:2206.14621v1 [cs.CL])
Recurrent Neural Networks (RNNs) have achieved tremendous success in sequential data processing. However, it is quite challenging to interpret and verify RNNs' behaviors directly. To this end, many efforts have been made to extract finite automata from RNNs. Existing approaches such as exact learning are effective in extracting finite-state models that characterize the state dynamics of RNNs for formal languages, but their scalability to natural languages is limited. Compositional approaches that are scalable to natural languages fall short in extraction precision. In this paper, we identify the transition sparsity problem that heavily impacts extraction precision. To address this problem, we propose a transition rule extraction approach that is scalable to natural language processing models and effective in improving extraction precision. Specifically, we propose an empirical method to complement the missing rules in the transition diagram. In addition, we adjust the transition matrices to enhance the context-awareness of the extracted weighted finite automaton (WFA). Finally, we propose two data augmentation tactics to track more dynamic behaviors of the target RNN. Experiments on two popular natural language datasets show that our method can extract WFAs from RNNs for natural language processing with better precision than existing approaches.
    Cross-Silo Heterogeneous Model Federated Multitask Learning. (arXiv:2202.08603v3 [cs.LG] UPDATED)
Federated learning (FL) is a machine learning technique that enables participants to collaboratively train high-quality models without exchanging their private data. Participants in cross-silo FL (CS-FL) settings are independent organizations with different task needs; they are concerned not only with data privacy but also with training their unique models independently, due to intellectual property considerations. Most existing FL methods cannot satisfy these scenarios. In this paper, we propose an FL method based on the pseudolabeling of unlabeled data via a process such as co-training. To the best of our knowledge, this is the first FL method that is simultaneously compatible with heterogeneous tasks, heterogeneous models, and heterogeneous training algorithms. Experimental results show that the proposed method achieves better performance than competing ones. This is especially true for non-independent and identically distributed (non-IID) settings and heterogeneous models, where the proposed method achieves a 35% performance improvement.
    Forgetting Data from Pre-trained GANs. (arXiv:2206.14389v1 [cs.LG])
    Large pre-trained generative models are known to occasionally provide samples that may be undesirable for various reasons. The standard way to mitigate this is to re-train the models differently. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it forgets certain kinds of samples. We provide three different algorithms for GANs that differ on how the samples to be forgotten are described. Extensive evaluations on real-world image datasets show that our algorithms are capable of forgetting data while retaining high generation quality at a fraction of the cost of full re-training.
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v1 [cs.LG])
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a {\em Markov chain}, and we can only choose them by selecting a {\em policy} controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- \textsc{markov-design} -- that efficiently selects policies whose measurement allocation \emph{provably converges to the optimal one}. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.
    Towards Robust Waveform-Based Acoustic Models. (arXiv:2110.08634v2 [cs.SD] UPDATED)
    We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.
    Cyclical Kernel Adaptive Metropolis. (arXiv:2206.14421v1 [cs.LG])
We propose cKAM, cyclical Kernel Adaptive Metropolis, which incorporates a cyclical stepsize scheme to allow control over exploration and sampling. We show that on a crafted bimodal distribution, existing Adaptive Metropolis-type algorithms fail to converge to the true posterior distribution. We point out that this is because adaptive samplers estimate the local/global covariance structure using the past history of the chain, which can leave adaptive algorithms trapped in a local mode. We demonstrate that cKAM encourages exploration of the posterior distribution and allows the sampler to escape from a local mode, while maintaining the high performance of adaptive methods.
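A sketch of the cyclical stepsize idea on top of plain random-walk Metropolis (not the full kernel-adaptive algorithm); the cosine schedule and its constants are illustrative assumptions. Large steps early in each cycle encourage exploration; small steps late in the cycle refine sampling around the current mode.

```python
# Cyclical stepsize schedule for a random-walk Metropolis sampler.
import numpy as np

def cyclical_step(i, period=500, s_min=0.05, s_max=2.0):
    # Cosine decay within each cycle: s_max at the start, s_min at the end.
    phase = (i % period) / period
    return s_min + 0.5 * (s_max - s_min) * (1 + np.cos(np.pi * phase))

def metropolis(log_p, x0, n_iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    x, samples = np.asarray(x0, float), []
    for i in range(n_iters):
        prop = x + cyclical_step(i) * rng.normal(size=x.shape)
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop
        samples.append(x.copy())
    return np.array(samples)

# Bimodal toy target: equal-weight mixture of two well-separated Gaussians.
log_p = lambda x: np.logaddexp(-0.5 * ((x - 4) ** 2).sum(),
                               -0.5 * ((x + 4) ** 2).sum())
chain = metropolis(log_p, x0=[4.0])
```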
    DDKtor: Automatic Diadochokinetic Speech Analysis. (arXiv:2206.14639v1 [eess.AS])
Diadochokinetic speech tasks (DDK), in which participants repeatedly produce syllables, are commonly used as part of the assessment of speech motor impairments. These studies rely on manual analyses that are time-intensive, subjective, and provide only a coarse-grained picture of speech. This paper presents two deep neural network models that automatically segment consonants and vowels from unannotated, untranscribed speech. Both models work on the raw waveform and use convolutional layers for feature extraction. The first model is based on an LSTM classifier followed by fully connected layers, while the second model adds more convolutional layers followed by fully connected layers. The segmentations predicted by the models are used to obtain measures of speech rate and sound duration. Results on a dataset of young, healthy individuals show that our LSTM model outperforms the current state-of-the-art systems and performs comparably to trained human annotators. Moreover, the LSTM model also produces results comparable to trained human annotators when evaluated on an unseen dataset of older individuals with Parkinson's disease.
    Approximate Data Deletion in Generative Models. (arXiv:2206.14439v1 [cs.LG])
    Users have the right to have their data deleted by third-party learned systems, as codified by recent legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Such data deletion can be accomplished by full re-training, but this incurs a high computational cost for modern machine learning models. To avoid this cost, many approximate data deletion methods have been developed for supervised learning. Unsupervised learning, in contrast, remains largely an open problem when it comes to (approximate or exact) efficient data deletion. In this paper, we propose a density-ratio-based framework for generative models. Using this framework, we introduce a fast method for approximate data deletion and a statistical test for estimating whether or not training points have been deleted. We provide theoretical guarantees under various learner assumptions and empirically demonstrate our methods across a variety of generative methods.
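For intuition on the density-ratio framework, here is a sketch of the standard classifier-based density-ratio trick (a common estimator, not necessarily the paper's exact construction): a probabilistic classifier trained to distinguish samples from the two generators yields odds that estimate the ratio of their densities, which should be low near genuinely deleted points.

```python
# Sketch: classifier-based density-ratio estimation between the samples of
# a generator before and after approximate deletion.
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_fn(samples_before, samples_after):
    X = np.vstack([samples_before, samples_after])
    z = np.r_[np.zeros(len(samples_before)), np.ones(len(samples_after))]
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def ratio(x):
        # Classifier odds approximate p_after(x) / p_before(x).
        p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        return p / np.clip(1 - p, 1e-12, None)

    return ratio
```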
    On the power of adaptivity in statistical adversaries. (arXiv:2111.10352v2 [cs.LG] UPDATED)
    We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution $\mathcal{D}$ and adaptive adversaries that can have their corruptions depend on the specific sample $S$ that is drawn from $\mathcal{D}$. In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well-approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.
    Modeling Teams Performance Using Deep Representational Learning on Graphs. (arXiv:2206.14741v1 [cs.SI])
The large majority of human activities require collaboration within and across formal or informal teams. Our understanding of how the collaborative effort spent by teams relates to their performance is still a matter of debate. Teamwork results in a highly interconnected ecosystem of potentially overlapping components where tasks are performed in interaction with team members and across other teams. To tackle this problem, we propose a graph neural network model designed to predict a team's performance while identifying the drivers that determine such an outcome. In particular, the model is based on three architectural channels: topological, centrality, and contextual, which capture different factors potentially shaping teams' success. We endow the model with two attention mechanisms to boost model performance and allow interpretability. The first mechanism pinpoints key members inside the team. The second quantifies the contributions of the three driver effects in determining the outcome performance. We test model performance on a wide range of domains, outperforming most of the classical and neural baselines considered. Moreover, we include synthetic datasets specifically designed to validate how the model disentangles the intended properties, on which our model vastly outperforms the baselines.
    MurTree: Optimal Classification Trees via Dynamic Programming and Search. (arXiv:2007.12652v4 [cs.LG] UPDATED)
    Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.
    An extensible Benchmarking Graph-Mesh dataset for studying Steady-State Incompressible Navier-Stokes Equations. (arXiv:2206.14709v1 [cs.LG])
Recent progress in \emph{Geometric Deep Learning} (GDL) has shown its potential to provide powerful data-driven models. This gives momentum to explore new methods for learning physical systems governed by \emph{Partial Differential Equations} (PDEs) from graph-mesh data. However, despite the efforts and recent achievements, several research directions remain unexplored and progress is still far from satisfying the physical requirements of real-world phenomena. One of the major impediments is the absence of benchmarking datasets and common physics evaluation protocols. In this paper, we propose a 2-D graph-mesh dataset to study the airflow over airfoils in the high-Reynolds regime ($10^6$ and beyond). We also introduce metrics on the stress forces over the airfoil in order to evaluate GDL models on important physical quantities. Moreover, we provide extensive GDL baselines.
    Cut Inner Layers: A Structured Pruning Strategy for Efficient U-Net GANs. (arXiv:2206.14658v1 [cs.LG])
Pruning effectively compresses overparameterized models. Despite the success of pruning methods for discriminative models, applying them to generative models has rarely been explored. This study conducts structured pruning on the U-Net generators of conditional GANs. A per-layer sensitivity analysis confirms that many unnecessary filters exist in the innermost layers near the bottleneck and can be substantially pruned. Based on this observation, we prune these filters from multiple inner layers, or suggest alternative architectures by completely eliminating the layers. We evaluate our approach with Pix2Pix for image-to-image translation and Wav2Lip for speech-driven talking face generation. Our method outperforms global pruning baselines, demonstrating the importance of properly considering where to prune for U-Net generators.
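A minimal sketch of what L1-structured filter pruning restricted to inner layers could look like with PyTorch's built-in pruning utilities; the layer names and the 50% ratio are assumptions, and this zeroes filters rather than physically slimming the architecture (which would require rebuilding the layers).

```python
# Sketch: L1-structured pruning applied only to selected inner conv layers
# of a U-Net-style generator.
import torch.nn.utils.prune as prune

def prune_inner_layers(unet, inner_layer_names, amount=0.5):
    for name, module in unet.named_modules():
        if name in inner_layer_names:
            # Remove whole output filters (dim=0) ranked by L1 norm (n=1).
            prune.ln_structured(module, name="weight",
                                amount=amount, n=1, dim=0)
            prune.remove(module, "weight")  # bake the mask into the weights

# Hypothetical usage with made-up layer names near the bottleneck:
# prune_inner_layers(gen, {"enc4.conv", "bottleneck.conv", "dec4.conv"})
```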
    IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound. (arXiv:2206.14772v1 [cs.LG])
    Recent works have tried to increase the verifiability of adversarially trained networks by running the attacks over domains larger than the original perturbations and adding various regularization terms to the objective. However, these algorithms either underperform or require complex and expensive stage-wise training procedures, hindering their practical applicability. We present IBP-R, a novel verified training algorithm that is both simple and effective. IBP-R induces network verifiability by coupling adversarial attacks on enlarged domains with a regularization term, based on inexpensive interval bound propagation, that minimizes the gap between the non-convex verification problem and its approximations. By leveraging recent branch-and-bound frameworks, we show that IBP-R obtains state-of-the-art verified robustness-accuracy trade-offs for small perturbations on CIFAR-10 while training significantly faster than relevant previous work. Additionally, we present UPB, a novel branching strategy that, relying on a simple heuristic based on $\beta$-CROWN, reduces the cost of state-of-the-art branching algorithms while yielding splits of comparable quality.
    On-device Synaptic Memory Consolidation using Fowler-Nordheim Quantum-tunneling. (arXiv:2206.14581v1 [cs.ET])
    Synaptic memory consolidation has been heralded as one of the key mechanisms for supporting continual learning in neuromorphic Artificial Intelligence (AI) systems. Here we report that a Fowler-Nordheim (FN) quantum-tunneling device can implement synaptic memory consolidation similar to what can be achieved by algorithmic consolidation models like the cascade and the elastic weight consolidation (EWC) models. The proposed FN-synapse not only stores the synaptic weight but also stores the synapse's historical usage statistic on the device itself. We also show that the operation of the FN-synapse is near-optimal in terms of the synaptic lifetime and we demonstrate that a network comprising FN-synapses outperforms a comparable EWC network for a small benchmark continual learning task. With an energy footprint of femtojoules per synaptic update, we believe that the proposed FN-synapse provides an ultra-energy-efficient approach for implementing both synaptic memory consolidation and persistent learning.
    From Kernel Methods to Neural Networks: A Unifying Variational Formulation. (arXiv:2206.14625v1 [cs.LG])
    The minimization of a data-fidelity term and an additive regularization functional gives rise to a powerful framework for supervised learning. In this paper, we present a unifying regularization functional that depends on an operator and on a generic Radon-domain norm. We establish the existence of a minimizer and give the parametric form of the solution(s) under very mild assumptions. When the norm is Hilbertian, the proposed formulation yields a solution that involves radial-basis functions and is compatible with the classical methods of machine learning. By contrast, for the total-variation norm, the solution takes the form of a two-layer neural network with an activation function that is determined by the regularization operator. In particular, we retrieve the popular ReLU networks by letting the operator be the Laplacian. We also characterize the solution for the intermediate regularization norms $\|\cdot\|=\|\cdot\|_{L_p}$ with $p\in(1,2]$. Our framework offers guarantees of universal approximation for a broad family of regularization operators or, equivalently, for a wide variety of shallow neural networks, including the cases (such as ReLU) where the activation function is increasing polynomially. It also explains the favorable role of bias and skip connections in neural architectures.
    Hardness and Algorithms for Robust and Sparse Optimization. (arXiv:2206.14354v1 [cs.LG])
    We explore algorithms and limitations for sparse optimization problems such as sparse linear regression and robust linear regression. The goal of the sparse linear regression problem is to identify a small number of key features, while the goal of the robust linear regression problem is to identify a small number of erroneous measurements. Specifically, the sparse linear regression problem seeks a $k$-sparse vector $x\in\mathbb{R}^d$ to minimize $\|Ax-b\|_2$, given an input matrix $A\in\mathbb{R}^{n\times d}$ and a target vector $b\in\mathbb{R}^n$, while the robust linear regression problem seeks a set $S$ that ignores at most $k$ rows and a vector $x$ to minimize $\|(Ax-b)_S\|_2$. We first show bicriteria, NP-hardness of approximation for robust regression building on the work of [OWZ15] which implies a similar result for sparse regression. We further show fine-grained hardness of robust regression through a reduction from the minimum-weight $k$-clique conjecture. On the positive side, we give an algorithm for robust regression that achieves arbitrarily accurate additive error and uses runtime that closely matches the lower bound from the fine-grained hardness result, as well as an algorithm for sparse regression with similar runtime. Both our upper and lower bounds rely on a general reduction from robust linear regression to sparse regression that we introduce. Our algorithms, inspired by the 3SUM problem, use approximate nearest neighbor data structures and may be of independent interest for solving sparse optimization problems. For instance, we demonstrate that our techniques can also be used for the well-studied sparse PCA problem.
    Deformable Graph Transformer. (arXiv:2206.14337v1 [cs.LG])
Transformer-based models have been widely used and have achieved state-of-the-art performance in various domains such as natural language processing and computer vision. Recent works show that Transformers can also be generalized to graph-structured data. However, the success is limited to small-scale graphs due to technical challenges such as the quadratic complexity with respect to the number of nodes and non-local aggregation, which often leads to inferior generalization performance compared to conventional graph neural networks. In this paper, to address these issues, we propose the Deformable Graph Transformer (DGT), which performs sparse attention with dynamically sampled key and value pairs. Specifically, our framework first constructs multiple node sequences with various criteria to consider both structural and semantic proximity. Then, sparse attention is applied to the node sequences to learn node representations with reduced computational cost. We also design simple and effective positional encodings to capture structural similarity and distance between nodes. Experiments demonstrate that our novel graph Transformer consistently outperforms existing Transformer-based models and shows competitive performance compared to state-of-the-art models on 8 graph benchmark datasets, including large-scale graphs.
    Two-Stage Neural Contextual Bandits for Personalised News Recommendation. (arXiv:2206.14648v1 [cs.IR])
We consider the problem of personalised news recommendation where each user consumes news in a sequential fashion. Existing personalised news recommendation methods focus on exploiting user interests and ignore exploration in recommendation, which leads to biased feedback loops and hurts recommendation quality in the long term. We build on contextual bandit recommendation strategies, which naturally address the exploitation-exploration trade-off. The main challenges are the computational efficiency of exploring the large-scale item space and utilising deep representations with uncertainty. We propose a two-stage hierarchical topic-news deep contextual bandits framework to efficiently learn user preferences when there are many news items. We use deep learning representations for users and news, and generalise the neural upper confidence bound (UCB) policies to generalised additive UCB and bilinear UCB. Empirical results on a large-scale news recommendation dataset show that our proposed policies are efficient and outperform the baseline bandit policies.
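For reference, the UCB principle the paper generalises can be stated in its simplest linear form (LinUCB): select the arm with the highest mean estimate plus an uncertainty bonus. This is a textbook sketch, not the paper's generalised additive or bilinear policy.

```python
# Minimal LinUCB sketch: mean estimate + exploration bonus per arm.
import numpy as np

class LinUCB:
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)     # regularised design matrix
        self.b = np.zeros(dim)   # reward-weighted feature sum
        self.alpha = alpha       # exploration strength

    def select(self, arm_features):  # arm_features: (n_arms, dim)
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        # Quadratic-form uncertainty bonus for each arm.
        bonus = np.sqrt(np.einsum("ad,dk,ak->a",
                                  arm_features, A_inv, arm_features))
        return int(np.argmax(arm_features @ theta + self.alpha * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```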
    No imputation without representation. (arXiv:2206.14254v1 [cs.LG])
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
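The recommended default (mean imputation plus missing-indicators, with pruning for decision trees) is a one-liner in scikit-learn; this is a minimal sketch with toy data, and the ccp_alpha value is an illustrative assumption.

```python
# Mean imputation with missing-indicators appended as extra features.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 1.0]])
y = np.array([0, 1, 0, 1])

# add_indicator=True appends one binary column per feature with missing
# values, preserving the missingness information that imputation discards.
clf = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    DecisionTreeClassifier(ccp_alpha=0.01),  # pruned, per the paper's caveat
)
clf.fit(X, y)
```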
    Variational Quantum Approximate Support Vector Machine With Inference Transfer. (arXiv:2206.14507v1 [quant-ph])
A kernel-based quantum classifier is the most interesting and powerful quantum machine learning technique for hyperlinear classification of complex data, and can easily be realized in shallow-depth quantum circuits such as a SWAP-test classifier. Surprisingly, a support vector machine can be realized inherently and explicitly on these circuits by introducing a variational scheme that maps the quadratic optimization problem of SVM theory to a quantum-classical variational optimization problem. This scheme is realized with parameterized quantum circuits (PQC) that create a nonuniform weight vector to index qubits, and can evaluate the training loss and classification score in linear time. We train the classical parameters of this Variational Quantum Approximate Support Vector Machine (VQASVM), which can be transferred to many copies of other VQASVM decision inference circuits for the classification of new query data. Our VQASVM algorithm is tested on toy example data sets on cloud-based quantum machines for feasibility evaluation, and is numerically investigated on a standard iris flower data set. The accuracy of iris data classification reached 98.8%.
    Multistep Automated Data Labelling Procedure (MADLaP) for Thyroid Nodules on Ultrasound: An Artificial Intelligence Approach for Automating Image Annotation. (arXiv:2206.14305v1 [eess.IV])
Machine learning (ML) for the diagnosis of thyroid nodules on ultrasound is an active area of research. However, ML tools require large, well-labelled datasets, the curation of which is time-consuming and labor-intensive. The purpose of our study was to develop and test a deep-learning-based tool to facilitate and automate the data annotation process for thyroid nodules; we named our tool the Multistep Automated Data Labelling Procedure (MADLaP). MADLaP was designed to take multiple inputs, including pathology reports, ultrasound images, and radiology reports. Using multiple step-wise modules, including rule-based natural language processing, deep-learning-based image segmentation, and optical character recognition, MADLaP automatically identified images of a specific thyroid nodule and correctly assigned a pathology label. The model was developed using a training set of 378 patients across our health system and tested on a separate set of 93 patients. Ground truths for both sets were selected by an experienced radiologist. Performance metrics, including yield (how many labeled images the model produced) and accuracy (percentage correct), were measured using the test set. MADLaP achieved a yield of 63% and an accuracy of 83%. The yield progressively increased as the input data moved through each module, while accuracy peaked part way through. Error analysis showed that inputs from certain examination sites had lower accuracy (40%) than the other sites (90%, 100%). MADLaP successfully created curated datasets of labeled ultrasound images of thyroid nodules. While accurate, the relatively suboptimal yield of MADLaP exposed some challenges in automatically labeling radiology images from heterogeneous sources. The complex task of image curation and annotation could be automated, allowing for the enrichment of larger datasets for use in machine learning development.
    Convolutional Neural Network Based Partial Face Detection. (arXiv:2206.14350v1 [cs.CV])
Due to the massive expansion of artificial intelligence, machine learning technology is being used in many areas of our day-to-day lives. There are many scenarios where a simple crime could be prevented before it happens, or the person responsible could be identified. A face is one distinctive feature that we have and that easily differentiates us from many other species; it also plays a significant role in distinguishing individuals within our own species. Regarding this critical feature, one problem occurs frequently: a camera may fail to detect a person's face, or produce a poor image. Likewise, when a robbery is recorded by a low-quality security camera, the robber's identity is often nearly indistinguishable. A well-designed detection algorithm, however, reduces the hardware requirements and the cost of addressing this problem. Facial recognition, widget control, and similar applications all depend on detecting the face correctly. This study aims to create and enhance a machine learning model that correctly detects faces. A total of 627 face images were collected from different Bangladeshi people at four angles. Five machine learning approaches - CNN, Haar Cascade, Cascaded CNN, Deep CNN, and MTCNN - were implemented to obtain the best accuracy on our dataset. After training and evaluating the models, the Multi-Task Convolutional Neural Network (MTCNN) achieved the best accuracy, 96.2%, on the training data, outperforming the other machine learning models.
    GERNERMED++: Transfer Learning in German Medical NLP. (arXiv:2206.14504v1 [cs.CL])
We present a statistical model for German medical natural language processing trained for named entity recognition (NER) as an open, publicly available model. The work serves as a refined successor to our first GERNERMED model, which it substantially outperforms. We demonstrate the effectiveness of combining multiple techniques to achieve strong entity recognition performance by means of transfer learning on pretrained deep language models (LM), word alignment and neural machine translation. Given the scarcity of open, public medical entity recognition models for German texts, this work offers the German medical NLP research community a baseline model. Since our model is based on public English data, its weights are provided without legal restrictions on usage and distribution. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED-pp
    Exploiting Semantic Role Contextualized Video Features for Multi-Instance Text-Video Retrieval EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. (arXiv:2206.14381v1 [cs.CV])
In this report, we present our approach for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; we then use self-attention to exploit semantic-role-contextualized video features along with textual features, via triplet losses in multiple embedding spaces. Our method surpasses the strong baseline in normalized Discounted Cumulative Gain (nDCG), which is more valuable for semantic similarity. Our submission is ranked 3rd for nDCG and 4th for mAP.
    Using Twitter Data to Understand Public Perceptions of Approved versus Off-label Use for COVID-19-related Medications. (arXiv:2206.14358v1 [cs.CY])
    Understanding public discourse on emergency use of unproven therapeutics is essential to monitor safe use and combat misinformation. We developed a natural language processing (NLP)-based pipeline to understand public perceptions of and stances on COVID-19-related drugs on Twitter across time. This retrospective study included 609,189 US-based tweets between January 29th, 2020 and November 30th, 2021 on four drugs that gained wide public attention during the COVID-19 pandemic: 1) Hydroxychloroquine and Ivermectin, drug therapies with anecdotal evidence; and 2) Molnupiravir and Remdesivir, FDA-approved treatment options for eligible patients. Time-trend analysis was used to understand the popularity and related events. Content and demographic analyses were conducted to explore potential rationales of people's stances on each drug. Time-trend analysis revealed that Hydroxychloroquine and Ivermectin received much more discussion than Molnupiravir and Remdesivir, particularly during COVID-19 surges. Hydroxychloroquine and Ivermectin were highly politicized, related to conspiracy theories, hearsay, celebrity effects, etc. The distribution of stance between the two major US political parties was significantly different (p<0.001); Republicans were much more likely to support Hydroxychloroquine (+55%) and Ivermectin (+30%) than Democrats. People with healthcare backgrounds tended to oppose Hydroxychloroquine (+7%) more than the general population; in contrast, the general population was more likely to support Ivermectin (+14%). We make all the data, code, and models available at https://github.com/ningkko/COVID-drug.
    Gaussian Latent Dirichlet Allocation for Discrete Human State Discovery. (arXiv:2206.14233v1 [cs.LG])
In this article we propose and validate an unsupervised probabilistic model, Gaussian Latent Dirichlet Allocation (GLDA), for the problem of discrete state discovery from repeated, multivariate psychophysiological samples collected from multiple, inherently distinct, individuals. Psychology and medical research heavily involve measuring potentially related but individually inconclusive variables from a cohort of participants to derive a diagnosis, necessitating clustering analysis. Traditional probabilistic clustering models such as the Gaussian Mixture Model (GMM) assume a global mixture of component distributions, which may not be realistic for observations from different patients. The GLDA model borrows the individual-specific mixture structure from Latent Dirichlet Allocation (LDA), a popular topic model in Natural Language Processing, and merges it with the Gaussian component distributions of GMM to suit continuous-type data. We implemented GLDA using STAN (a probabilistic modeling language) and applied it to two datasets, one containing Ecological Momentary Assessments (EMA) and the other heart measures from electrocardiogram and impedance cardiography. We found that in both datasets the GLDA-learned class weights achieved significantly higher correlations with clinically assessed depression, anxiety, and stress scores than those produced by the baseline GMM. Our findings demonstrate the advantage of GLDA over conventional finite mixture models for human state discovery from repeated multivariate data, likely due to better characterization of potential underlying between-participant differences. Future work is required to validate the utility of this model on a broader range of applications.
RegMixup: Mixup as a Regularizer Can Surprisingly Improve Accuracy and Out-of-Distribution Robustness. (arXiv:2206.14502v1 [cs.LG])
We show that the effectiveness of the well-celebrated Mixup [Zhang et al., 2018] can be further improved if, instead of using it as the sole learning objective, it is utilized as an additional regularizer to the standard cross-entropy loss. This simple change not only provides much improved accuracy but also significantly improves the quality of the predictive uncertainty estimation of Mixup in most cases, under various forms of covariate shift and out-of-distribution detection experiments. In fact, we observe that Mixup yields much degraded performance on detecting out-of-distribution samples, possibly, as we show empirically, because of its tendency to learn models that exhibit high entropy throughout, making it difficult to differentiate in-distribution samples from out-of-distribution ones. To show the efficacy of our approach (RegMixup), we provide thorough analyses and experiments on vision datasets (ImageNet & CIFAR-10/100) and compare it with a suite of recent approaches for reliable uncertainty estimation.
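A minimal PyTorch sketch of the idea as the abstract describes it: keep the standard cross-entropy on the clean batch and add a mixup cross-entropy term as a regulariser. The Beta concentration and the weighting coefficient are illustrative assumptions, not the paper's reported settings.

```python
# Sketch: cross-entropy plus a mixup term as an additional regulariser.
import torch
import torch.nn.functional as F

def regmixup_loss(model, x, y, alpha=10.0, eta=1.0):
    # Standard cross-entropy on the clean batch.
    loss_clean = F.cross_entropy(model(x), y)

    # Mixup term: interpolate inputs and targets within the batch.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    logits = model(x_mix)
    loss_mix = lam * F.cross_entropy(logits, y) + \
               (1 - lam) * F.cross_entropy(logits, y[perm])

    return loss_clean + eta * loss_mix
```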
    Intrinsic Anomaly Detection for Multi-Variate Time Series. (arXiv:2206.14342v1 [cs.LG])
We introduce a novel, practically relevant variation of the anomaly detection problem in multivariate time series: intrinsic anomaly detection. It appears in diverse practical scenarios, ranging from DevOps to IoT, where we want to recognize failures of a system that operates under the influence of a surrounding environment. Intrinsic anomalies are changes in the functional dependency structure between time series that represent an environment and time series that represent the internal state of a system placed in that environment. We formalize this problem, provide under-studied public and new purpose-built data sets for it, and present methods that handle intrinsic anomaly detection. These address the shortcoming of existing anomaly detection methods, which cannot differentiate between expected changes in the system's state and unexpected ones, i.e., changes in the system that deviate from the environment's influence. Our most promising approach is fully unsupervised and combines adversarial learning and time series representation learning, thereby addressing problems such as label sparsity and subjectivity, while making it possible to navigate and improve notoriously problematic anomaly detection data sets.
    Comparative Study of Inference Methods for Interpolative Decomposition. (arXiv:2206.14542v1 [cs.LG])
In this paper, we propose a probabilistic model with automatic relevance determination (ARD) for learning interpolative decomposition (ID), which is commonly used for low-rank approximation, feature selection, and identifying hidden patterns in data, where the matrix factors are latent variables associated with each data dimension. Prior densities with support on the specified subspace are used to address the constraint on the magnitude of the factored component of the observed matrix. A Bayesian inference procedure based on Gibbs sampling is employed. We evaluate the model on a variety of real-world datasets, including the CCLE $EC50$, CCLE $IC50$, Gene Body Methylation, and Promoter Methylation datasets of different sizes and dimensions, and show that the proposed Bayesian ID algorithms with automatic relevance determination lead to smaller reconstruction errors, even compared to vanilla Bayesian ID algorithms with the latent dimension fixed to the matrix rank.
    GAN-based Intrinsic Exploration For Sample Efficient Reinforcement Learning. (arXiv:2206.14256v1 [cs.LG])
In this study, we address the problem of efficient exploration in reinforcement learning. The most common exploration approaches depend on random action selection; however, these approaches do not work well in environments with sparse or no rewards. We propose a Generative Adversarial Network-based Intrinsic Reward Module that learns the distribution of the observed states and computes an intrinsic reward that is high for states that are out of distribution, in order to lead the agent to unexplored states. We evaluate our approach in Super Mario Bros for a no-reward setting and in Montezuma's Revenge for a sparse-reward setting, and show that our approach is indeed capable of exploring efficiently. We discuss a few weaknesses and conclude by discussing future work.
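A sketch of the discriminator-as-novelty-detector idea, assuming PyTorch; the network shape, feature dimension, and reward scaling are illustrative assumptions. A discriminator trained on visited states scores familiar states high, so low scores (states that look out of distribution) translate into a high intrinsic reward.

```python
# Sketch: GAN-discriminator-based intrinsic reward for exploration.
import torch
import torch.nn as nn

# Hypothetical discriminator over 64-dimensional state features.
disc = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                     nn.Linear(128, 1), nn.Sigmoid())

def intrinsic_reward(state_feats):
    # High when the discriminator thinks the state was NOT seen before.
    with torch.no_grad():
        return (1.0 - disc(state_feats)).squeeze(-1)

# During training, disc is fit to output 1 on previously observed states
# and 0 on generator samples, so its score tracks state familiarity.
```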
    An Empirical Study of Challenges in Converting Deep Learning Models. (arXiv:2206.14322v1 [cs.LG])
There is an increase in the deployment of Deep Learning (DL)-based software systems in real-world applications. DL models are usually developed and trained using DL frameworks that have their own internal mechanisms/formats to represent and train DL models, and usually those formats cannot be recognized by other frameworks. Moreover, trained models are usually deployed in environments different from where they were developed. To solve the interoperability issue and make DL models compatible with different frameworks/environments, some exchange formats have been introduced for DL models, such as ONNX and CoreML. However, ONNX and CoreML have never been empirically evaluated by the community to reveal their prediction accuracy, performance, and robustness after conversion. Poor accuracy or non-robust behavior of converted models may lead to poor quality of deployed DL-based software systems. In this paper, we conduct the first empirical study to assess ONNX and CoreML for converting trained DL models. In our systematic approach, two popular DL frameworks, Keras and PyTorch, are used to train five widely used DL models on three popular datasets. The trained models are then converted to ONNX and CoreML and transferred to two runtime environments designated for such formats, to be evaluated. We investigate the prediction accuracy before and after conversion. Our results show that the prediction accuracy of converted models is at the same level as the originals. The performance (time cost and memory consumption) of converted models is studied as well. The size of the models is reduced after conversion, which can result in optimized DL-based software deployment. Converted models are generally assessed as robust at the same level as the originals. However, the obtained results show that CoreML models are more vulnerable to adversarial attacks compared to ONNX.
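For reference, the conversion-and-check step the study performs at scale can be sketched in a few lines: export a PyTorch model to ONNX, reload it with onnxruntime, and compare predictions against the original. The model architecture and tolerance here are illustrative assumptions.

```python
# Sketch: export a toy PyTorch model to ONNX and verify the predictions.
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Toy model: Conv2d(3->8, k=3) on 32x32 input yields 8*30*30 = 7200 features.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(7200, 10)).eval()
dummy = torch.randn(1, 3, 32, 32)

torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"])

sess = ort.InferenceSession("model.onnx")
onnx_out = sess.run(None, {"input": dummy.numpy()})[0]

# Converted predictions should match the originals up to float tolerance.
assert np.allclose(onnx_out, model(dummy).detach().numpy(), atol=1e-5)
```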
    Diagnosis and Prognosis of COVID-19 Disease Using Routine Blood Values and LogNNet Neural Network. (arXiv:2205.09974v2 [cs.LG] UPDATED)
Since February 2020, the world has been engaged in an intense struggle with COVID-19, and health systems have come under tragic pressure as the disease turned into a pandemic. The aim of this study is to obtain the most effective routine blood values (RBV) in the diagnosis and prognosis of COVID-19, using a backward feature elimination algorithm for the LogNNet reservoir neural network. The first dataset in the study consists of a total of 5296 patients with an equal number of negative and positive COVID-19 tests. The LogNNet model achieved an accuracy of 99.5% in the diagnosis of the disease with 46 features, and an accuracy of 99.17% using only mean corpuscular hemoglobin concentration, mean corpuscular hemoglobin, and activated partial prothrombin time. The second dataset consists of a total of 3899 patients with a diagnosis of COVID-19 who were treated in hospital, of which 203 were severe patients and 3696 were mild patients. The model reached an accuracy of 94.4% in determining the prognosis of the disease with 48 features, and an accuracy of 82.7% using only the erythrocyte sedimentation rate, neutrophil count, and C-reactive protein features. Our method will reduce the negative pressures on the health sector and help doctors understand the pathogenesis of COVID-19 using the key features. The method is promising for creating mobile health monitoring systems in the Internet of Things.
    TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s. (arXiv:2206.14286v1 [cs.PF])
This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms at a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model that takes into account both the memory and instruction bottlenecks. Our algorithm comes with an analytical guarantee of recall in expectation and does not require maintaining a sophisticated index data structure or tuning, making it suitable for applications with frequent updates. Our work is available in the open-source packages of JAX and TensorFlow on TPU.
    NumS: Scalable Array Programming for the Cloud. (arXiv:2206.14276v1 [cs.DC])
    Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.  ( 3 min )
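    The load-minimizing placement idea behind LSHS can be conveyed with a toy greedy sketch: place each operator on the node where the resulting maximum of simulated memory and network load is smallest. Illustrative only; the actual scheduler performs local search and also optimizes data layouts.

        # Toy load-simulated placement in the spirit of LSHS (illustrative only).
        def place_operators(ops, n_nodes):
            mem = [0.0] * n_nodes
            net = [0.0] * n_nodes
            placement = {}
            for op in ops:  # op: dict with 'name', 'mem_cost', 'net_cost'
                def load_if_placed(node):
                    return max(mem[node] + op["mem_cost"],
                               net[node] + op["net_cost"])
                node = min(range(n_nodes), key=load_if_placed)
                mem[node] += op["mem_cost"]
                net[node] += op["net_cost"]
                placement[op["name"]] = node
            return placement

        plan = place_operators(
            [{"name": f"matmul_{i}", "mem_cost": 1.0, "net_cost": 0.5}
             for i in range(8)], n_nodes=4)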
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v1 [stat.ML])
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time-series, these results were limited to data stemming from Itô diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data.
    Can Interpretable Reinforcement Learning Manage Prosperity Your Way?. (arXiv:2202.09064v2 [cs.LG] UPDATED)
    Personalisation of products and services is fast becoming the driver of success in banking and commerce. Machine learning holds the promise of gaining a deeper understanding of, and tailoring to, customers' needs and preferences. Whereas traditional solutions to financial decision problems frequently rely on model assumptions, reinforcement learning is able to exploit large amounts of data to improve customer modelling and decision-making in complex financial environments with fewer assumptions. Model explainability and interpretability present challenges from a regulatory perspective which demands transparency for acceptance; they also offer the opportunity for improved insight into and understanding of customers. Post-hoc approaches are typically used for explaining pretrained reinforcement learning models. Based on our previous modelling of customer spending behaviour, we adapt our recent reinforcement learning algorithm that intrinsically characterizes desirable behaviours, and we transition to the problem of asset management. We train inherently interpretable reinforcement learning agents to give investment advice that is aligned with prototype financial personality traits, which are combined to make a final recommendation. We observe that the trained agents' advice adheres to their intended characteristics, that they learn the value of compound growth and, without any explicit reference, the notion of risk, and that policy convergence is improved.  ( 3 min )
    A Temporal-Difference Approach to Policy Gradient Estimation. (arXiv:2202.02396v3 [cs.LG] UPDATED)
    The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.  ( 2 min )
    Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead. (arXiv:2105.09121v3 [cs.LG] UPDATED)
    Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early exiting. In this paper, we introduce a novel architecture for early exiting based on the vision transformer architecture, as well as a fine-tuning strategy that significantly increases the accuracy of early exit branches compared to conventional approaches while introducing less overhead. Through extensive experiments on image and audio classification as well as audiovisual crowd counting, we show that our method works for both classification and regression problems, and in both single- and multi-modal settings. Additionally, we introduce a novel method for integrating audio and visual modalities within early exits in audiovisual data analysis, which can lead to a more fine-grained dynamic inference.  ( 2 min )
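    The mechanics of early exiting are simple to sketch: intermediate classification branches return a prediction as soon as its confidence clears a threshold. A minimal PyTorch sketch with assumed linear exit heads; the paper attaches single-layer vision-transformer branches instead.

        # Minimal early-exit network: exit at the first sufficiently confident branch.
        import torch
        import torch.nn as nn

        class EarlyExitNet(nn.Module):
            def __init__(self, dim=64, n_classes=10, n_blocks=4, threshold=0.9):
                super().__init__()
                self.blocks = nn.ModuleList(
                    [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                     for _ in range(n_blocks)])
                self.exits = nn.ModuleList(
                    [nn.Linear(dim, n_classes) for _ in range(n_blocks)])
                self.threshold = threshold

            def forward(self, x):  # x: (dim,) -- one sample, so exits are per-sample
                for block, exit_head in zip(self.blocks, self.exits):
                    x = block(x)
                    probs = exit_head(x).softmax(dim=-1)
                    if probs.max() >= self.threshold:  # confident enough: stop here
                        return probs
                return probs                           # fall through to final exit

        out = EarlyExitNet()(torch.randn(64))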
    Linear Model Against Malicious Adversaries with Local Differential Privacy. (arXiv:2202.02448v2 [cs.CR] UPDATED)
    Scientific collaborations benefit from collaborative learning over distributed sources, but this remains difficult to achieve when data are sensitive. In recent years, privacy-preserving techniques have been widely studied to analyze distributed data across different agencies while protecting sensitive information. Most existing privacy-preserving techniques are designed to resist semi-honest adversaries and require intense computation to perform data analysis. Secure collaborative learning is significantly more difficult in the presence of malicious adversaries who may deviate from the secure protocol. Another challenge is to maintain high computational efficiency alongside privacy protection. In this paper, matrix encryption is applied to encrypt data such that the secure schemes resist malicious adversaries, including chosen-plaintext, known-plaintext, and collusion attacks. The encryption scheme also achieves local differential privacy. Moreover, cross validation is studied to prevent overfitting without additional communication cost. Empirical experiments on real-world datasets demonstrate that the proposed schemes are computationally efficient compared to existing techniques for the malicious-adversary and semi-honest models.  ( 2 min )
    Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection. (arXiv:2201.12910v2 [cs.LG] UPDATED)
    Autoencoders have been widely used as a nonlinear tool for data dimensionality reduction. While autoencoders do not utilize label information, Centroid-Encoders (CE)\cite{ghosh2022supervised} use the class label in their learning process. In this study, we propose a sparse optimization using the Centroid-Encoder architecture to determine a minimal set of features that discriminate between two or more classes. The resulting algorithm, Sparse Centroid-Encoder (SCE), extracts discriminatory features in groups using a sparsity-inducing $\ell_1$-norm while mapping a point to its class centroid. One key attribute of SCE is that it can extract informative features from a multi-modal data set, i.e., data sets whose classes appear to have multiple clusters. The algorithm is applied to a wide variety of real-world data sets, including single-cell data, high-dimensional biological data, image data, speech data, and accelerometer sensor data. We compared our method to various state-of-the-art feature selection techniques, including supervised Concrete Autoencoders (SCAE), Feature Selection Network (FsNet), deep feature selection (DFS), Stochastic Gate (STG), and LassoNet. We empirically show that SCE features often produce better classification accuracy than other methods on sequestered test sets.  ( 3 min )
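    The objective combines a centroid-reconstruction loss with an $\ell_1$ penalty that drives most input features to zero. A minimal PyTorch sketch under an assumed architecture (an elementwise input gate feeding a small MLP); the paper's exact layer sizes and training schedule differ.

        # Sketch of a Sparse Centroid-Encoder objective: map each point to its
        # class centroid while an l1 penalty on an elementwise input gate
        # performs feature selection.
        import torch
        import torch.nn as nn

        class SparseCentroidEncoder(nn.Module):
            def __init__(self, d_in, d_hidden=32):
                super().__init__()
                self.gate = nn.Parameter(torch.ones(d_in))  # feature-selection layer
                self.net = nn.Sequential(
                    nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_in))

            def forward(self, x):
                return self.net(x * self.gate)

        def sce_loss(model, x, y, centroids, lam=1e-2):
            # centroids: (n_classes, d_in) tensor of per-class training means
            target = centroids[y]                 # each point's own class centroid
            recon = model(x)
            return ((recon - target) ** 2).mean() + lam * model.gate.abs().sum()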
    Data augmentation for learning predictive models on EEG: a systematic comparison. (arXiv:2206.14483v1 [cs.LG])
    The use of deep learning for electroencephalography (EEG) classification tasks has been growing rapidly in recent years, yet its application has been limited by the relatively small size of EEG datasets. Data augmentation, which consists in artificially increasing the size of the dataset during training, has been a key ingredient in obtaining state-of-the-art performance in applications such as computer vision and speech. While a few augmentation transformations for EEG data have been proposed in the literature, their positive impact on performance across tasks remains elusive. In this work, we propose a unified and exhaustive analysis of the main existing EEG augmentations, compared in a common experimental setting. Our results highlight the best data augmentations to consider for sleep stage classification and motor imagery brain-computer interfaces, showing predictive power improvements greater than 10% in some cases.  ( 2 min )
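    Typical EEG augmentations of the kind compared in such studies are easy to state in code. Three simplified NumPy sketches follow; the parameter ranges are illustrative, not the paper's tuned values.

        # Three common EEG augmentations (simplified sketches).
        import numpy as np

        def gaussian_noise(x, sigma=0.1):            # x: (channels, timesteps)
            return x + sigma * np.random.randn(*x.shape)

        def channel_dropout(x, p=0.2):               # zero out random channels
            mask = np.random.rand(x.shape[0], 1) > p
            return x * mask

        def time_mask(x, max_len=100):               # zero out a random time span
            start = np.random.randint(0, max(1, x.shape[1] - max_len))
            out = x.copy()
            out[:, start:start + max_len] = 0.0
            return out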
    Masked World Models for Visual Control. (arXiv:2206.14244v1 [cs.RO])
    Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient robot learning from visual observations. Yet the current approaches typically train a single model end-to-end for learning both visual representations and dynamics, making it difficult to accurately model the interaction between robots and small objects. In this work, we introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. Specifically, we train an autoencoder with convolutional layers and vision transformers (ViT) to reconstruct pixels given masked convolutional features, and learn a latent dynamics model that operates on the representations from the autoencoder. Moreover, to encode task-relevant information, we introduce an auxiliary reward prediction objective for the autoencoder. We continually update both autoencoder and dynamics model using online samples collected from environment interaction. We demonstrate that our decoupling approach achieves state-of-the-art performance on a variety of visual robotic tasks from Meta-world and RLBench, e.g., we achieve 81.7% success rate on 50 visual robotic manipulation tasks from Meta-world, while the baseline achieves 67.9%. Code is available on the project website: https://sites.google.com/view/mwm-rl.  ( 2 min )
    Optimization-Induced Graph Implicit Nonlinear Diffusion. (arXiv:2206.14418v1 [cs.LG])
    Due to the over-smoothing issue, most existing graph neural networks can only capture limited dependencies with their inherently finite aggregation layers. To overcome this limitation, we propose a new kind of graph convolution, called Graph Implicit Nonlinear Diffusion (GIND), which implicitly has access to infinite hops of neighbors while adaptively aggregating features with nonlinear diffusion to prevent over-smoothing. Notably, we show that the learned representation can be formalized as the minimizer of an explicit convex optimization objective. With this property, we can theoretically characterize the equilibrium of our GIND from an optimization perspective. More interestingly, we can induce new structural variants by modifying the corresponding optimization objective. To be specific, we can embed prior properties into the equilibrium, as well as introduce skip connections to promote training stability. Extensive experiments show that GIND is good at capturing long-range dependencies, and performs well on both homophilic and heterophilic graphs with its nonlinear diffusion. Moreover, we show that the optimization-induced variants of our models can boost performance and improve training stability and efficiency as well. As a result, our GIND obtains significant improvements on both node-level and graph-level tasks.  ( 2 min )
    Signature Methods in Machine Learning. (arXiv:2206.14674v1 [stat.ML])
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and, perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular and not stationary and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponentially complex: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and where the existence of small sets of context-free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin, which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.  ( 3 min )
    Model-Based Policy Search Using Monte Carlo Gradient Estimation with Real Systems Application. (arXiv:2101.12115v3 [cs.LG] UPDATED)
    In this paper, we present a Model-Based Reinforcement Learning (MBRL) algorithm named \emph{Monte Carlo Probabilistic Inference for Learning COntrol} (MC-PILCO). The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient. This defines a framework in which we ablate the choice of the following components: (i) the selection of the cost function, (ii) the optimization of policies using dropout, (iii) improved data efficiency through the use of structured kernels in the GP models. The combination of the aforementioned aspects dramatically affects the performance of MC-PILCO. Numerical comparisons in a simulated cart-pole environment show that MC-PILCO exhibits better data efficiency and control performance w.r.t. state-of-the-art GP-based MBRL algorithms. Finally, we apply MC-PILCO to real systems, considering in particular systems with partially measurable states. We discuss the importance of modeling both the measurement system and the state estimators during policy optimization. The effectiveness of the proposed solutions has been tested in simulation and on two real systems, a Furuta pendulum and a ball-and-plate rig.  ( 3 min )
    Enabling Visual Action Planning for Object Manipulation through Latent Space Roadmap. (arXiv:2103.02554v3 [cs.RO] UPDATED)
    We present a framework for visual action planning of complex manipulation tasks with high-dimensional state spaces, focusing on manipulation of deformable objects. We propose a Latent Space Roadmap (LSR) for task planning which is a graph-based structure globally capturing the system dynamics in a low-dimensional latent space. Our framework consists of three parts: (1) a Mapping Module (MM) that maps observations given in the form of images into a structured latent space extracting the respective states as well as generates observations from the latent states, (2) the LSR which builds and connects clusters containing similar states in order to find the latent plans between start and goal states extracted by MM, and (3) the Action Proposal Module that complements the latent plan found by the LSR with the corresponding actions. We present a thorough investigation of our framework on simulated box stacking and rope/box manipulation tasks, and a folding task executed on a real robot.  ( 2 min )
    Understanding Generalization via Leave-One-Out Conditional Mutual Information. (arXiv:2206.14800v1 [cs.LG])
    We study the mutual information between (certain summaries of) the output of a learning algorithm and its $n$ training data, conditional on a supersample of $n+1$ i.i.d. data from which the training data is chosen at random without replacement. These leave-one-out variants of the conditional mutual information (CMI) of an algorithm (Steinke and Zakynthinou, 2020) are also seen to control the mean generalization error of learning algorithms with bounded loss functions. For learning algorithms achieving zero empirical risk under 0-1 loss (i.e., interpolating algorithms), we provide an explicit connection between leave-one-out CMI and the classical leave-one-out error estimate of the risk. Using this connection, we obtain upper and lower bounds on risk in terms of the (evaluated) leave-one-out CMI. When the limiting risk is constant or decays polynomially, the bounds converge to within a constant factor of two. As an application, we analyze the population risk of the one-inclusion graph algorithm, a general-purpose transductive learning algorithm for VC classes in the realizable setting. Using leave-one-out CMI, we match the optimal bound for learning VC classes in the realizable setting, answering an open challenge raised by Steinke and Zakynthinou (2020). Finally, in order to understand the role of leave-one-out CMI in studying generalization, we place leave-one-out CMI in a hierarchy of measures, with a novel unconditional mutual information at the root. For 0-1 loss and interpolating learning algorithms, this mutual information is observed to be precisely the risk.  ( 3 min )
    Meta-Learning over Time for Destination Prediction Tasks. (arXiv:2206.14801v1 [cs.LG])
    A need to understand and predict vehicles' behavior underlies both public and private goals in the transportation domain, including urban planning and management, ride-sharing services, and intelligent transportation systems. Individuals' preferences and intended destinations vary throughout the day, week, and year: for example, bars are most popular in the evenings, and beaches are most popular in the summer. Despite this principle, we note that recent studies on a popular benchmark dataset from Porto, Portugal have found, at best, only marginal improvements in predictive performance from incorporating temporal information. We propose an approach based on hypernetworks, a variant of meta-learning ("learning to learn") in which a neural network learns to change its own weights in response to an input. In our case, the weights responsible for destination prediction vary with the metadata, in particular the time, of the input trajectory. The time-conditioned weights notably improve the model's error relative to ablation studies and comparable prior work, and we confirm our hypothesis that knowledge of time should improve prediction of a vehicle's intended destination.  ( 2 min )
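    The hypernetwork construction is compact in code: a small network maps trip metadata (here, time features) to the weights of the destination-prediction head, so the head effectively changes with time. A minimal PyTorch sketch; the layer sizes and the single linear head are assumptions, not the paper's architecture.

        # Hypernetwork sketch: time features generate the prediction head's weights.
        import torch
        import torch.nn as nn

        class TimeConditionedHead(nn.Module):
            def __init__(self, d_traj=128, d_time=8, n_out=2):
                super().__init__()
                self.d_traj, self.n_out = d_traj, n_out
                # hypernetwork: metadata -> flattened weights and bias of the head
                self.hyper = nn.Linear(d_time, d_traj * n_out + n_out)

            def forward(self, traj_embedding, time_features):
                params = self.hyper(time_features)
                W = params[: self.d_traj * self.n_out].view(self.n_out, self.d_traj)
                b = params[self.d_traj * self.n_out:]
                return traj_embedding @ W.T + b   # predicted destination coordinates

        head = TimeConditionedHead()
        pred = head(torch.randn(128), torch.randn(8))   # one trajectory, one time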
    ENS-10: A Dataset For Post-Processing Ensemble Weather Forecast. (arXiv:2206.14786v1 [cs.LG])
    Post-processing ensemble prediction systems can improve weather forecasting, especially for extreme event prediction. In recent years, different machine learning models have been developed to improve the quality of the post-processing step. However, these models heavily rely on the data and generating such ensemble members requires multiple runs of numerical weather prediction models, at high computational cost. This paper introduces the ENS-10 dataset, consisting of ten ensemble members spread over 20 years (1998-2017). The ensemble members are generated by perturbing numerical weather simulations to capture the chaotic behavior of the Earth. To represent the three-dimensional state of the atmosphere, ENS-10 provides the most relevant atmospheric variables in 11 distinct pressure levels as well as the surface at 0.5-degree resolution. The dataset targets the prediction correction task at 48-hour lead time, which is essentially improving the forecast quality by removing the biases of the ensemble members. To this end, ENS-10 provides the weather variables for forecast lead times T=0, 24, and 48 hours (two data points per week). We provide a set of baselines for this task on ENS-10 and compare their performance in correcting the prediction of different weather variables. We also assess our baselines for predicting extreme events using our dataset. The ENS-10 dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence.  ( 3 min )
    Multi-scale Physical Representations for Approximating PDE Solutions with Graph Neural Operators. (arXiv:2206.14687v1 [cs.LG])
    Representing physical signals at different scales is among the most challenging problems in engineering. Several multi-scale modeling tools have been developed to describe physical systems governed by \emph{Partial Differential Equations} (PDEs). These tools sit at the crossroads of principled physical models and numerical schemes. Recently, data-driven models have been introduced to speed up the approximation of PDE solutions compared to numerical solvers. Among these recent data-driven methods, neural integral operators are a class of models that learn a mapping between function spaces. These functions are discretized on graphs (meshes), which are appropriate for modeling interactions in physical phenomena. In this work, we study three multi-resolution schemes with integral kernel operators that can be approximated with \emph{Message Passing Graph Neural Networks} (MPGNNs). To validate our study, we conduct extensive MPGNN experiments with well-chosen metrics, considering both steady and unsteady PDEs.  ( 2 min )
    Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model. (arXiv:2206.14371v1 [stat.ML])
    In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models that memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter-sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- with almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- once the published carrier model is downloaded, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and knowledge of the hidden model architecture; (iii) Effectiveness -- almost all the recovered models have performance similar to what they would have achieved if trained independently on the private data; (iv) Robustness -- information redundancy is naturally implemented to achieve resilience against common post-processing techniques applied to the carrier before its publication; (v) Covertness -- a model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.
    SPI-GAN: Distilling Score-based Generative Models with Straight-Path Interpolations. (arXiv:2206.14464v1 [cs.LG])
    Score-based generative models (SGMs) are a recently proposed paradigm for deep generative tasks and now show state-of-the-art sampling performance. It is known that the original SGM design solves the first two problems of the generative trilemma: i) sampling quality and ii) sampling diversity. However, the last problem of the trilemma remains unsolved: the training/sampling complexity of SGMs is notoriously high. To this end, distilling SGMs into simpler models, e.g., generative adversarial networks (GANs), is currently gathering much attention. We present an enhanced distillation method, called straight-path interpolation GAN (SPI-GAN), which is comparable to the state-of-the-art shortcut-based distillation method, denoising diffusion GAN (DD-GAN). Our method is an extreme case that does not use any intermediate shortcut information of the reverse SDE path, a regime in which DD-GAN fails to obtain good results. Nevertheless, our straight-path interpolation method greatly stabilizes the overall training process. As a result, SPI-GAN is one of the best models in terms of sampling quality, diversity, and time for CIFAR-10, CelebA-HQ-256, and LSUN-Church-256.
    Functional Classification of Bitcoin Addresses. (arXiv:2202.12019v2 [stat.AP] UPDATED)
    This paper proposes a classification model for predicting the main activity of bitcoin addresses based on their balances. Since the balances are functions of time, we apply methods from functional data analysis; more specifically, the features of the proposed classification model are the functional principal components of the data. Classifying bitcoin addresses is a relevant problem for two main reasons: to understand the composition of the bitcoin market, and to identify addresses used for illicit activities. Although other bitcoin classifiers have been proposed, they focus primarily on network analysis rather than curve behavior. Our approach, on the other hand, does not require any network information for prediction. Furthermore, functional features have the advantage of being straightforward to build, unlike expert-built features. Results show improvement when combining functional features with scalar features, and similar accuracy for the models using those features separately, which points to the functional model being a good alternative when domain-specific knowledge is not available.  ( 2 min )
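    The functional features described here can be obtained with a few lines of linear algebra: discretize each balance curve on a common time grid, centre, and take the leading right singular vectors as functional principal components. A generic FPCA sketch with synthetic stand-in data, not the paper's pipeline.

        # Functional PCA scores via SVD of centred, discretised balance curves,
        # fed to an off-the-shelf classifier. Synthetic stand-in data throughout.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def fpca_scores(curves, n_components=5):
            # curves: (n_addresses, n_time_points) balances on a common grid
            centred = curves - curves.mean(axis=0)
            U, S, Vt = np.linalg.svd(centred, full_matrices=False)
            return centred @ Vt[:n_components].T  # scores on leading components

        curves = np.abs(np.cumsum(np.random.randn(200, 365), axis=1))  # fake balances
        labels = np.random.randint(0, 2, size=200)                     # fake classes
        clf = LogisticRegression().fit(fpca_scores(curves), labels)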
    The split Gibbs sampler revisited: improvements to its algorithmic structure and augmented target distribution. (arXiv:2206.13894v1 [stat.CO] CROSS LISTED)
    This paper proposes a new accelerated proximal Markov chain Monte Carlo (MCMC) methodology to perform Bayesian computation efficiently in imaging inverse problems. The proposed methodology is derived from the Langevin diffusion process and stems from tightly integrating two state-of-the-art proximal Langevin MCMC samplers, SK-ROCK and split Gibbs sampling (SGS), which employ distinctively different strategies to improve convergence speed. More precisely, we show how to integrate, at the level of the Langevin diffusion process, the proximal SK-ROCK sampler which is based on a stochastic Runge-Kutta-Chebyshev approximation of the diffusion, with the model augmentation and relaxation strategy that SGS exploits to speed up Bayesian computation at the expense of asymptotic bias. This leads to a new and faster proximal SK-ROCK sampler that combines the accelerated quality of the original SK-ROCK sampler with the computational benefits of augmentation and relaxation. Moreover, rather than viewing the augmented and relaxed model as an approximation of the target model, positioning relaxation in a bias-variance trade-off, we propose to regard the augmented and relaxed model as a generalisation of the target model. This then allows us to carefully calibrate the amount of relaxation in order to simultaneously improve the accuracy of the model (as measured by the model evidence) and the sampler's convergence speed. To achieve this, we derive an empirical Bayesian method to automatically estimate the optimal amount of relaxation by maximum marginal likelihood estimation. The proposed methodology is demonstrated with a range of numerical experiments related to image deblurring and inpainting, as well as with comparisons with alternative approaches from the state of the art.
    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. (arXiv:2201.12417v2 [cs.LG] UPDATED)
    In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellations from both sides of the Bellman equation, the magnitude of the Bellman error is only weakly related to the distance to the true value function, even when considering all state-action pairs, and (2) in the finite data regime, the Bellman equation can be satisfied exactly by infinitely many suboptimal solutions. This means that the Bellman error can be minimized without improving the accuracy of the value function. We demonstrate these phenomena through a series of propositions, illustrative toy examples, and empirical analysis in standard benchmark domains.  ( 2 min )
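    The cancellation effect in point (1) is easy to see numerically for policy evaluation: since $r + \gamma P V - V = (I - \gamma P)(V^* - V)$, adding a constant $c$ to the true value function inflates the value error by $c$ while changing the Bellman error by only $(1 - \gamma)c$. A toy sketch of this effect (not the paper's experiments):

        # A constant offset of 100 yields a Bellman error of only (1-gamma)*100 = 1.
        import numpy as np

        n, gamma = 50, 0.99
        P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)  # random chain
        r = np.random.rand(n)
        v_true = np.linalg.solve(np.eye(n) - gamma * P, r)  # V* = (I - gP)^-1 r

        v = v_true + 100.0                                  # constant offset
        bellman_err = np.abs(r + gamma * P @ v - v).max()   # ~1.0
        value_err = np.abs(v - v_true).max()                # 100.0
        print(bellman_err, value_err)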
    Depth-2 Neural Networks Under a Data-Poisoning Attack. (arXiv:2005.01699v3 [cs.LG] UPDATED)
    In this work, we study the possibility of defending against data-poisoning attacks while training a shallow neural network in a regression setup. We focus on doing supervised learning for a class of depth-2 finite-width neural networks, which includes single-filter convolutional networks. In this class of networks, we attempt to learn the network weights in the presence of a malicious oracle doing stochastic, bounded and additive adversarial distortions on the true output during training. For the non-gradient stochastic algorithm that we construct, we prove worst-case near-optimal trade-offs among the magnitude of the adversarial attack, the weight approximation accuracy, and the confidence achieved by the proposed algorithm. As our algorithm uses mini-batching, we analyze how the mini-batch size affects convergence. We also show how to utilize the scaling of the outer layer weights to counter output-poisoning attacks depending on the probability of attack. Lastly, we give experimental evidence demonstrating how our algorithm outperforms stochastic gradient descent under different input data distributions, including instances of heavy-tailed distributions.  ( 2 min )
    Non-Parametric Manifold Learning. (arXiv:2107.08089v2 [math.ST] UPDATED)
    We introduce an estimator for distances in a compact Riemannian manifold M based on graph Laplacian estimates of the Laplace-Beltrami operator. We upper bound the $\ell_2$-loss for the ratio of the estimator over the true manifold distance, or more precisely an approximation of manifold distance in non-commutative geometry (cf. [Connes and van Suijlekom, 2020]), in terms of spectral errors in the graph Laplacian estimates and, implicitly, several geometric properties of the manifold. We consequently obtain a consistency result for the estimator for samples equidistributed from a strictly positive density on M and graph Laplacians which spectrally converge, in a suitable sense, to the Laplace-Beltrami operator. The estimator resembles, and in fact its convergence properties are derived from, a special case of the Kantorovich dual reformulation of Wasserstein distance known as Connes' Distance Formula.  ( 2 min )
    Forgetting Data from Pre-trained GANs. (arXiv:2206.14389v1 [cs.LG])
    Large pre-trained generative models are known to occasionally provide samples that may be undesirable for various reasons. The standard way to mitigate this is to re-train the models differently. In this work, we take a different, more compute-friendly approach and investigate how to post-edit a model after training so that it forgets certain kinds of samples. We provide three different algorithms for GANs that differ on how the samples to be forgotten are described. Extensive evaluations on real-world image datasets show that our algorithms are capable of forgetting data while retaining high generation quality at a fraction of the cost of full re-training.  ( 2 min )
    IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound. (arXiv:2206.14772v1 [cs.LG])
    Recent works have tried to increase the verifiability of adversarially trained networks by running the attacks over domains larger than the original perturbations and adding various regularization terms to the objective. However, these algorithms either underperform or require complex and expensive stage-wise training procedures, hindering their practical applicability. We present IBP-R, a novel verified training algorithm that is both simple and effective. IBP-R induces network verifiability by coupling adversarial attacks on enlarged domains with a regularization term, based on inexpensive interval bound propagation, that minimizes the gap between the non-convex verification problem and its approximations. By leveraging recent branch-and-bound frameworks, we show that IBP-R obtains state-of-the-art verified robustness-accuracy trade-offs for small perturbations on CIFAR-10 while training significantly faster than relevant previous work. Additionally, we present UPB, a novel branching strategy that, relying on a simple heuristic based on $\beta$-CROWN, reduces the cost of state-of-the-art branching algorithms while yielding splits of comparable quality.  ( 2 min )
    Bayesian Structure Learning with Generative Flow Networks. (arXiv:2202.13903v2 [cs.LG] UPDATED)
    In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.  ( 2 min )
    Towards Robust Waveform-Based Acoustic Models. (arXiv:2110.08634v2 [cs.SD] UPDATED)
    We study the problem of learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. First, we characterize data augmentation theoretically as an instance of vicinal risk minimization, which aims at improving risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We then specify the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus on the waveform-based setting. Our empirical results show that the approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances.  ( 3 min )
    Treatment Effect Estimation from Observational Network Data using Augmented Inverse Probability Weighting and Machine Learning. (arXiv:2206.14591v1 [stat.ME])
    Causal inference methods for treatment effect estimation usually assume independent experimental units. However, this assumption is often questionable because experimental units may interact. We develop augmented inverse probability weighting (AIPW) for estimation and inference of causal treatment effects on dependent observational data. Our framework covers very general cases of spillover effects induced by units interacting in networks. We use plugin machine learning to estimate infinite-dimensional nuisance components leading to a consistent treatment effect estimator that converges at the parametric rate and asymptotically follows a Gaussian distribution.  ( 2 min )
    When Do Extended Physics-Informed Neural Networks (XPINNs) Improve Generalization?. (arXiv:2109.09444v5 [cs.LG] UPDATED)
    Physics-informed neural networks (PINNs) have become a popular choice for solving high-dimensional partial differential equations (PDEs) due to their excellent approximation power and generalization ability. Recently, Extended PINNs (XPINNs) based on domain decomposition methods have attracted considerable attention due to their effectiveness in modeling multiscale and multiphysics problems and their parallelization. However, theoretical understanding of their convergence and generalization properties remains unexplored. In this study, we take an initial step towards understanding how and when XPINNs outperform PINNs. Specifically, for general multi-layer PINNs and XPINNs, we first provide a prior generalization bound via the complexity of the target functions in the PDE problem, and a posterior generalization bound via the posterior matrix norms of the networks after optimization. Moreover, based on our bounds, we analyze the conditions under which XPINNs improve generalization. Concretely, our theory shows that the key building block of XPINN, namely the domain decomposition, introduces a tradeoff for generalization. On the one hand, XPINNs decompose the complex PDE solution into several simple parts, which decreases the complexity needed to learn each part and boosts generalization. On the other hand, decomposition leads to less training data being available in each subdomain, and hence such models are typically prone to overfitting and may become less generalizable. Empirically, we choose five PDEs to show when XPINNs perform better than, similar to, or worse than PINNs, hence demonstrating and justifying our new theory.  ( 3 min )
    MurTree: Optimal Classification Trees via Dynamic Programming and Search. (arXiv:2007.12652v4 [cs.LG] UPDATED)
    Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.  ( 3 min )
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v1 [cs.LG])
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.  ( 3 min )
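    A pruning metric in the spirit of the self-supervised one described above can be sketched in a few lines: embed the training examples, cluster the embeddings with k-means, and rank each example by its distance to the nearest centroid (near the centroid = easy/prototypical, far = hard). The sketch below uses random stand-in embeddings; the paper computes them with a pretrained self-supervised model.

        # Self-supervised-style pruning metric: distance to nearest k-means centroid.
        import numpy as np
        from sklearn.cluster import KMeans

        def prune_scores(embeddings, n_clusters=100):
            km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
            return np.linalg.norm(
                embeddings - km.cluster_centers_[km.labels_], axis=1)

        emb = np.random.randn(1000, 64)              # stand-in for real embeddings
        keep = np.argsort(prune_scores(emb))[-800:]  # keep the 80% hardest examples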
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v2 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors: $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the overparametrized region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Score Matching for Truncated Density Estimation on a Manifold. (arXiv:2206.14668v1 [stat.ME])
    When observations are truncated, we are limited to an incomplete picture of our dataset. Recent methods deal with truncated density estimation problems by turning to score matching, where access to the intractable normalising constant is not required. We present a novel extension of truncated score matching to a Riemannian manifold. Applications are presented for the von Mises-Fisher and Kent distributions on a two-dimensional sphere in $\mathbb{R}^3$, as well as a real-world application to extreme storm observations in the USA. In simulated data experiments, our score matching estimator is able to approximate the true parameter values with a low estimation error and shows improvements over a maximum likelihood estimator.  ( 2 min )
    Cyclical Kernel Adaptive Metropolis. (arXiv:2206.14421v1 [cs.LG])
    We propose cKAM, cyclical Kernel Adaptive Metropolis, which incorporates a cyclical stepsize scheme to allow control over exploration and sampling. We show that on a crafted bimodal distribution, existing Adaptive Metropolis-type algorithms fail to converge to the true posterior distribution. We point out that this is because adaptive samplers estimate the local/global covariance structure using the past history of the chain, which leads to adaptive algorithms becoming trapped in a local mode. We demonstrate that cKAM encourages exploration of the posterior distribution and allows the sampler to escape from a local mode, while maintaining the high performance of adaptive methods.  ( 2 min )
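    A cyclical stepsize schedule of the kind cKAM couples with adaptive Metropolis is a one-line function: large steps early in each cycle drive exploration between modes, small steps late in the cycle allow local sampling. The cosine form and constants below are illustrative assumptions.

        # Cyclical stepsize schedule (illustrative; cKAM's exact schedule may differ).
        import numpy as np

        def cyclical_stepsize(t, n_iters, n_cycles=4, eps_max=1.0, eps_min=0.01):
            cycle_len = n_iters // n_cycles
            phase = (t % cycle_len) / cycle_len        # position within the cycle
            return eps_min + 0.5 * (eps_max - eps_min) * (1 + np.cos(np.pi * phase))

        steps = [cyclical_stepsize(t, 1000) for t in range(1000)]  # 4 cosine cycles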
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v1 [cs.LG])
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantees that these methods learn invariant mechanisms are lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria for judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.  ( 2 min )
    Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution. (arXiv:2009.14108v2 [cs.LG] UPDATED)
    Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only few episodes with high rewards are available as demonstrations since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and, thereby, drastically improves learning on few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY  ( 2 min )
    Can Push-forward Generative Models Fit Multimodal Distributions?. (arXiv:2206.14476v1 [stat.ML])
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are the Variational Autoencoders and the Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets, and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.  ( 2 min )
    An Auto-Regressive Formulation for Smoothing and Moving Mean with Exponentially Tapered Windows. (arXiv:2206.14749v1 [cs.LG])
    We investigate an auto-regressive formulation for the problem of smoothing time-series by manipulating the inherent objective function of the traditional moving mean smoothers. Not only do the auto-regressive smoothers enforce a higher degree of smoothing, they are also just as efficient as the traditional moving means and can be optimized accordingly with respect to the input dataset. Interestingly, the auto-regressive models result in moving means with exponentially tapered windows.  ( 2 min )
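    The last observation can be checked directly: the AR(1) smoother $y_t = (1-a)\,y_{t-1} + a\,x_t$ unrolls into a moving mean whose lag-$k$ weight is $a(1-a)^k$, i.e., an exponentially tapered window. The sketch below verifies the recursion against the explicit tapered-window form (a demonstration of the stated property, not the paper's optimized smoother).

        # AR(1) smoothing equals a moving mean with exponentially tapered weights.
        import numpy as np

        def ar_smooth(x, a=0.2):
            y = np.empty_like(x, dtype=float)
            y[0] = x[0]
            for t in range(1, len(x)):
                y[t] = (1 - a) * y[t - 1] + a * x[t]
            return y

        x, a = np.random.randn(500), 0.2
        T = len(x) - 1
        w = a * (1 - a) ** np.arange(T + 1)   # weight on x_{T-k} is a*(1-a)^k
        w[-1] = (1 - a) ** T                  # initial condition keeps the rest
        assert np.allclose(np.sum(w * x[::-1]), ar_smooth(x, a)[-1])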
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v2 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 2 min )
    Approximate Data Deletion in Generative Models. (arXiv:2206.14439v1 [cs.LG])
    Users have the right to have their data deleted by third-party learned systems, as codified by recent legislation such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Such data deletion can be accomplished by full re-training, but this incurs a high computational cost for modern machine learning models. To avoid this cost, many approximate data deletion methods have been developed for supervised learning. Unsupervised learning, in contrast, remains largely an open problem when it comes to (approximate or exact) efficient data deletion. In this paper, we propose a density-ratio-based framework for generative models. Using this framework, we introduce a fast method for approximate data deletion and a statistical test for estimating whether or not training points have been deleted. We provide theoretical guarantees under various learner assumptions and empirically demonstrate our methods across a variety of generative methods.  ( 2 min )
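    The density-ratio ingredient can be sketched with the standard probabilistic-classification trick: train a classifier to distinguish samples representing the full dataset from samples representing the dataset after deletion, and convert its predicted probabilities into a ratio estimate. This is a generic sketch under assumed inputs, not the paper's specific deletion procedure.

        # Classifier-based density-ratio estimate r(x) ~ p_full(x) / p_retained(x).
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def density_ratio(samples_full, samples_retained):
            X = np.vstack([samples_full, samples_retained])
            z = np.r_[np.ones(len(samples_full)), np.zeros(len(samples_retained))]
            clf = LogisticRegression(max_iter=1000).fit(X, z)

            def ratio(x):
                p = clf.predict_proba(x)[:, 1]
                prior = len(samples_retained) / len(samples_full)  # imbalance fix
                return prior * p / (1 - p)

            return ratio

        r = density_ratio(np.random.randn(500, 4), np.random.randn(400, 4))
        weights = r(np.random.randn(10, 4))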
    Open Problem: Properly learning decision trees in polynomial time?. (arXiv:2206.14431v1 [cs.DS])
    The authors recently gave an $n^{O(\log\log n)}$ time membership query algorithm for properly learning decision trees under the uniform distribution (Blanc et al., 2021). The previous fastest algorithm for this problem ran in $n^{O(\log n)}$ time, a consequence of Ehrenfeucht and Haussler (1989)'s classic algorithm for the distribution-free setting. In this article we highlight the natural open problem of obtaining a polynomial-time algorithm, discuss possible avenues towards obtaining it, and state intermediate milestones that we believe are of independent interest.  ( 2 min )
    A Perturbation Bound on the Subspace Estimator from Canonical Projections. (arXiv:2206.14278v1 [stat.ML])
    This paper derives a perturbation bound on the optimal subspace estimator obtained from a subset of its canonical projections contaminated by noise. This fundamental result has important implications in matrix completion, subspace clustering, and related problems.  ( 2 min )
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v1 [cs.LG])
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a {\em Markov chain}, and we can only choose them by selecting a {\em policy} controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- \textsc{markov-design} -- that efficiently selects policies whose measurement allocation \emph{provably converges to the optimal one}. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.  ( 2 min )
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v1 [stat.ML])
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.  ( 2 min )
    Matryoshka: Stealing Functionality of Private ML Data by Hiding Models in Model. (arXiv:2206.14371v1 [stat.ML])
    In this paper, we present a novel insider attack called Matryoshka, which employs an irrelevant scheduled-to-publish DNN model as a carrier model for covert transmission of multiple secret models which memorize the functionality of private ML data stored in local data centers. Instead of treating the parameters of the carrier model as bit strings and applying conventional steganography, we devise a novel parameter sharing approach which exploits the learning capacity of the carrier model for information hiding. Matryoshka simultaneously achieves: (i) High Capacity -- With almost no utility loss of the carrier model, Matryoshka can hide a 26x larger secret model or 8 secret models of diverse architectures spanning different application domains in the carrier model, neither of which can be done with existing steganography techniques; (ii) Decoding Efficiency -- once the published carrier model is downloaded, an outside colluder can exclusively decode the hidden models from the carrier model with only several integer secrets and the knowledge of the hidden model architecture; (iii) Effectiveness -- Moreover, almost all the recovered models have similar performance as if they were trained independently on the private data; (iv) Robustness -- Information redundancy is naturally implemented to achieve resilience against common post-processing techniques on the carrier before its publishing; (v) Covertness -- A model inspector with different levels of prior knowledge could hardly differentiate a carrier model from a normal model.  ( 3 min )
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v1 [stat.ML])
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time-series, these results were limited to data stemming from Itô diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data.  ( 2 min )
    Target alignment in truncated kernel ridge regression. (arXiv:2206.14255v1 [cs.LG])
    Kernel ridge regression (KRR) has recently attracted renewed interest due to its potential for explaining the transient effects, such as double descent, that emerge during neural network training. In this work, we study how the alignment between the target function and the kernel affects the performance of KRR. We focus on the truncated KRR (TKRR) which utilizes an additional parameter that controls the spectral truncation of the kernel matrix. We show that for polynomial alignment, there is an \emph{over-aligned} regime, in which TKRR can achieve a faster rate than what is achievable by full KRR. The rate of TKRR can improve all the way to the parametric rate, while that of full KRR is capped at a sub-optimal value. This shows that target alignment can be better leveraged by utilizing spectral truncation in kernel methods. We also consider the bandlimited alignment setting and show that the regularization surface of TKRR can exhibit transient effects including multiple descent and non-monotonic behavior. Our results show that there is a strong and quantifiable relation between the shape of the \emph{alignment spectrum} and the generalization performance of kernel methods, both in terms of rates and in finite samples.  ( 2 min )
    Intrinsic Anomaly Detection for Multi-Variate Time Series. (arXiv:2206.14342v1 [cs.LG])
    We introduce a novel, practically relevant variation of the anomaly detection problem in multi-variate time series: intrinsic anomaly detection. It appears in diverse practical scenarios ranging from DevOps to IoT, where we want to recognize failures of a system that operates under the influence of a surrounding environment. Intrinsic anomalies are changes in the functional dependency structure between time series that represent an environment and time series that represent the internal state of a system that is placed in said environment. We formalize this problem, provide under-studied public and new purpose-built data sets for it, and present methods that handle intrinsic anomaly detection. These address the shortcoming of existing anomaly detection methods that cannot differentiate between expected changes in the system's state and unexpected ones, i.e., changes in the system that deviate from the environment's influence. Our most promising approach is fully unsupervised and combines adversarial learning and time series representation learning, thereby addressing problems such as label sparsity and subjectivity, while making it possible to navigate and improve notoriously problematic anomaly detection data sets.  ( 2 min )
    No imputation without representation. (arXiv:2206.14254v1 [cs.LG])
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.  ( 3 min )
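    A minimal scikit-learn sketch of the paper's recommended default (mean imputation plus missing-indicators), using the library's built-in support; the toy matrix is illustrative.

        import numpy as np
        from sklearn.impute import SimpleImputer

        X = np.array([[1.0, np.nan],
                      [2.0, 3.0],
                      [np.nan, 5.0]])

        # Mean imputation with missing-indicators appended as extra columns,
        # the combination the paper recommends as a safe default.
        imputer = SimpleImputer(strategy="mean", add_indicator=True)
        X_imputed = imputer.fit_transform(X)
        # Columns: [imputed feature 1, imputed feature 2, indicator 1, indicator 2]
        print(X_imputed)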

  • Open

    [P] Some best-practice questions about my first project; predicting how much I will enjoy backpacking different trails
    I like to go backpacking (multi-day hikes) and I want to build a model to predict how much I will enjoy the trails on my watchlist. I understand it is silly to predict how I will subjectively experience something, but it seems fun to see what gets spat out. I just have some questions on best practice. This was the best dataset I could find. It isn't perfect. I mainly hike in Canada and I only care about backpacking trails, whereas this dataset covers trails in the USA and only ~800/3000 are backpacking trails. From it I can get the following features: latitude, longitude, length, elevation gain, route type, waterfall (boolean), lake (boolean), river (boolean), forest (boolean), cave (boolean), backpacking (boolean), and rating (according to AllTrails.com; this is what I will be predicting for my watchlist trails). Another problem with this dataset is with the 5 boolean traits (waterfall, lake, river, forest, cave): if it is unknown whether a trail qualifies for any of these traits, the trait will be set to false. Also, the rating values have been rounded to the nearest 0.5 (on a scale from 0-5). I just have to make the best of it, I couldn't find a better dataset. The plan is to personalize the model to me. I'm going to add my completed trails to the dataset and give each a personal rating. Then I'll add a new feature, called something like "isMe", which will be 1 for my trails and 0 otherwise. Now, time for questions: Does it make sense to use latitude and longitude when I don't hike in the area covered by the dataset? Should I cut the ~2200/3000 rows from the dataset that aren't backpacking trails, since I only want to predict the rating for backpacking trails? Since the rating values have been binned, would that mean I am predicting a category or a numerical value? These are only the questions I can think to ask. Feel free to hit me with any other pointers you have to make this silly model as accurate as it can be! submitted by /u/JamesonLKJ [link] [comments]  ( 86 min )
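    One possible reading of the setup, as a hedged sketch: treat the binned rating as a numeric regression target (one common answer to the third question), drop latitude/longitude since the regions don't overlap, keep only backpacking rows, and add the "isMe" flag. The toy frames and column names below are stand-ins for the real dataset.

        import pandas as pd
        from sklearn.ensemble import RandomForestRegressor

        # Toy frames standing in for the AllTrails data and my own rated hikes.
        usa = pd.DataFrame({
            "length": [12.0, 30.5, 8.2], "elevation_gain": [300, 1200, 150],
            "waterfall": [1, 0, 0], "lake": [0, 1, 0], "backpacking": [1, 1, 0],
            "rating": [4.0, 4.5, 3.5],
        })
        mine = pd.DataFrame({
            "length": [25.0], "elevation_gain": [900], "waterfall": [1],
            "lake": [1], "backpacking": [1], "rating": [5.0],
        })

        usa["isMe"], mine["isMe"] = 0, 1                  # the personalization flag
        data = pd.concat([usa, mine], ignore_index=True)
        data = data[data["backpacking"] == 1]             # keep backpacking trails only

        features = ["length", "elevation_gain", "waterfall", "lake", "isMe"]
        model = RandomForestRegressor(random_state=0).fit(data[features], data["rating"])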
    [P] Neural Network Steganography (implementation) - Hiding secrets and malicious software in any neural network
    I saw a paper called EvilModel on how to hide malicious code in a neural network, since we have thousands or millions of parameters that we can alter. The basic technique modifies the float32 parameter values (and can be adapted to float16) by overwriting the fraction bits, or part of the fraction. Post/Tutorial on the process GitHub repo for the project EvilModel paper As I saw in my experiments, we could easily hide megabytes of code in a simple ResNet50 and get away with it. A well-trained (and generalized) network should not degrade in performance significantly. The testing of that is planned for a future post. Also, this method could be used for watermarking neural network weights, which could help with copyright claims (e.g.: someone is using your open-sourced (and appropriately licensed) weights out of the box in a commercial product) submitted by /u/gabegabe6 [link] [comments]  ( 89 min )
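    A toy numpy illustration of the core trick (overwriting low-order fraction bits of float32 weights); this is a simplification for intuition, not the post's or EvilModel's actual embedding scheme.

        import numpy as np

        def hide_bits(weights, payload_bytes, n_bits=8):
            """Overwrite the n_bits lowest fraction bits of each float32
            weight with payload bits. Low-order mantissa bits barely move
            the value, so a well-generalized network loses little accuracy."""
            as_int = weights.view(np.uint32).copy()
            mask = np.uint32((1 << n_bits) - 1)
            for i, chunk in enumerate(payload_bytes):
                as_int[i] = (as_int[i] & ~mask) | np.uint32(chunk)
            return as_int.view(np.float32)

        def extract_bits(weights, n_payload, n_bits=8):
            as_int = weights.view(np.uint32)
            mask = np.uint32((1 << n_bits) - 1)
            return [int(v & mask) for v in as_int[:n_payload]]

        w = np.random.randn(100).astype(np.float32)
        secret = [0xDE, 0xAD, 0xBE, 0xEF]            # payload, 8 bits per weight
        w_stego = hide_bits(w, secret)
        assert extract_bits(w_stego, len(secret)) == secret
        print(np.max(np.abs(w - w_stego)))           # perturbation is tiny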
    [D] Training GANs with non-square images
    I am planning to train stylegan2 ada with rectangular images (aspect ratio = 16:9). Is it better to use (zero) padding, resizing, or to train a rectangular GAN? Thank you very much! submitted by /u/antarfrica [link] [comments]  ( 84 min )
    [D] Mixed Precision Training: Difference between BF16 and FP16
    What differences in model performance, speed, memory etc. can I expect between choosing BF16 or FP16 for mixed precision training? Is BF16 faster, or does it consume less memory, given that I have seen people say it is "more suitable for Deep Learning"? Why is that the case? submitted by /u/optimized-adam [link] [comments]  ( 87 min )
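    Not an authoritative answer, but a minimal PyTorch sketch of the practical difference (assuming a CUDA device with bf16 support): fp16's narrow exponent range makes gradient underflow likely, so it is usually paired with loss scaling, while bf16 keeps fp32's exponent range (at the cost of mantissa precision) and typically needs no scaler.

        import torch

        model = torch.nn.Linear(512, 512).cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        data = torch.randn(32, 512, device="cuda")

        use_bf16 = True
        dtype = torch.bfloat16 if use_bf16 else torch.float16
        # GradScaler is only needed for fp16's narrow exponent range;
        # it is a pass-through when disabled for bf16.
        scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

        with torch.autocast(device_type="cuda", dtype=dtype):
            loss = model(data).pow(2).mean()

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()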
    [D] AI & Big Data Expo; worth it?
    Interested in AI/Machine learning research, hoping to check out their NA expo to learn more. Has anyone here ever been to one of their conventions? What were your experiences like? submitted by /u/nyxrat [link] [comments]  ( 84 min )
    [R] Use pretrained GANs and image classifier to generate images of the class
    Pretrained GANs and CLIP embeddings have been used to create images from arbitrary captions, by backpropagating the CLIP similarity of the caption and the generated image down to the generator's input noise. I am thinking of something simpler, where I would take a pretrained GAN and backpropagate through some pretrained classifier (e.g. ImageNet) down to the input noise of the generator to generate images of a chosen class. Is there any reference that does that? And more generally, I want to understand why this approach works - simply backpropagating the classifier loss to the image (and not through the generator) typically results in DeepDream-style weird images. Why does this not happen when using a generator? Is it simply because the output of the generator lives on the manifold of "real" images? Is there more to it? Thanks in advance submitted by /u/ml_rl_questions [link] [comments]  ( 86 min )
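    A hedged sketch of the procedure described; the frozen generator and classifier below are tiny stand-ins (in practice they would be pretrained models, e.g. a GAN generator and an ImageNet classifier). Optimizing the latent rather than the pixels keeps the output on the generator's learned manifold, which is one plausible reason DeepDream-style artifacts don't appear.

        import torch

        # Stand-ins for a pretrained generator and classifier, both frozen.
        z_dim, n_classes = 16, 10
        G = torch.nn.Sequential(torch.nn.Linear(z_dim, 64), torch.nn.Tanh())
        C = torch.nn.Sequential(torch.nn.Linear(64, n_classes))
        for p in list(G.parameters()) + list(C.parameters()):
            p.requires_grad_(False)

        target_class = 3
        z = torch.randn(1, z_dim, requires_grad=True)   # only the latent is optimized
        opt = torch.optim.Adam([z], lr=0.05)

        for _ in range(200):
            opt.zero_grad()
            img = G(z)                                  # output stays on G's manifold
            loss = -C(img)[0, target_class]             # maximize the target-class logit
            loss.backward()                             # gradients flow through C and G to z
            opt.step()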
    [D] What are the lessons learned in the preparations of the dataset you will use to train a GANs?
    Hello friends, what are the key points we should pay attention to in the datasets we prepare for GANs? Do you have any suggestions? For example: what should the distribution of the dataset look like, should all the images be the same size, and how important is it to resize them all to that size? There are probably many things I have not thought of at the moment. What are your recommendations? submitted by /u/metover [link] [comments]  ( 84 min )
    [P] Unofficial Gato in TensorFlow
    https://github.com/OrigamiDream/gato I am building an imitation of DeepMind's Gato in TensorFlow. All necessary layers have been completely implemented. However, I have no idea how to map out the training strategy, and I do not have enough datasets for this. The model seems impossible to train end-to-end because of its conditional and selective tokenizer and embeddings, and its differentiable programming. If you are interested in this project, add a star and enable notifications on the repository for further updates. And if you want to contribute to this project, please create a relevant issue or pull request. Thank you. submitted by /u/AvisStudio [link] [comments]  ( 85 min )
    What is the essence of Diffusion models? [D]
    Coming from a math/stats background, I find that the point of much of the machine learning literature can take time to understand fully. In particular I have a couple of quick (interconnected) questions regarding the essence of Diffusion models that I hope somebody may answer (of the many blog posts I have read, I can't seem to find a clear answer). As a reference let me take the seminal paper of Ho et al. https://arxiv.org/abs/2006.11239 When fixing the coefficients $\beta_1, \dots, \beta_T$ that govern the forward diffusion process (treating them as hyperparameters), can't we, at least in simple cases, already recover the reverse diffusion process in closed form? If yes, why do we even need to find the reverse diffusion process through an optimization procedure when we already have it in closed form? I have read that diffusion models should perform a dimensionality reduction on the data but, even understanding the mathematics, I can't understand how the dimensionality reduction is being achieved by learning the reverse process. What is the usefulness behind the whole procedure? If the forward process converges to an isotropic Gaussian (it destroys all the structure in the data), how can we hope to learn anything significant from it if it becomes simply a bunch of noise? (I suspect that the answer to this question is that we always stop the forward process before it reaches its limit.) Thanks to anyone that can clear up these doubts of mine. submitted by /u/Mon0o0 [link] [comments]  ( 85 min )
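    On the first question, a short recap of the relevant closed forms from Ho et al. may help: what is tractable in closed form is the reverse step conditioned on the clean sample $x_0$, not the reverse kernel itself.

        % Forward kernel and standard quantities (Ho et al., 2020):
        q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),
        \qquad \alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \textstyle\prod_{s=1}^{t} \alpha_s.

        % Conditioned on x_0, the reverse step is a closed-form Gaussian:
        q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big),
        \quad
        \tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
                      + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
        \qquad
        \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.

    The unconditional reverse kernel $q(x_{t-1} \mid x_t)$ requires marginalizing $x_0$ over the unknown data distribution, and that marginal is what the learned $p_\theta(x_{t-1} \mid x_t)$ approximates.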
  • Open

    The Persistence Problem: Lessons learned from illustrating a children's book with GPT-3 and crAIyon.
    submitted by /u/laul_pogan [link] [comments]  ( 83 min )
    "A magical forest full of colourful mushrooms" 🍄 Created on Pixelz.ai
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Brain Power Level AI Supercomputer With 174 Trillion Parameters | AI Robot Arm Learns By Vision | New System To Train Autonomous Vehicles | Brain Tumor Detection AI Outperforms Humans
    submitted by /u/getrich_or_diemining [link] [comments]  ( 83 min )
    Generating Children's Stories Using GPT-3 and DALL·E
    submitted by /u/BB4evaTB12 [link] [comments]  ( 82 min )
    AI benchmark MLPerf: Nvidia dominates, but Graphcore establishes itself
    submitted by /u/much_successes [link] [comments]  ( 82 min )
    Who needs midjourney invites
    Recently got more added; hmu if you need one submitted by /u/Chemical-Exchange466 [link] [comments]  ( 83 min )
    A Step-by-Step Walkthrough Neural Networks for Time-series Forecasting
    submitted by /u/lucapiccinelli [link] [comments]  ( 82 min )
    Generating "Levels" from data and rules using Artificial Intelligence?
    What is the best approach to creating a video game level (for simplicity's sake, just a list of positions/vectors) based on a database of already existing levels and a set of constraints? My biggest problem is creating an AI that has no input layer and also a variable-length output. If you have any ideas, please let me know (: submitted by /u/iLoveNintend0 [link] [comments]  ( 83 min )
    45 worked examples in machine learning (energy, medicine, banking, retail, physics, finance...)
    submitted by /u/datapablo [link] [comments]  ( 82 min )
    Tutorial Warp/Flow
    Just a basic tutorial on using a starting video, and a demo of the warp/flow. I'm also working on upscaling some videos soon that will use both warp/flow and 3D animation; they are looking cool so far. https://www.youtube.com/watch?v=VN6dgVjzOq0 https://preview.redd.it/eul1u8v33j891.jpg?width=1920&format=pjpg&auto=webp&s=0810a191105c52bfa0d04923c9e1ba0b366940a8 submitted by /u/prfitofthesngularity [link] [comments]  ( 83 min )
    Advanced Endpoint Intelligence
    submitted by /u/Peter909098 [link] [comments]  ( 82 min )
    [P] Open source that takes as input a deep learning model and outputs a version that runs faster in inference. Now faster and easier to use (New release)
    nebullvm is an open-source library that takes an AI model as input and outputs an optimized version that runs much faster on your hardware, usually achieving 2 to 5 times faster inference without losing accuracy (benchmarks below for Option A), or even more if you specify that you are willing to sacrifice some accuracy for a lighter model with even lower latency, using compression techniques (Option B, leveraging multiple quantization methods [1], soon also pruning [2] and more). https://github.com/nebuly-ai/nebullvm nebullvm now also supports PyTorch and TensorFlow backends that, together with the already supported deep learning compilers (including ONNX Runtime [3], TensorRT [4], OpenVINO [5], Apache TVM [6]), will optimize how your model is mapped to your hardware. Together these techniques allow nebullvm to explore more paths and find the best way to make the most of your hardware's computing capabilities, making inference as fast as it can run. You can run nebullvm in just a few lines of code, and after many requests from users, I simplified the installation of these deep learning compilers. In addition to the option of installing all compilers with a single command, it is now possible to skip the installation and pull Docker images with the compilers already preinstalled. Discover more here. Many more releases are on the way. And if you have questions, ideas and product suggestions, I'm more than happy to discuss them here! And don't forget to leave a small star for all the open-source work to make DL optimization techniques more accessible :) https://preview.redd.it/h9rshzajhh891.png?width=1480&format=png&auto=webp&s=e4d213434a6b1f949751c4b423fe3bc581a1977d [1] Quantization. Techniques and Concept Map. [2] Pruning. Techniques and Concept Map. [3] ONNX Runtime [4] Nvidia TensorRT [5] Intel OpenVINO [6] Apache TVM submitted by /u/emilec___ [link] [comments]  ( 84 min )
    online furniture buying idea: is there a way to roughly estimate the length, width, and depth of a room just by taking a photo of it?
    submitted by /u/wilsonckao [link] [comments]  ( 83 min )
    Are there any really good story AI's?
    I have tried a few, but most seem very random and unable to really make an OK story. I'm looking for an AI that could maybe be used to get a story started? Maybe AI has just not reached the point where it can do this yet? submitted by /u/ryan7251 [link] [comments]  ( 83 min )
    Disco DIffusion Warp
    I am going to be posting a video later looking at Disco Diffusion's Warp/Flow along with a basic tutorial on using init videos. Here are a couple of stills from 2 of the videos and some weekly images, all created with Disco Diffusion. https://preview.redd.it/eu3esyv9hg891.png?width=2560&format=png&auto=webp&s=36556998464bbf57694b1be9782632902856fd51 https://preview.redd.it/e82ynuv9hg891.png?width=2560&format=png&auto=webp&s=6dbbf93362a81d4ce897e61e316c8eb68abc7d95 https://preview.redd.it/b9fehwv9hg891.png?width=1920&format=png&auto=webp&s=fc70a0fa607dd4125dd47cce40b4fd820acb9640 submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
  • Open

    Use a custom image to bring your own development environment to RStudio on Amazon SageMaker
    RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench in cloud. You can quickly launch the familiar RStudio integrated development environment (IDE), and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale. RStudio on […]  ( 11 min )
    Text classification for online conversations with machine learning on AWS
    Online conversations are ubiquitous in modern life, spanning industries from video games to telecommunications. This has led to an exponential growth in the amount of online conversation data, which has helped in the development of state-of-the-art natural language processing (NLP) systems like chatbots and natural language generation (NLG) models. Over time, various NLP techniques for […]  ( 11 min )
    Hyperparameter optimization for fine-tuning pre-trained transformer models from Hugging Face
    Large attention-based transformer models have obtained massive gains on natural language processing (NLP). However, training these gigantic networks from scratch requires a tremendous amount of data and compute. For smaller NLP datasets, a simple yet effective strategy is to use a pre-trained transformer, usually trained in an unsupervised fashion on very large datasets, and fine-tune […]  ( 7 min )
    Diagnose model performance before deployment for Amazon Fraud Detector
    With the growth in adoption of online applications and the rising number of internet users, digital fraud is on the rise year over year. Amazon Fraud Detector provides a fully managed service to help you better identify potentially fraudulent online activities using advanced machine learning (ML) techniques, and more than 20 years of fraud detection […]  ( 17 min )
  • Open

    Three Ways to Build Machine Learning Models in Keras
    If you’ve looked at Keras models on Github, you’ve probably noticed that there are some different ways to create models in Keras. There’s the Sequential model which allows you to define an entire model in a single line, usually with some line breaks for readability, then there’s the functional interface that allows for more complicated […] The post Three Ways to Build Machine Learning Models in Keras appeared first on Machine Learning Mastery.  ( 24 min )
  • Open

    What are the top journals for reinforcement learning?
    Hello, I was searching for journals dedicated only to reinforcement learning. To my disappointment, I found none. I expected there would be one or two, as there are some journals that focus on neural networks. May I ask for some recommendations on journals for RL? I want to read about the state of the art and get some ideas for my research. I plan to publish my research at the end of the year. It is a study on state representations and models for optimizing an n-step process. So far I have found that a similar approach has been published in the IEEE. Would you be kind enough to recommend me some journals for RL? Is there any ranking that shows the difficulty of publishing in each of the journals? I am new to publishing. Thanks in advance. submitted by /u/ElvishChampion [link] [comments]  ( 83 min )
    Any academic source about Q-table sizes
    Can anyone point me to a source that talks about reasonable table sizes for Q-table learning? I see comments from the experience of people implementing it, but I want to cite an academic source that talks about it. I used a Q-table in my work and the size is reasonable, but I need to cite a source to support my argument. And all the papers I see only talk about the curse of dimensionality and move on to discussing deep neural nets. submitted by /u/Simple-Soil-230 [link] [comments]  ( 85 min )
    Inverted pendulum: How to weight the features?
    The game state of the inverted pendulum problem consists of four variables: cart position, cart velocity, pole angle and pole angular velocity. To determine the cost of the current state, the variables have to be aggregated into a single evaluation function. The problem is that it's possible to weight each feature differently. So the question is whether the cart's position is more important than the pole's angle. submitted by /u/ManuelRodriguez331 [link] [comments]  ( 85 min )
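    There is no single right weighting, but a standard LQR-style choice is a weighted quadratic cost over the state. The weights below are purely illustrative; the pole angle is typically weighted highest, since keeping the pole upright is the primary objective and the cart position a secondary one.

        import numpy as np

        weights = np.array([0.1,   # cart position
                            0.01,  # cart velocity
                            1.0,   # pole angle
                            0.1])  # pole angular velocity

        def cost(state):
            # Weighted quadratic cost: c(s) = sum_i w_i * s_i**2
            return float(np.dot(weights, np.square(state)))

        print(cost(np.array([0.5, 0.0, 0.05, 0.0])))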
    Continuous action probability calculation in policy gradient
    Hi, I wonder how we can treat the y-value of a Gaussian density at an action x as a probability. I understand that the y-value of a pdf is not a probability; only integration over an interval gives a probability. Can someone explain this? Thanks

        import torch
        from torch.distributions import Normal

        mean, std = 0.0, 1.0
        dist = Normal(mean, std)
        sample = torch.tensor(0.0)
        logprob = dist.log_prob(sample)    # log-density at the sampled action
        print(logprob.exp())               # tensor(0.3989)

        import math

        def normpdf(x, mean, sd):
            # https://stackoverflow.com/questions/12412895/how-to-calculate-probability-in-a-normal-distribution-given-mean-standard-devi
            var = float(sd) ** 2
            denom = (2 * math.pi * var) ** 0.5
            num = math.exp(-(float(x) - float(mean)) ** 2 / (2 * var))
            return num / denom

        print(normpdf(sample, mean, std))  # 0.3989

    submitted by /u/Spiritual_Fig3632 [link] [comments]  ( 83 min )
  • Open

    The Metaverse Goes Industrial: Siemens, NVIDIA Extend Partnership to Bring Digital Twins Within Easy Reach
    Silicon Valley magic met Wednesday with 175 years of industrial technology leadership as Siemens CEO Roland Busch and NVIDIA Founder and CEO Jensen Huang shared their vision for an “industrial metaverse” at the launch of the Siemens Xcelerator business platform in Munich. “When we combine the real and digital worlds we can achieve new levels Read article > The post The Metaverse Goes Industrial: Siemens, NVIDIA Extend Partnership to Bring Digital Twins Within Easy Reach appeared first on NVIDIA Blog.  ( 8 min )
    NVIDIA, Partners Show Leading AI Performance and Versatility in MLPerf
    NVIDIA and its partners continued to provide the best overall AI training performance and the most submissions across all benchmarks with 90% of all entries coming from the ecosystem, according to MLPerf benchmarks released today. The NVIDIA AI platform covered all eight benchmarks in the MLPerf Training 2.0 round, highlighting its leading versatility. No other Read article > The post NVIDIA, Partners Show Leading AI Performance and Versatility in MLPerf appeared first on NVIDIA Blog.  ( 7 min )
    NVIDIA Studio Driver Elevates Creative Workflows in Blender 3.2, BorisFX Sapphire and Topaz Denoise AI
    The June NVIDIA Studio Driver is available for download today, optimizing the latest creative app updates, all with the stability and reliability that users count on. Creators with NVIDIA RTX GPUs will benefit from faster performance and new features within Blender version 3.2, BorisFX Sapphire release 2022.5 and Topaz Denoise AI 3.7.0. The post NVIDIA Studio Driver Elevates Creative Workflows in Blender 3.2, BorisFX Sapphire and Topaz Denoise AI appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    Brain Power Level AI Supercomputer Has 174 Trillion Parameters | AI Robot Arm Learns With Vision | Vista 2.0 For Autonomous Vehicles | Brain Tumor Detection AI Better Than Humans
    submitted by /u/tohelpyou88 [link] [comments]  ( 83 min )
  • Open

    Top 12 Logistics Technological Trends to Watch Out in 2022
    Over the past two decades, fast-evolving technology, growing customer expectations, and implementation of new business models have… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 17 min )
  • Open

    Introducing the Microsoft Climate Research Initiative
    Addressing and mitigating the effects of climate change requires a collective effort, bringing our strengths to bear across industry, government, academia, and civil society. The post Introducing the Microsoft Climate Research Initiative appeared first on Microsoft Research.  ( 10 min )
  • Open

    Definitive Guide: An Insight Look At PHP Workers
    Have you ever browsed through your favorite online ecommerce site and, as you were checking out, ended up with a 504 error after a delay? Or perhaps you were browsing your favorite sports site, and as you attempt to load another page, it takes a while to load back with a timeout error? These situations… Read More »Definitive Guide: An Insight Look At PHP Workers The post Definitive Guide: An Insight Look At PHP Workers appeared first on Data Science Central.  ( 20 min )
    DSC Weekly 28 June 2022: Strokes, AI and Cognition
    Regular readers may have noticed that DSC Weekly didn’t come out last week. The reason was personal – a close relative of mine had a series of strokes over the last couple of weeks, and I needed to take some time away to deal with the consequences. In addition, we migrated over to a new… Read More »DSC Weekly 28 June 2022: Strokes, AI and Cognition The post DSC Weekly 28 June 2022: Strokes, AI and Cognition appeared first on Data Science Central.  ( 20 min )
  • Open

    Long Range Language Modeling via Gated State Spaces. (arXiv:2206.13947v1 [cs.LG])
    State space models have been shown to be effective at modeling long range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, Github source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.
    Empirical Study of Quality Image Assessment for Synthesis of Fetal Head Ultrasound Imaging with DCGANs. (arXiv:2206.01731v2 [eess.IV] UPDATED)
    In this work, we present an empirical study of DCGANs, including hyperparameter heuristics and image quality assessment, as a way to address the scarcity of datasets to investigate fetal head ultrasound. We present experiments to show the impact of different image resolutions, epochs, dataset sizes, and learning rates for quality image assessment on four metrics: mutual information (MI), Fr\'echet inception distance (FID), peak-signal-to-noise ratio (PSNR), and local binary pattern vector (LBPv). The results show that FID and LBPv have a stronger relationship with clinical image quality scores. The resources to reproduce this work are available at \url{https://github.com/budai4medtech/miua2022}.
    A View Independent Classification Framework for Yoga Postures. (arXiv:2206.13577v1 [cs.CV])
    Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing a Yogasana is of utmost importance. In this work, we employ transfer learning from Human Pose Estimation models for extracting 136 key-points spread all over the body to train a Random Forest classifier which is used for estimation of the Yogasanas. The results are evaluated on an in-house collected extensive yoga video database of 51 subjects recorded from 4 different camera angles. We propose a 3 step scheme for evaluating the generalizability of a Yoga classifier by testing it on 1) unseen frames, 2) unseen subjects, and 3) unseen camera angles. We argue that for most of the applications, validation accuracies on unseen subjects and unseen camera angles would be most important. We empirically analyze, over three public datasets, the advantage of transfer learning and the possibility of target leakage. We further demonstrate that the classification accuracies critically depend on the cross validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.
    Accurate and fast identification of minimally prepared bacteria phenotypes using Raman spectroscopy assisted by machine learning. (arXiv:2206.13933v1 [cs.LG])
    The worldwide increase of antimicrobial resistance (AMR) is a serious threat to human health. To avert the spread of AMR, fast and reliable diagnostic tools that facilitate optimal antibiotic stewardship are an unmet need. In this regard, Raman spectroscopy promises rapid label- and culture-free identification and antimicrobial susceptibility testing (AST) in a single step. However, even though many Raman-based bacteria-identification and AST studies have demonstrated impressive results, some shortcomings must be addressed. To bridge the gap between proof-of-concept studies and clinical application, we have developed machine learning techniques, in combination with a novel data-augmentation algorithm, for fast identification of minimally prepared bacteria phenotypes and the distinction of methicillin-resistant (MR) from methicillin-susceptible (MS) bacteria. For this we have implemented a spectral transformer model for hyper-spectral Raman images of bacteria. We show that our model outperforms the standard convolutional neural network models on a multitude of classification problems, both in terms of accuracy and in terms of training time. We attain more than 96$\%$ classification accuracy on a dataset consisting of 15 different classes and 95.6$\%$ classification accuracy for six MR-MS bacteria species. More importantly, our results are obtained using only fast and easy-to-produce training and test data.
    Secure Distributed Training at Scale. (arXiv:2106.11257v3 [cs.LG] UPDATED)
    Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address it is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
    SkipNode: On Alleviating Over-smoothing for Deep Graph Convolutional Networks. (arXiv:2112.11628v2 [cs.LG] UPDATED)
    Over-smoothing is a challenging problem, which degrades the performance of deep graph convolutional networks (GCNs). However, existing studies for alleviating the over-smoothing problem lack either generality or effectiveness. In this paper, we analyze the underlying issues behind the over-smoothing problem, i.e., feature-diversity degeneration, gradient vanishing, and model weights over-decaying. Inspired by this, we propose a simple yet effective plug-and-play module, SkipNode, to alleviate over-smoothing. Specifically, for each middle layer of a GCN model, SkipNode randomly (or based on node degree) selects nodes to skip the convolutional operation by directly feeding their input features to the nonlinear function. Analytically, 1) skipping the convolutional operation prevents the features from losing diversity; and 2) the "skipped" nodes enable gradients to be directly passed back, thus mitigating the gradient vanishing and model weights over-decaying issues. To demonstrate the superiority of SkipNode, we conduct extensive experiments on nine popular datasets, including both homophilic and heterophilic graphs, with different graph sizes on two typical tasks: node classification and link prediction. Specifically, 1) SkipNode has strong generalizability of being applied to various GCN-based models on different datasets and tasks; and 2) SkipNode outperforms recent state-of-the-art anti-over-smoothing plug-and-play modules, i.e., DropEdge and DropNode, in different settings. Code will be made publicly available on GitHub.
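    A hedged sketch of the mechanism as the abstract describes it (random node selection; the authors' degree-based variant and any normalization details are omitted): selected nodes skip the graph convolution and feed their input features straight to the nonlinearity.

        import torch
        import torch.nn.functional as F

        def skipnode_layer(x, adj, weight, p=0.5):
            """For a fraction p of nodes, skip the graph convolution and
            pass input features directly to the nonlinearity. This preserves
            feature diversity and gives gradients a direct path back."""
            conv = adj @ (x @ weight)                        # plain GCN propagation
            skip_mask = (torch.rand(x.size(0), 1) < p).float()
            h = skip_mask * x + (1.0 - skip_mask) * conv
            return F.relu(h)

        # Toy usage: 5 nodes, 8 features (input dim must equal output dim
        # for the skipped features to be passed through directly).
        x = torch.randn(5, 8)
        adj = torch.eye(5)
        w = torch.randn(8, 8)
        out = skipnode_layer(x, adj, w)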
    Critical Investigation of Failure Modes in Physics-informed Neural Networks. (arXiv:2206.09961v2 [cs.LG] UPDATED)
    Several recent works in scientific machine learning have revived interest in the application of neural networks to partial differential equations (PDEs). A popular approach is to aggregate the residual form of the governing PDE and its boundary conditions as soft penalties into a composite objective/loss function for training neural networks, which is commonly referred to as physics-informed neural networks (PINNs). In the present study, we visualize the loss landscapes and distributions of learned parameters and explain the ways this particular formulation of the objective function may hinder or even prevent convergence when dealing with challenging target solutions. We construct a purely data-driven loss function composed of both the boundary loss and the domain loss. Using this data-driven loss function and, separately, a physics-informed loss function, we then train two neural network models with the same architecture. We show that incomparable scales between boundary and domain loss terms are the culprit behind the poor performance. Additionally, we assess the performance of both approaches on two elliptic problems with increasingly complex target solutions. Based on our analysis of their loss landscapes and learned parameter distributions, we observe that a physics-informed neural network with a composite objective function formulation produces highly non-convex loss surfaces that are difficult to optimize and are more prone to the problem of vanishing gradients.
    Personalized Keyword Spotting through Multi-task Learning. (arXiv:2206.13708v1 [cs.SD])
    Keyword spotting (KWS) plays an essential role in enabling speech-based user interaction on smart devices, and conventional KWS (C-KWS) approaches have concentrated on detecting user-agnostic pre-defined keywords. However, in practice, most user interactions come from target users enrolled on the device, which motivates constructing personalized keyword spotting. We design two personalized KWS tasks; (1) Target user Biased KWS (TB-KWS) and (2) Target user Only KWS (TO-KWS). To solve the tasks, we propose personalized keyword spotting through multi-task learning (PK-MTL) that consists of multi-task learning and task-adaptation. First, we introduce applying multi-task learning on keyword spotting and speaker verification to leverage user information to the keyword spotting system. Next, we design task-specific scoring functions to adapt to the personalized KWS tasks thoroughly. We evaluate our framework on conventional and personalized scenarios, and the results show that PK-MTL can dramatically reduce the false alarm rate, especially in various practical scenarios.
    Increasing Confidence in Adversarial Robustness Evaluations. (arXiv:2206.13991v1 [cs.LG])
    Hundreds of defenses have been proposed to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of these defenses held up their claims because correctly evaluating robustness is extremely challenging: Weak attacks often fail to find adversarial examples even if they unknowingly exist, thereby making a vulnerable network look robust. In this paper, we propose a test to identify weak attacks, and thus weak defense evaluations. Our test slightly modifies a neural network to guarantee the existence of an adversarial example for every sample. Consequentially, any correct attack must succeed in breaking this modified network. For eleven out of thirteen previously-published defenses, the original evaluation of the defense fails our test, while stronger attacks that break these defenses pass it. We hope that attack unit tests - such as ours - will be a major component in future robustness evaluations and increase confidence in an empirical field that is currently riddled with skepticism.
    Learning the Solution Operator of Boundary Value Problems using Graph Neural Networks. (arXiv:2206.14092v1 [cs.LG])
    As an alternative to classical numerical solvers for partial differential equations (PDEs) subject to boundary value constraints, there has been a surge of interest in investigating neural networks that can solve such problems efficiently. In this work, we design a general solution operator for two different time-independent PDEs using graph neural networks (GNNs) and spectral graph convolutions. We train the networks on simulated data from a finite elements solver on a variety of shapes and inhomogeneities. In contrast to previous works, we focus on the ability of the trained operator to generalize to previously unseen scenarios. Specifically, we test generalization to meshes with different shapes and superposition of solutions for a different number of inhomogeneities. We find that training on a diverse dataset with lots of variation in the finite element meshes is a key ingredient for achieving good generalization results in all cases. With this, we believe that GNNs can be used to learn solution operators that generalize over a range of properties and produce solutions much faster than a generic solver. Our dataset, which we make publicly available, can be used and extended to verify the robustness of these models under varying conditions.
    How to Steer Your Adversary: Targeted and Efficient Model Stealing Defenses with Gradient Redirection. (arXiv:2206.14157v1 [cs.LG])
    Model stealing attacks present a dilemma for public machine learning APIs. To protect financial investments, companies may be forced to withhold important information about their models that could facilitate theft, including uncertainty estimates and prediction explanations. This compromise is harmful not only to users but also to external transparency. Model stealing defenses seek to resolve this dilemma by making models harder to steal while preserving utility for benign users. However, existing defenses have poor performance in practice, either requiring enormous computational overheads or severe utility trade-offs. To meet these challenges, we present a new approach to model stealing defenses called gradient redirection. At the core of our approach is a provably optimal, efficient algorithm for steering an adversary's training updates in a targeted manner. Combined with improvements to surrogate networks and a novel coordinated defense strategy, our gradient redirection defense, called GRAD${}^2$, achieves small utility trade-offs and low computational overhead, outperforming the best prior defenses. Moreover, we demonstrate how gradient redirection enables reprogramming the adversary with arbitrary behavior, which we hope will foster work on new avenues of defense.
    Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v2 [cs.IT] UPDATED)
    We consider the distributed SGD problem, where a main node distributes gradient calculations among $n$ workers. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the algorithm's error with its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive $k$-sync, neglects the cost of unused computations and of communicating models to workers that reveal a straggling behavior. We propose a cost-efficient scheme that assigns tasks only to $k$ workers, and gradually increases $k$. We introduce the use of a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations. Assuming workers with exponentially distributed response times parameterized by different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Furthermore, we propose and analyze a strategy applicable to a large class of response time distributions. Compared to adaptive $k$-sync, our scheme achieves significantly lower errors with the same computational efforts and less downlink communication while being inferior in terms of speed.
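    As a toy sketch of the bandit component described (not the paper's algorithm): maintain running estimates of each worker's mean response time and, each round, optimistically assign tasks to the k workers with the smallest lower confidence bound, with exponential response times per the abstract's assumption.

        import numpy as np

        rng = np.random.default_rng(0)
        n, k = 10, 3
        true_means = rng.uniform(0.5, 3.0, n)      # workers' mean response times
        counts = np.ones(n)                        # one warm-up sample per worker
        means = rng.exponential(true_means)        # initial estimates

        for t in range(1, 500):
            # Optimism for a minimization problem: pick the k workers with
            # the smallest lower confidence bound on mean response time.
            lcb = means - np.sqrt(2 * np.log(t + 1) / counts)
            chosen = np.argsort(lcb)[:k]
            samples = rng.exponential(true_means[chosen])   # observed times
            counts[chosen] += 1
            means[chosen] += (samples - means[chosen]) / counts[chosen]

        print(np.argsort(true_means)[:k], np.sort(chosen))  # true vs. learned fastest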
    Continual Learning with Transformers for Image Classification. (arXiv:2206.14085v1 [cs.LG])
    In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but they need complex tuning to balance the growing number of parameters and barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains a good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to the state-of-the-art methods.
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v4 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging. All the more so, a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.
    Let Users Decide: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse. (arXiv:2203.06768v2 [cs.LG] UPDATED)
    As machine learning (ML) models are increasingly being employed to make consequential decisions, there has been a growing interest in developing techniques which can provide recourse to affected individuals. The majority of these techniques provide recourse under the assumption that the affected individuals will implement the prescribed recourses \emph{exactly}. However, recourses often get implemented in a noisy and inconsistent manner due to a variety of reasons, e.g., an individual who was asked to increase their salary by \$500 may get a promotion which comes with a raise of \$505. Motivated by this, we study the problem of recourse invalidation in the face of noisy human responses. More specifically, we theoretically and empirically analyze the behavior of state-of-the-art algorithms, and demonstrate that the recourses generated by these algorithms are very likely to be invalidated (i.e., result in negative outcomes) if small changes are made to them. We further propose a novel framework, EXPECTing noisy responses (\texttt{EXPECT}), which addresses the aforementioned problem by explicitly minimizing the probability of recourse invalidation in the face of noisy responses. Our framework can ensure that the resulting recourses are invalidated at most $r \%$ of the time, where $r$ is provided as input by the end user requesting recourse. By doing so, our framework provides end users with greater control in navigating the trade-offs between recourse costs and robustness to noisy responses. Experimental evaluation with multiple real world datasets demonstrates the efficacy of the proposed framework, and validates our theoretical findings.
    Detecting Arbitrary Order Beneficial Feature Interactions for Recommender Systems. (arXiv:2206.13764v1 [cs.IR])
    Detecting beneficial feature interactions is essential in recommender systems, and existing approaches achieve this by examining all the possible feature interactions. However, the cost of examining all the possible higher-order feature interactions is prohibitive (exponentially growing with the order increasing). Hence existing approaches only detect limited order (e.g., combinations of up to four features) beneficial feature interactions, which may miss beneficial feature interactions with orders higher than the limitation. In this paper, we propose a hypergraph neural network based model named HIRS. HIRS is the first work that directly generates beneficial feature interactions of arbitrary orders and makes recommendation predictions accordingly. The number of generated feature interactions can be specified to be much smaller than the number of all the possible interactions and hence, our model admits a much lower running time. To achieve an effective algorithm, we exploit three properties of beneficial feature interactions, and propose deep-infomax-based methods to guide the interaction generation. Our experimental results show that HIRS outperforms state-of-the-art algorithms by up to 5% in terms of recommendation accuracy.
    Learning Variable Impedance Control for Aerial Sliding on Uneven Heterogeneous Surfaces by Proprioceptive and Tactile Sensing. (arXiv:2206.14122v1 [cs.RO])
    The recent development of novel aerial vehicles capable of physically interacting with the environment leads to new applications such as contact-based inspection. These tasks require the robotic system to exchange forces with partially-known environments, which may contain uncertainties including unknown spatially-varying friction properties and discontinuous variations of the surface geometry. Finding a control strategy that is robust against these environmental uncertainties remains an open challenge. This paper presents a learning-based adaptive control strategy for aerial sliding tasks. In particular, the gains of a standard impedance controller are adjusted in real-time by a policy based on the current control signals, proprioceptive measurements, and tactile sensing. This policy is trained in simulation with simplified actuator dynamics in a student-teacher learning setup. The real-world performance of the proposed approach is verified using a tilt-arm omnidirectional flying vehicle. The proposed controller structure combines data-driven and model-based control methods, enabling our approach to successfully transfer directly and without adaptation from simulation to the real platform. Compared to fine-tuned state of the art interaction control methods we achieve reduced tracking error and improved disturbance rejection.
    MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. (arXiv:2111.12707v4 [cs.CV] UPDATED)
    Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at \url{https://github.com/Vegetebird/MHFormer}.
    Deep Learning-Based Defect Classification and Detection in SEM Images. (arXiv:2206.13505v1 [eess.IV])
    This paper proposes a novel ensemble deep learning-based model to accurately classify, detect, and localize different defect categories for aggressive pitches and thin resists (high-NA applications). In particular, we train RetinaNet models using different ResNet and VGGNet architectures as backbones, and present a comparison between the accuracies of these models and their performance analysis on SEM images with different types of defect patterns such as bridges, breaks, and line collapses. Finally, we propose a preference-based ensemble strategy to combine the output predictions from different models in order to achieve better performance on classification and detection of defects. As CD-SEM images inherently contain a significant level of noise, detailed feature information is often shadowed by noise. For certain resist profiles, the challenge is also to differentiate between a microbridge, footing, break, and zones of probable breaks. Therefore, we have applied an unsupervised machine learning model to denoise the SEM images, removing false-positive defects and mitigating the effect of stochastic noise on structured pixels for better metrology and enhanced defect inspection. We repeated the defect inspection step with the same trained model and performed a comparative analysis of the robustness and accuracy metrics against the conventional approach for both the noisy and denoised image pairs. The proposed ensemble method demonstrates an improvement in the mean average precision (mAP) for the most difficult defect classes. In this work we have developed a novel, robust, supervised deep learning training scheme to accurately classify as well as localize different defect types in SEM images with a high degree of accuracy. Our proposed approach demonstrates its effectiveness both quantitatively and qualitatively.
    Graph Condensation via Receptive Field Distribution Matching. (arXiv:2206.13697v1 [cs.LG])
    Graph neural networks (GNNs) enable the analysis of graphs using deep learning, with promising results in capturing structured information in graphs. This paper focuses on creating a small graph to represent the original graph, so that GNNs trained on the size-reduced graph can make accurate predictions. We view the original graph as a distribution of receptive fields and aim to synthesize a small graph whose receptive fields share a similar distribution. Thus, we propose Graph Condensation via Receptive Field Distribution Matching (GCDM), which is accomplished by optimizing the synthetic graph through the use of a distribution matching loss quantified by maximum mean discrepancy (MMD). Additionally, we demonstrate that the synthetic graph generated by GCDM is highly generalizable to a variety of models in the evaluation phase and that the condensation speed is significantly improved using this framework.
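    For readers unfamiliar with distribution matching losses, the sketch below shows a generic empirical MMD^2 estimate with a Gaussian kernel, the kind of quantity GCDM minimizes between receptive-field embeddings of the original and synthetic graphs; the embedding dimensions and kernel bandwidth are illustrative assumptions, not the paper's settings.

    ```python
    # Minimal sketch of a Gaussian-kernel MMD loss between two sets of embeddings.
    import torch

    def gaussian_kernel(x, y, sigma=1.0):
        # Pairwise squared Euclidean distances between rows of x and y.
        d2 = torch.cdist(x, y) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    def mmd_loss(x, y, sigma=1.0):
        # Biased empirical MMD^2 estimate between samples x and y.
        k_xx = gaussian_kernel(x, x, sigma).mean()
        k_yy = gaussian_kernel(y, y, sigma).mean()
        k_xy = gaussian_kernel(x, y, sigma).mean()
        return k_xx + k_yy - 2 * k_xy

    # Toy example: embeddings of receptive fields from the original and synthetic graphs.
    orig = torch.randn(512, 64)
    synth = torch.randn(32, 64, requires_grad=True)
    loss = mmd_loss(orig, synth)
    loss.backward()  # gradients flow into the synthetic graph's features
    ```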
    Risk Perspective Exploration in Distributional Reinforcement Learning. (arXiv:2206.14170v1 [cs.LG])
    Distributional reinforcement learning demonstrates state-of-the-art performance in continuous and discrete control settings, and the variance and risk information it captures can be used for exploration. However, while numerous exploration methods in distributional RL employ the variance of the return distribution per action, exploration methods that exploit the risk property itself are scarce. In this paper, we present risk scheduling approaches that explore risk levels and optimistic behaviors from a risk perspective. Through comprehensive experiments, we demonstrate the performance enhancement of the DMIX algorithm using risk scheduling in a multi-agent setting.
    Molecular Geometry Pretraining with SE(3)-Invariant Denoising Distance Matching. (arXiv:2206.13602v1 [cs.LG])
    Pretraining molecular representations is critical in a variety of applications in drug and material discovery due to the limited number of labeled molecules, yet most existing work focuses on pretraining on 2D molecular graphs. The power of pretraining on 3D geometric structures, however, has been less explored, owing to the difficulty of finding a sufficient proxy task that empowers the pretraining to effectively extract essential features from the geometric structures. Motivated by the dynamic nature of 3D molecules, where the continuous motion of a molecule in 3D Euclidean space forms a smooth potential energy surface, we propose a 3D coordinate denoising pretraining framework to model such an energy landscape. Leveraging an SE(3)-invariant score matching method, we propose SE(3)-DDM, in which the coordinate denoising proxy task is effectively boiled down to the denoising of the pairwise atomic distances in a molecule. Our comprehensive experiments confirm the effectiveness and robustness of our proposed method.
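    To make the proxy task concrete, here is a minimal sketch (not the authors' code) of coordinate denoising reduced to pairwise-distance regression; `model` is a hypothetical SE(3)-invariant network and the noise scale is an assumption.

    ```python
    # Perturb 3D coordinates, then regress predicted pairwise distances onto clean ones.
    import torch

    def distance_matrix(pos):
        # (n_atoms, 3) -> (n_atoms, n_atoms) pairwise Euclidean distances
        return torch.cdist(pos, pos)

    def denoising_distance_loss(model, atom_feats, pos, noise_std=0.1):
        noisy_pos = pos + noise_std * torch.randn_like(pos)
        target = distance_matrix(pos)                          # clean distances
        pred = model(atom_feats, distance_matrix(noisy_pos))   # predicted clean distances
        return ((pred - target) ** 2).mean()
    ```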
    Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL. (arXiv:2206.14057v1 [cs.LG])
    While the primary goal of the exploration phase in reward-free reinforcement learning (RF-RL) is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity for RF-RL.
    Studying Generalization Through Data Averaging. (arXiv:2206.13669v1 [stat.ML])
    The generalization of machine learning models has a complex dependence on the data, model, and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples, to understand their "typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing that test generalization depends only on the data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict some aspects of how the generalization gap and model train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the CIFAR-10 classification task with a ResNet architecture.
    Adaptive Multi-view Rule Discovery for Weakly-Supervised Compatible Products Prediction. (arXiv:2206.13749v1 [cs.LG])
    On e-commerce platforms, predicting whether two products are compatible with each other is an important functionality for achieving a trustworthy product recommendation and search experience for consumers. However, accurately predicting product compatibility is difficult due to heterogeneous product data and the lack of manually curated training data. We study the problem of discovering effective labeling rules that can enable weakly-supervised product compatibility prediction. We develop AMRule, a multi-view rule discovery framework that can (1) adaptively and iteratively discover novel rules that complement the current weakly-supervised model to improve compatibility prediction; and (2) discover interpretable rules from both structured attribute tables and unstructured product descriptions. AMRule adaptively discovers labeling rules from large-error instances via a boosting-style strategy; the high-quality rules remedy the current model's weak spots and refine the model iteratively. For rule discovery from structured product attributes, we generate composable high-order rules from decision trees; for rule discovery from unstructured product descriptions, we generate prompt-based rules from a pre-trained language model. Experiments on 4 real-world datasets show that AMRule outperforms the baselines by 5.98% on average and improves rule quality and rule proposal efficiency.
    Supervised Learning with General Risk Functionals. (arXiv:2206.13648v1 [stat.ML])
    Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of H\"older risk functionals for which closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all H\"older risk functionals and over all hypotheses. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (a widely studied subset of H\"older risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks.
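    As a concrete instance of empirical risk minimization over a distortion risk, the sketch below minimizes conditional value at risk (CVaR) via the standard Rockafellar-Uryasev reformulation; the linear model, data, and quantile level are illustrative assumptions, not the paper's method.

    ```python
    # Gradient-based minimization of CVaR_q over per-sample losses.
    import torch

    def cvar_objective(losses, alpha, q=0.95):
        # CVaR_q(L) = min_alpha  alpha + E[(L - alpha)_+] / (1 - q)
        return alpha + torch.relu(losses - alpha).mean() / (1 - q)

    model = torch.nn.Linear(10, 1)
    alpha = torch.zeros(1, requires_grad=True)   # auxiliary CVaR variable
    opt = torch.optim.SGD(list(model.parameters()) + [alpha], lr=1e-2)

    x, y = torch.randn(256, 10), torch.randn(256, 1)   # toy data
    for _ in range(100):
        losses = (model(x) - y) ** 2                   # per-sample losses
        obj = cvar_objective(losses.squeeze(), alpha)
        opt.zero_grad()
        obj.backward()
        opt.step()
    ```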
    An Expert System for Redesigning Software for Cloud Applications. (arXiv:2109.14569v3 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
    Measuring and Clustering Network Attackers using Medium-Interaction Honeypots. (arXiv:2206.13614v1 [cs.CR])
    Network honeypots are often used by information security teams to measure the threat landscape in order to secure their networks. With the advancement of honeypot development, today's medium-interaction honeypots provide a way for security teams and researchers to deploy these active defense tools that require little maintenance on a variety of protocols. In this work, we deploy such honeypots on five different protocols on the public Internet and study the intent and sophistication of the attacks we observe. We then use the information gained to develop a clustering approach that identifies correlations in attacker behavior to discover IPs that are highly likely to be controlled by a single operator, illustrating the advantage of using these honeypots for data collection.
    Hamiltonian Monte Carlo Particle Swarm Optimizer. (arXiv:2206.14134v1 [cs.LG])
    We introduce the Hamiltonian Monte Carlo Particle Swarm Optimizer (HMC-PSO), an optimization algorithm that reaps the benefits of both Exponentially Averaged Momentum PSO and HMC sampling. The coupling of the position and velocity of each particle with Hamiltonian dynamics in the simulation allows for extensive freedom for exploration and exploitation of the search space. It also provides an excellent technique to explore highly non-convex functions while ensuring efficient sampling. We extend the method to approximate error gradients in closed form for Deep Neural Network (DNN) settings. We discuss possible methods of coupling and compare its performance to that of state-of-the-art optimizers on the Golomb's Ruler problem and on classification tasks.
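    The following sketch illustrates only the particle-swarm half with exponentially averaged momentum; the Hamiltonian Monte Carlo coupling is omitted, and all constants are illustrative assumptions rather than the paper's configuration.

    ```python
    # Minimal PSO with an exponential moving average of the velocity as momentum.
    import numpy as np

    def pso(f, dim=2, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, beta=0.9):
        rng = np.random.default_rng(0)
        x = rng.uniform(-5, 5, (n_particles, dim))   # positions
        v = np.zeros_like(x)                         # velocities
        m = np.zeros_like(x)                         # exponentially averaged momentum
        pbest = x.copy()
        pbest_f = np.apply_along_axis(f, 1, x)
        gbest = pbest[pbest_f.argmin()].copy()
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
            m = beta * m + (1 - beta) * v            # momentum update
            x = x + m
            fx = np.apply_along_axis(f, 1, x)
            improved = fx < pbest_f
            pbest[improved], pbest_f[improved] = x[improved], fx[improved]
            gbest = pbest[pbest_f.argmin()].copy()
        return gbest, pbest_f.min()

    best_x, best_f = pso(lambda z: np.sum(z ** 2))   # minimize a toy quadratic
    ```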
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v2 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research, as it allows one to assess the actual impact of new methods and to confirm the agreement between theory and practice. Yet the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, and tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce, and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing, and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state of the art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, thereby improving the reproducibility of research findings.
    Dummy Prototypical Networks for Few-Shot Open-Set Keyword Spotting. (arXiv:2206.13691v1 [cs.SD])
    Keyword spotting is the task of detecting a keyword in streaming audio. Conventional keyword spotting targets predefined keyword classification, but there is growing interest in few-shot (query-by-example) keyword spotting, e.g., N-way classification given M-shot support samples. Moreover, in real-world scenarios, there can be utterances from unexpected categories (open-set) that need to be rejected rather than classified as one of the N classes. Combining the two needs, we tackle few-shot open-set keyword spotting with a new benchmark setting, named splitGSC. We propose episode-known dummy prototypes based on metric learning to better detect open-set inputs and introduce a simple and powerful approach, Dummy Prototypical Networks (D-ProtoNets). Our D-ProtoNets shows clear margins compared to recent few-shot open-set recognition (FSOSR) approaches in the suggested splitGSC. We also verify our method on a standard benchmark, miniImageNet, where D-ProtoNets shows the state-of-the-art open-set detection rate in FSOSR.
    A Proposed Bi-LSTM Method to Fake News Detection. (arXiv:2206.13982v1 [cs.CL])
    Recent years have seen an explosion in social media usage, allowing people to connect with others. Since the appearance of platforms such as Facebook and Twitter, such platforms have influenced how we speak, think, and behave. The existence of fake news negatively undermines confidence in online content; for instance, false news was a determining factor in influencing the outcome of the U.S. presidential election, among other events. Because this information is so harmful, it is essential to have the necessary tools to detect and resist it. In this study, we applied a Bidirectional Long Short-Term Memory (Bi-LSTM) network to determine whether a news item is fake or real. Data were collected from a number of foreign websites and newspapers. After building and running the model, the work achieved 84% model accuracy and a 62.0 F1-macro score on the training data.
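    A minimal Keras sketch of such a Bi-LSTM classifier is shown below; the vocabulary size, sequence length, and layer widths are illustrative assumptions, not the authors' configuration.

    ```python
    # Binary fake-news classifier: embedding -> Bi-LSTM -> sigmoid output.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    vocab_size, max_len = 20000, 300
    model = models.Sequential([
        layers.Embedding(vocab_size, 128, input_length=max_len),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # 1 = fake, 0 = real
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # model.fit(x_train, y_train, validation_split=0.1, epochs=5) on tokenized, padded text
    ```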
    Tensor Recovery Based on A Novel Non-convex Function Minimax Logarithmic Concave Penalty Function. (arXiv:2206.13506v1 [eess.IV])
    Non-convex relaxation methods have been widely used in tensor recovery problems, and compared with convex relaxation methods, they can achieve better recovery results. In this paper, a new non-convex function, the Minimax Logarithmic Concave Penalty (MLCP) function, is proposed, and some of its intrinsic properties are analyzed, among which it is interesting to find that the Logarithmic function is an upper bound of the MLCP function. The proposed function is generalized to tensor cases, yielding the tensor MLCP and the weighted tensor $L\gamma$-norm. However, an explicit solution cannot be obtained when the function is applied directly to the tensor recovery problem. Therefore, the corresponding equivalence theorems for solving such problems are given, namely, the tensor equivalent MLCP theorem and the equivalent weighted tensor $L\gamma$-norm theorem. In addition, we propose two EMLCP-based models for classic tensor recovery problems, namely low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA), and design proximal alternating linearized minimization (PALM) algorithms to solve them individually. Furthermore, based on the Kurdyka-{\L}ojasiewicz property, it is proved that the solution sequence of the proposed algorithm has finite length and converges to a critical point globally. Finally, extensive experiments show that the proposed algorithms achieve good results, and it is confirmed that the MLCP function is indeed better than the Logarithmic function in the minimization problem, which is consistent with the analysis of its theoretical properties.
    Stochastic linear optimization never overfits with quadratically-bounded losses on general data. (arXiv:2202.06915v2 [cs.LG] UPDATED)
    This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).
    Haul Road Mapping from GPS Traces. (arXiv:2206.13936v1 [cs.LG])
    Automation in mining requires accurate maps of road networks on site. Because roads on open-cut mines are dynamic in nature and continuously changing, manually updating road maps is tedious and error-prone. This paper investigates the possibility of automatically deriving an accurate representation of the road network using GPS data available from haul trucks operating on site. We present an overview of approaches proposed in the literature and test the performance of publicly available methods on GPS data collected from trucks operating on site. Based on shortcomings seen in all tested algorithms, a post-processing step is developed which geometrically analyses the created road map for artefacts typical of free-drive areas on mine sites and significantly improves the quality of the final road network graph.
    Functional Optimization Reinforcement Learning for Real-Time Bidding. (arXiv:2206.13939v1 [cs.AI])
    Real-time bidding (RTB) is the new paradigm of programmatic advertising. An advertiser wants to make the intelligent choice of utilizing a Demand-Side Platform to improve the performance of their ad campaigns. Existing approaches struggle to provide a satisfactory solution for bidding optimization due to stochastic bidding behavior. In this paper, we propose a multi-agent reinforcement learning architecture for RTB with functional optimization. We designed a four-agent bidding environment: three Lagrange-multiplier-based functional optimization agents and one baseline agent (without any attribute of functional optimization). First, numerous attributes were assigned to each agent, including biased or unbiased win probability, Lagrange multiplier, and click-through rate. In order to evaluate the proposed RTB strategy's performance, we demonstrate the results on ten sequential simulated auction campaigns. The results show that agents with functional actions and rewards had the most significant average winning rate and winning surplus, given biased and unbiased winning information respectively. The experimental evaluations show that our approach significantly improves the campaign's efficacy and profitability.
    Distributed Bayesian Online Learning for Cooperative Manipulation. (arXiv:2104.04342v2 [cs.RO] UPDATED)
    For tasks where the dynamics of multiple agents are physically coupled, e.g., in cooperative manipulation, the coordination between the individual agents becomes crucial, which requires exact knowledge of the interaction dynamics. This problem is typically addressed using centralized estimators, which can negatively impact the flexibility and robustness of the overall system. To overcome this shortcoming, we propose a novel distributed learning framework for the exemplary task of cooperative manipulation using Bayesian principles. Using only local state information, each agent obtains an estimate of the object dynamics and grasp kinematics. These local estimates are combined using dynamic average consensus. Due to the strong probabilistic foundation of the method, each estimate of the object dynamics and grasp kinematics is accompanied by a measure of uncertainty, which allows us to guarantee a bounded prediction error with high probability. Moreover, the Bayesian principles directly allow iterative learning with constant complexity, such that the proposed learning method can be used online in real-time applications. The effectiveness of the approach is demonstrated in a simulated cooperative manipulation task.
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v2 [cs.LG] UPDATED)
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on multiple real-world datasets.
    Data Augmentation techniques in time series domain: A survey and taxonomy. (arXiv:2206.13508v1 [cs.LG])
    With the latest advances in deep learning generative models, it has not taken long to take advantage of their remarkable performance in the area of time series. Deep neural networks used to work with time series depend heavily on the breadth and consistency of the datasets used in training. These characteristics are not usually abundant in the real world, where data are typically limited and often subject to privacy constraints that must be guaranteed. Therefore, an effective way forward is to increase the amount of data using data augmentation (DA) techniques, either by adding noise or permutations or by generating new synthetic data. We systematically review the current state of the art in the area to provide an overview of all available algorithms and propose a taxonomy of the most relevant research. The efficiency of the different variants is evaluated; as a vital part of the process, the different metrics used to evaluate performance and the main problems concerning each model are analysed. The ultimate goal of this study is to provide a summary of the evolution and performance of the areas that produce the best results, to guide future researchers in this field.
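    Two of the classical augmentations such surveys cover, noise injection (jittering) and window permutation, can be sketched in a few lines; the parameters below are illustrative.

    ```python
    # Simple time-series augmentations: Gaussian jitter and segment permutation.
    import numpy as np

    def jitter(x, sigma=0.03):
        # Add i.i.d. Gaussian noise to a (length, channels) series.
        return x + np.random.normal(0.0, sigma, x.shape)

    def permute(x, n_segments=4):
        # Split the series into segments and shuffle their order.
        segments = np.array_split(x, n_segments)
        order = np.random.permutation(n_segments)
        return np.concatenate([segments[i] for i in order])

    series = np.sin(np.linspace(0, 10, 500))[:, None]   # toy univariate series
    augmented = permute(jitter(series))
    ```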
    Disentangling Embedding Spaces with Minimal Distributional Assumptions. (arXiv:2206.13872v1 [stat.ML])
    Interest in understanding and factorizing learned embedding spaces is growing. For instance, recent concept-based explanation techniques analyze a machine learning model in terms of interpretable latent components. Such components have to be discovered in the model's embedding space, e.g., through independent component analysis (ICA) or modern disentanglement learning techniques. While these unsupervised approaches offer a sound formal framework, they either require access to a data generating function or impose rigid assumptions on the data distribution, such as independence of components, that are often violated in practice. In this work, we link conceptual explainability for vision models with disentanglement learning and ICA. This enables us to provide the first theoretical results on how components can be identified without requiring any distributional assumptions. From these insights, we derive the disjoint attributions (DA) concept discovery method, which is applicable to a broader class of problems than current approaches yet possesses a formal identifiability guarantee. In an extensive comparison against component analysis and over 300 state-of-the-art disentanglement models, DA stably maintains superior performance, even under varying distributions and correlation strengths.
    RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network. (arXiv:2206.14098v1 [cs.LG])
    This work introduces the RevSilo, the first reversible module for bidirectional multi-scale feature fusion. Like other reversible methods, RevSilo eliminates the need to store hidden activations by recomputing them. Existing reversible methods, however, do not apply to multi-scale feature fusion and are therefore not applicable to a large class of networks. Bidirectional multi-scale feature fusion promotes local and global coherence and has become a de facto design principle for networks targeting spatially sensitive tasks, e.g., HRNet and EfficientDet. When paired with high-resolution inputs, these networks achieve state-of-the-art results across various computer vision tasks, but training them requires substantial accelerator memory for saving large, multi-resolution activations. These memory requirements cap network size and limit progress. Using reversible recomputation, the RevSilo alleviates memory issues while still operating across resolution scales. Stacking RevSilos, we create RevBiFPN, a fully reversible bidirectional feature pyramid network. For classification, RevBiFPN is competitive with networks such as EfficientNet while using up to 19.8x less training memory. When fine-tuned on COCO, RevBiFPN provides up to a 2.5% boost in AP over HRNet using fewer MACs and a 2.4x reduction in training-time memory.
    Improved Text Classification via Test-Time Augmentation. (arXiv:2206.13607v1 [cs.LG])
    Test-time augmentation (TTA) -- the aggregation of predictions across transformed examples of test inputs -- is an established technique to improve the performance of image classification models. Importantly, TTA can be used to improve model performance post-hoc, without additional training. Although TTA can be applied to any data modality, it has seen limited adoption in NLP due in part to the difficulty of identifying label-preserving transformations. In this paper, we present augmentation policies that yield significant accuracy improvements with language models. A key finding is that augmentation policy design -- for instance, the number of samples generated from a single, non-deterministic augmentation -- has a considerable impact on the benefit of TTA. Experiments across a binary classification task and dataset show that test-time augmentation can deliver consistent improvements over current state-of-the-art approaches.
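    In code, the core TTA recipe amounts to averaging class probabilities over several augmented copies of one input; in the sketch below, `model` and `augment` are hypothetical stand-ins, not artifacts from the paper.

    ```python
    # Aggregate predictions over non-deterministic augmentations of a test input.
    import numpy as np

    def tta_predict(model, text, augment, n_aug=8):
        variants = [text] + [augment(text) for _ in range(n_aug)]
        probs = np.stack([model.predict_proba(t) for t in variants])
        return probs.mean(axis=0).argmax()   # aggregate probabilities, then decide
    ```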
    Integral Transforms in a Physics-Informed (Quantum) Neural Network setting: Applications & Use-Cases. (arXiv:2206.14184v1 [quant-ph])
    In many computational problems in engineering and science, differentiation of a function or model is essential, but integration is also needed. An important class of computational problems is that of so-called integro-differential equations, which include both integrals and derivatives of a function. In another example, stochastic differential equations can be written in terms of a partial differential equation of a probability density function of the stochastic variable. To learn characteristics of the stochastic variable based on the density function, specific integral transforms, namely moments, of the density function need to be calculated. Recently, the machine learning paradigm of Physics-Informed Neural Networks emerged with increasing popularity as a method to solve differential equations by leveraging automatic differentiation. In this work, we propose to augment the paradigm of Physics-Informed Neural Networks with automatic integration, in order to compute complex integral transforms on trained solutions and to solve integro-differential equations where integrals are computed on-the-fly during training. Furthermore, we showcase the techniques in various application settings, numerically simulating quantum computer-based neural networks as well as classical neural networks.
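    As a minimal illustration of mixing integration into a differentiable pipeline, the sketch below computes the first moment of a network-parameterized density with fixed-grid trapezoidal quadrature so that it can enter a training loss; the network and grid are assumptions, and this is far simpler than the paper's automatic-integration machinery.

    ```python
    # Differentiable moment of an unnormalized density p(x) = relu(net(x)).
    import torch

    net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
    x = torch.linspace(-5, 5, 401).unsqueeze(1)   # fixed quadrature grid

    def first_moment():
        p = torch.relu(net(x))                          # unnormalized density values
        z = torch.trapezoid(p.squeeze(), x.squeeze())   # normalization constant
        m = torch.trapezoid((x * p).squeeze(), x.squeeze()) / z
        return m   # differentiable, so it can appear in a PINN-style loss term
    ```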
    Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics. (arXiv:2111.01365v2 [cs.LG] UPDATED)
    Offline reinforcement learning leverages large datasets to train policies without interactions with the environment. The learned policies may then be deployed in real-world settings where interactions are costly or dangerous. Current algorithms over-fit to the training dataset and as a consequence perform poorly when deployed to out-of-distribution generalizations of the environment. We aim to address these limitations by learning a Koopman latent representation which allows us to infer symmetries of the system's underlying dynamics. The latter is then utilized to extend the otherwise static offline dataset during training; this constitutes a novel data augmentation framework which reflects the system's dynamics and is thus to be interpreted as an exploration of the environment's phase space. To obtain the symmetries, we employ Koopman theory, in which nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system, so that symmetries of the dynamics may be inferred directly. We provide novel theoretical results on the existence and nature of symmetries relevant for control systems such as reinforcement learning settings. Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets, including D4RL, Metaworld, and Robosuite, and find that by using our framework we consistently improve the state-of-the-art of model-free Q-learning methods.
    SHELS: Exclusive Feature Sets for Novelty Detection and Continual Learning Without Class Boundaries. (arXiv:2206.13720v1 [cs.LG])
    While deep neural networks (DNNs) have achieved impressive classification performance in closed-world learning scenarios, they typically fail to generalize to unseen categories in dynamic open-world environments, in which the number of concepts is unbounded. In contrast, human and animal learners have the ability to incrementally update their knowledge by recognizing and adapting to novel observations. In particular, humans characterize concepts via exclusive (unique) sets of essential features, which are used both for recognizing known classes and for identifying novelty. Inspired by natural learners, we introduce a Sparse High-level-Exclusive, Low-level-Shared feature representation (SHELS) that simultaneously encourages learning exclusive sets of high-level features and essential, shared low-level features. The exclusivity of the high-level features enables the DNN to automatically detect out-of-distribution (OOD) data, while the efficient use of capacity via sparse low-level features permits accommodating new knowledge. The resulting approach uses OOD detection to perform class-incremental continual learning without known class boundaries. We show that using SHELS for novelty detection results in statistically significant improvements over state-of-the-art OOD detection approaches across a variety of benchmark datasets. Further, we demonstrate that the SHELS model mitigates catastrophic forgetting in a class-incremental learning setting, enabling a combined novelty detection and accommodation framework that supports learning in open-world settings.
    RAW-GNN: RAndom Walk Aggregation based Graph Neural Network. (arXiv:2206.13953v1 [cs.LG])
    Graph-Convolution-based methods have been successfully applied to representation learning on homophily graphs where nodes with the same label or similar attributes tend to connect with one another. Due to the homophily assumption of Graph Convolutional Networks (GCNs) that these methods use, they are not suitable for heterophily graphs where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs because they rely on summation operators to aggregate information from neighboring nodes, which is implicitly subject to the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a RAndom Walk Aggregation-based Graph Neural Network (called RAW-GNN) method. The proposed approach integrates the random walk strategy with graph neural networks. The new method utilizes breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator based on Recurrent Neural Networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results showed that the new method achieved state-of-the-art performance on a variety of homophily and heterophily graphs.
    On the universality of the volatility formation process: when machine learning and rough volatility agree. (arXiv:2206.14114v1 [q-fin.ST])
    We train an LSTM network based on a pooled dataset made of hundreds of liquid stocks, aiming to forecast the next daily realized volatility for all stocks. Showing the consistent outperformance of this universal LSTM relative to other asset-specific parametric models, we uncover nonparametric evidence of a universal volatility formation mechanism across assets, relating past market realizations, including daily returns and volatilities, to current volatilities. A parsimonious parametric forecasting device combining the rough fractional stochastic volatility and quadratic rough Heston models with fixed parameters results in the same level of performance as the universal LSTM, which confirms the universality of the volatility formation process from a parametric perspective.
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v4 [cs.LG] UPDATED)
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
    AI-based computer-aided diagnostic system of chest digital tomography synthesis: Demonstrating comparative advantage with X-ray-based AI systems. (arXiv:2206.13504v1 [eess.IV])
    Compared with chest X-ray (CXR) imaging, which is a single image projected from the front of the patient, chest digital tomosynthesis (CDTS) imaging can be more advantageous for lung lesion detection because it acquires multiple images projected from multiple angles of the patient. Various clinical comparative analysis and verification studies have been reported to demonstrate this, but there have been no artificial intelligence (AI)-based comparative analysis studies. Existing AI-based computer-aided detection (CAD) systems for lung lesion diagnosis have been developed mainly based on CXR images; however, CAD based on CDTS, which uses multi-angle images of patients in various directions, has not been proposed and verified for its usefulness compared to CXR-based counterparts. This study develops and tests a CDTS-based AI CAD system to detect lung lesions, demonstrating performance improvements over CXR-based AI CAD. We used multiple projection images as input for the CDTS-based AI model and a single-projection image as input for the CXR-based AI model to fairly compare and evaluate the performance between models. The proposed CDTS-based AI CAD system yielded sensitivities of 0.782 and 0.785 and accuracies of 0.895 and 0.837 for detecting tuberculosis and pneumonia, respectively, against normal subjects. These results show higher performance than the sensitivities of 0.728 and 0.698 and accuracies of 0.874 and 0.826 for detecting tuberculosis and pneumonia through the CXR-based AI CAD, which only uses a single projection image in the frontal direction. We found that CDTS-based AI CAD improved the sensitivity of tuberculosis and pneumonia detection by 5.4% and 8.7%, respectively, compared to CXR-based AI CAD, without loss of accuracy. Therefore, we comparatively demonstrate that CDTS-based AI CAD can outperform its CXR-based counterpart, enhancing the clinical applicability of CDTS.
    Improving Clinical Efficiency and Reducing Medical Errors through NLP-enabled diagnosis of Health Conditions from Transcription Reports. (arXiv:2206.13516v1 [cs.LG])
    Misdiagnosis rates are one of the leading causes of medical errors in hospitals, affecting over 12 million adults across the US. To address the high rate of misdiagnosis, this study utilizes 4 NLP-based algorithms to determine the appropriate health condition based on an unstructured transcription report. Among the Logistic Regression, Random Forest, LSTM, and CNN-LSTM models, the CNN-LSTM model performed best with an accuracy of 97.89%. We packaged this model into an authenticated web platform to provide accessible assistance to clinicians. Overall, by standardizing health care diagnosis and structuring transcription reports, our NLP platform drastically improves the clinical efficiency and accuracy of hospitals worldwide.
    POEM: Out-of-Distribution Detection with Posterior Sampling. (arXiv:2206.13687v1 [cs.LG])
    Out-of-distribution (OOD) detection is indispensable for machine learning models deployed in the open world. Recently, the use of an auxiliary outlier dataset during training (also known as outlier exposure) has shown promising performance. As the sample space for potential OOD data can be prohibitively large, sampling informative outliers is essential. In this work, we propose a novel posterior sampling-based outlier mining framework, POEM, which facilitates efficient use of outlier data and promotes learning a compact decision boundary between ID and OOD data for improved detection. We show that POEM establishes state-of-the-art performance on common benchmarks. Compared to the current best method that uses a greedy sampling strategy, POEM improves the relative performance by 42.0% and 24.2% (FPR95) on CIFAR-10 and CIFAR-100, respectively. We further provide theoretical insights on the effectiveness of POEM for OOD detection.
    Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning. (arXiv:2108.03706v3 [stat.ML] UPDATED)
    The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.
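    To make the idea concrete, here is a minimal sketch of online-bootstrap TD(0) with linear function approximation: B randomly reweighted replicates of the TD update run in parallel, and their spread yields confidence intervals. The step size and weight distribution are illustrative assumptions, not the paper's exact scheme.

    ```python
    # Online bootstrap for TD(0) policy evaluation with linear features.
    import numpy as np

    def online_bootstrap_td(transitions, dim, B=200, alpha=0.05, gamma=0.99):
        rng = np.random.default_rng(0)
        theta = np.zeros((B, dim))                   # B bootstrap replicates
        for phi, r, phi_next in transitions:         # features, reward, next features
            w = rng.exponential(1.0, size=(B, 1))    # mean-one random bootstrap weights
            td_err = r + gamma * theta @ phi_next - theta @ phi
            theta += alpha * w * td_err[:, None] * phi[None, :]
        lo, hi = np.percentile(theta, [2.5, 97.5], axis=0)
        return theta.mean(axis=0), (lo, hi)          # point estimate and 95% interval
    ```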
    Utilizing Class Separation Distance for the Evaluation of Corruption Robustness of Machine Learning Classifiers. (arXiv:2206.13405v1 [cs.LG] CROSS LISTED)
    Robustness is a fundamental pillar of Machine Learning (ML) classifiers, substantially determining their reliability. Methods for assessing classifier robustness are therefore essential. In this work, we address the challenge of evaluating corruption robustness in a way that allows comparability and interpretability on a given dataset. We propose a test data augmentation method that uses a robustness distance $\epsilon$ derived from the dataset's minimal class separation distance. The resulting MSCR (mean statistical corruption robustness) metric allows a dataset-specific comparison of different classifiers with respect to their corruption robustness. The MSCR value is interpretable, as it represents the classifier's avoidable loss of accuracy due to statistical corruptions. On 2D and image data, we show that the metric reflects different levels of classifier robustness. Furthermore, we observe unexpected optima in classifiers' robust accuracy by training and testing classifiers with different levels of noise. While researchers have frequently reported a significant tradeoff in accuracy when training robust models, we strengthen the view that a tradeoff between accuracy and corruption robustness is not inherent. Our results indicate that robustness training through simple data augmentation can already slightly improve accuracy.
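    A bare-bones version of such a corruption-robustness evaluation is sketched below; here the distance eps is passed in directly as an illustrative stand-in for one derived from the minimal class separation, and Gaussian noise stands in for the statistical corruptions.

    ```python
    # Compare clean accuracy against accuracy under noise of scale eps.
    import numpy as np

    def corruption_robustness(clf, X, y, eps, n_draws=10, seed=0):
        rng = np.random.default_rng(seed)
        clean_acc = (clf.predict(X) == y).mean()
        corr_accs = []
        for _ in range(n_draws):
            X_c = X + rng.normal(0.0, eps, X.shape)   # statistically corrupted inputs
            corr_accs.append((clf.predict(X_c) == y).mean())
        corr_acc = float(np.mean(corr_accs))
        return clean_acc, corr_acc, clean_acc - corr_acc   # last term: avoidable loss
    ```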
    Quantum Neural Architecture Search with Quantum Circuits Metric and Bayesian Optimization. (arXiv:2206.14115v1 [quant-ph])
    Quantum neural networks are promising for a wide range of applications in the Noisy Intermediate-Scale Quantum era. As such, there is an increasing demand for automatic quantum neural architecture search. We tackle this challenge by designing a quantum circuits metric for Bayesian optimization with Gaussian processes. To this end, we propose a new quantum gates distance that characterizes the gates' action over every quantum state and provide a theoretical perspective on its geometrical properties. Our approach significantly outperforms the benchmark on three empirical quantum machine learning problems, including training a quantum generative adversarial network, solving combinatorial optimization in the MaxCut problem, and simulating the quantum Fourier transform. Our method can be extended to characterize behaviors of various quantum machine learning models.
    Memory Safe Computations with XLA Compiler. (arXiv:2206.14148v1 [cs.LG])
    Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.
    Transparent Single-Cell Set Classification with Kernel Mean Embeddings. (arXiv:2201.07322v5 [cs.LG] UPDATED)
    Modern single-cell flow and mass cytometry technologies measure the expression of several proteins of the individual cells within a blood or tissue sample. Each profiled biological sample is thus represented by a set of hundreds of thousands of multidimensional cell feature vectors, which incurs a high computational cost to predict each biological sample's associated phenotype with machine learning models. Such a large set cardinality also limits the interpretability of machine learning models due to the difficulty in tracking how each individual cell influences the ultimate prediction. We propose using Kernel Mean Embedding to encode the cellular landscape of each profiled biological sample. Although our foremost goal is to make a more transparent model, we find that our method achieves comparable or better accuracies than the state-of-the-art gating-free methods through a simple linear classifier. As a result, our model contains few parameters but still performs similarly to deep learning models with millions of parameters. In contrast with deep learning approaches, the linearity and sub-selection step of our model makes it easy to interpret classification results. Analysis further shows that our method admits rich biological interpretability for linking cellular heterogeneity to clinical phenotype.
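    The sketch below shows the general pattern under stated assumptions: each biological sample (a set of cell feature vectors) is embedded as the mean of a random Fourier feature map approximating an RBF kernel, and a linear classifier is fit on the resulting fixed-length vectors; the dimensions and data are toy placeholders, not the paper's setup.

    ```python
    # Kernel mean embedding of cell sets via random Fourier features + linear model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rff_map(X, W, b):
        # Random Fourier features approximating an RBF kernel.
        return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

    rng = np.random.default_rng(0)
    d, D = 30, 512   # per-cell feature dimension, embedding dimension
    W, b = rng.normal(0.0, 1.0, (d, D)), rng.uniform(0, 2 * np.pi, D)

    def embed_sample(cells):
        # cells: (n_cells, d) -> one fixed-length vector per biological sample
        return rff_map(cells, W, b).mean(axis=0)

    samples = [rng.normal(size=(1000, d)) for _ in range(40)]   # toy cohort
    labels = rng.integers(0, 2, 40)
    Z = np.stack([embed_sample(s) for s in samples])
    clf = LogisticRegression(max_iter=1000).fit(Z, labels)
    ```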
    Quantifying and Learning Linear Symmetry-Based Disentanglement. (arXiv:2011.06070v4 [cs.LG] UPDATED)
    The definition of Linear Symmetry-Based Disentanglement (LSBD) formalizes the notion of linearly disentangled representations, but there is currently no metric to quantify LSBD. Such a metric is crucial to evaluate LSBD methods and to compare to previous understandings of disentanglement. We propose $\mathcal{D}_\mathrm{LSBD}$, a mathematically sound metric to quantify LSBD, and provide a practical implementation for $\mathrm{SO}(2)$ groups. Furthermore, from this metric we derive LSBD-VAE, a semi-supervised method to learn LSBD representations. We demonstrate the utility of our metric by showing that (1) common VAE-based disentanglement methods don't learn LSBD representations, (2) LSBD-VAE as well as other recent methods can learn LSBD representations, needing only limited supervision on transformations, and (3) various desirable properties expressed by existing disentanglement metrics are also achieved by LSBD representations.
    Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. (arXiv:2203.13339v2 [cs.CL] UPDATED)
    End-to-end speech-to-speech translation (S2ST) without relying on intermediate text representations is a rapidly emerging frontier of research. Recent works have demonstrated that the performance of such direct S2ST systems is approaching that of conventional cascade S2ST when trained on comparable datasets. However, in practice, the performance of direct S2ST is bounded by the availability of paired S2ST training data. In this work, we explore multiple approaches for leveraging much more widely available unsupervised and weakly-supervised speech and text data to improve the performance of direct S2ST based on Translatotron 2. With our most effective approaches, the average translation quality of direct S2ST on 21 language pairs on the CVSS-C corpus is improved by +13.6 BLEU (or +113% relatively), as compared to the previous state-of-the-art trained without additional data. The improvements on low-resource language are even more significant (+398% relatively on average). Our comparative studies suggest future research directions for S2ST and speech representation learning.
    Learning Controllable 3D Level Generators. (arXiv:2206.13623v1 [cs.AI])
    Procedural Content Generation via Reinforcement Learning (PCGRL) foregoes the need for large human-authored data-sets and allows agents to train explicitly on functional constraints, using computable, user-defined measures of quality instead of target output. We explore the application of PCGRL to 3D domains, in which content-generation tasks naturally have greater complexity and potential pertinence to real-world applications. Here, we introduce several PCGRL tasks for the 3D domain of Minecraft (Mojang Studios, 2009). These tasks will challenge RL-based generators using affordances often found in 3D environments, such as jumping, multi-dimensional movement, and gravity. We train an agent to optimize each of these tasks to explore the capabilities of previous research in PCGRL. This agent is able to generate relatively complex and diverse levels, and generalize to random initial states and control targets. Controllability tests on the presented tasks demonstrate their utility for analyzing success and failure in 3D generators.
    UMBRELLA: Uncertainty-Aware Model-Based Offline Reinforcement Learning Leveraging Planning. (arXiv:2111.11097v3 [cs.RO] UPDATED)
    Offline reinforcement learning (RL) provides a framework for learning decision-making from offline data and therefore constitutes a promising approach for real-world applications such as automated driving. Self-driving vehicles (SDV) learn a policy, which potentially even outperforms the behavior in the sub-optimal data set. Especially in safety-critical applications such as automated driving, explainability and transferability are key to success. This motivates the use of model-based offline RL approaches, which leverage planning. However, current state-of-the-art methods often neglect the influence of aleatoric uncertainty arising from the stochastic behavior of multi-agent systems. This work proposes a novel approach for Uncertainty-aware Model-Based Offline REinforcement Learning Leveraging plAnning (UMBRELLA), which solves the prediction, planning, and control problem of the SDV jointly in an interpretable learning-based fashion. A trained action-conditioned stochastic dynamics model captures distinctively different future evolutions of the traffic scene. The analysis provides empirical evidence for the effectiveness of our approach in challenging automated driving simulations and based on a real-world public dataset.
    On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods. (arXiv:2206.13503v2 [cs.LG] UPDATED)
    Machine Learning (ML) models now inform a wide range of human decisions, but using ``black box'' models carries risks such as relying on spurious correlations or errant data. To address this, researchers have proposed methods for supplementing models with explanations of their predictions. However, robust evaluations of these methods' usefulness in real-world contexts have remained elusive, with experiments tending to rely on simplified settings or proxy tasks. We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting by relaxing its simplifying assumptions. Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results. Beyond the present experiment, we believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
    Materials Transformers Language Models for Generative Materials Design: a benchmark study. (arXiv:2206.13578v1 [cond-mat.mtrl-sci])
    Pre-trained transformer language models on large unlabeled corpora have produced state-of-the-art results in natural language processing, organic molecule design, and protein sequence generation. However, no such models have been applied to learn the composition patterns of inorganic materials. Here we train a series of seven modern transformer language models (GPT, GPT-2, GPT-Neo, GPT-J, BLMM, BART, and RoBERTa) using the expanded formulas of materials deposited in the ICSD, OQMD, and Materials Project databases. Six different datasets, with or without non-charge-neutral or electronegativity-imbalanced samples, are used to benchmark the performance and uncover the generation biases of modern transformer models for the generative design of materials compositions. Our extensive experiments show that the causal language model based materials transformers can generate chemically valid materials compositions with as high as 97.54\% being charge neutral and 91.40\% being electronegativity balanced, more than 6 times higher enrichment compared to a baseline pseudo-random sampling algorithm. These models also demonstrate high novelty, and their potential in new materials discovery is proved by their capability to recover the leave-out materials. We also find that the properties of the generated samples can be tailored by training the models with selected training sets, such as high-bandgap materials. Our experiments also show that different models each have their own preferences in terms of the properties of the generated samples, and their running time complexity varies considerably. We have applied our materials transformer models to discover a set of new materials, as validated using DFT calculations.
    Deployment of ML Models using Kubeflow on Different Cloud Providers. (arXiv:2206.13655v1 [cs.LG])
    This project aims to explore the process of deploying machine learning models on Kubernetes using an open-source tool called Kubeflow [1], an end-to-end ML stack orchestration toolkit. We create end-to-end machine learning models on Kubeflow in the form of pipelines and analyze various aspects, including the ease of setup, deployment models, performance, limitations, and features of the tool. We hope that our project acts as a seminar/introductory report that can help vanilla cloud/Kubernetes users with zero knowledge of Kubeflow use it to deploy ML models. From setup on different clouds to serving a trained model over the internet, we give details and metrics on Kubeflow's performance.
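    To make the pipeline workflow concrete, here is a minimal sketch of defining and compiling a Kubeflow pipeline, assuming the kfp v1 Python SDK; the component body, base image, and pipeline name are placeholders, not artifacts from this project.

        import kfp
        from kfp import dsl
        from kfp.components import create_component_from_func

        def train(epochs: int) -> str:
            # Placeholder training step; a real component would train and
            # persist a model, returning its storage path.
            return "/tmp/model"

        # Wrap the Python function as a containerized pipeline component.
        train_op = create_component_from_func(train, base_image="python:3.9")

        @dsl.pipeline(name="demo-training-pipeline")
        def pipeline(epochs: int = 5):
            train_op(epochs)

        if __name__ == "__main__":
            # Compile to a YAML spec that can be uploaded to a Kubeflow cluster.
            kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")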
    Learning to learn online with neuromodulated synaptic plasticity in spiking neural networks. (arXiv:2206.12520v2 [cs.NE] UPDATED)
    We propose that in order to harness our understanding of neuroscience toward machine learning, we must first have powerful tools for training brain-like models of learning. Although substantial progress has been made toward understanding the dynamics of learning in the brain, neuroscience-derived models of learning have yet to demonstrate the same performance capabilities as methods in deep learning such as gradient descent. Inspired by the successes of machine learning using gradient descent, we demonstrate that models of neuromodulated synaptic plasticity from neuroscience can be trained in Spiking Neural Networks (SNNs) with a framework of learning to learn through gradient descent to address challenging online learning problems. This framework opens a new path toward developing neuroscience inspired online learning algorithms.
    Revisiting the Updates of a Pre-trained Model for Few-shot Learning. (arXiv:2205.07874v2 [cs.LG] UPDATED)
    Most of the recent few-shot learning algorithms are based on transfer learning, where a model is pre-trained using a large amount of source data, and the pre-trained model is updated using a small amount of target data afterward. In transfer-based few-shot learning, sophisticated pre-training methods have been widely studied for universal and improved representation. However, few studies have examined how to update the pre-trained model for few-shot learning. In this paper, we compare the two popular updating methods, fine-tuning (i.e., updating the entire network) and linear probing (i.e., updating only the linear classifier), considering the distribution shift between the source and target data. We find that fine-tuning is better than linear probing as the number of samples increases, regardless of distribution shift. Next, we investigate the effectiveness and ineffectiveness of data augmentation when pre-trained models are fine-tuned. Our fundamental analyses demonstrate that careful consideration of the details of updating pre-trained models is required for better few-shot performance.
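    The distinction between the two updating methods is easy to pin down in code; a minimal PyTorch sketch follows, where the ResNet-18 backbone, the classifier split, and the optimizer settings are illustrative assumptions rather than the paper's exact setup.

        import torch
        import torch.nn as nn
        import torchvision.models as models

        def build_few_shot_model(num_classes: int, linear_probe: bool) -> nn.Module:
            model = models.resnet18(weights=None)  # pre-trained weights would be loaded here
            model.fc = nn.Linear(model.fc.in_features, num_classes)
            if linear_probe:
                # Linear probing: freeze everything except the new classifier.
                for name, p in model.named_parameters():
                    p.requires_grad = name.startswith("fc.")
            return model

        # linear_probe=False corresponds to fine-tuning the entire network.
        model = build_few_shot_model(num_classes=5, linear_probe=True)
        trainable = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.SGD(trainable, lr=1e-3)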
    Reduced Optimal Power Flow Using Graph Neural Network. (arXiv:2206.13591v1 [eess.SY])
    Optimal power flow (OPF) problems are formulated and solved for power system operations, especially for determining generation dispatch points in real time. For large and complex power system networks with large numbers of variables and constraints, finding the optimal solution for real-time OPF in a timely manner requires a massive amount of computing power. This paper presents a new method to reduce the number of constraints in the original OPF problem using a graph neural network (GNN). GNNs are machine learning models that utilize features from nodes, edges, and network topology to maximize their performance. In this paper, we propose a GNN model to predict which lines will be heavily loaded or congested given load profiles and generation capacities. Only these critical lines are then monitored in the OPF problem, creating a reduced OPF (ROPF) problem. Significant savings in computing time are expected from the proposed ROPF model. A comprehensive analysis of the predictions from the GNN model is also provided. We conclude that the application of GNNs for ROPF reduces computing time while retaining solution quality.
    Explaining Any ML Model? -- On Goals and Capabilities of XAI. (arXiv:2206.13888v1 [cs.LG])
    An increasing ubiquity of machine learning (ML) motivates research on algorithms to explain ML models and their predictions -- so-called eXplainable Artificial Intelligence (XAI). Despite many survey papers and discussions, the goals and capabilities of XAI algorithms are far from being well understood. We argue that this is because of a problematic reasoning scheme in XAI literature: XAI algorithms are said to complement ML models with desired properties, such as "interpretability", or "explainability". These properties are in turn assumed to contribute to a goal, like "trust" in an ML system. But most properties lack precise definitions and their relationship to such goals is far from obvious. The result is a reasoning scheme that obfuscates research results and leaves an important question unanswered: What can one expect from XAI algorithms? In this article, we clarify the goals and capabilities of XAI algorithms from a concrete perspective: that of their users. Explaining ML models is only necessary if users have questions about them. We show that users can ask diverse questions, but that only one of them can be answered by current XAI algorithms. Answering this core question can be trivial, difficult or even impossible, depending on the ML application. Based on these insights, we outline which capabilities policymakers, researchers and society can reasonably expect from XAI algorithms.
    Parallel Instance Filtering for Malware Detection. (arXiv:2206.13889v1 [cs.CR])
    Machine learning algorithms are widely used in the area of malware detection. With the growth of sample counts, training classification algorithms becomes more and more expensive. In addition, training data sets may contain redundant or noisy instances. The problem to be solved is how to select representative instances from large training data sets without reducing accuracy. This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF). The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process to each subset. Each subset consists of instances that have the same nearest enemy. As a result, the PIF algorithm is fast, since the subsets are processed independently of each other using parallel computation. We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples. The feature set was extracted using static analysis and includes metadata from the portable executable file format. Our experimental results demonstrate that the proposed instance selection algorithm significantly reduces the size of the training data set with only a slight decrease in accuracy. The PIF algorithm outperforms the existing instance selection methods used in the experiments in terms of the ratio between average classification accuracy and storage percentage.
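    A minimal sketch of the grouping step described above: each instance is assigned to the group keyed by its nearest enemy (the closest training point with a different label), and each group can then be filtered independently in parallel. The filtering rule itself is omitted, and the synthetic data is a placeholder.

        import numpy as np
        from collections import defaultdict
        from sklearn.neighbors import NearestNeighbors

        def nearest_enemy_groups(X: np.ndarray, y: np.ndarray) -> dict:
            groups = defaultdict(list)
            for label in np.unique(y):
                friends = np.where(y == label)[0]
                enemies = np.where(y != label)[0]
                nn = NearestNeighbors(n_neighbors=1).fit(X[enemies])
                _, idx = nn.kneighbors(X[friends])
                for i, e in zip(friends, idx.ravel()):
                    groups[enemies[e]].append(i)  # key: index of the nearest enemy
            return groups

        X = np.random.randn(200, 8)
        y = np.random.randint(0, 2, size=200)
        groups = nearest_enemy_groups(X, y)  # each group can be filtered in parallel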
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v2 [stat.ML] UPDATED)
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.
    LiteCON: An All-Photonic Neuromorphic Accelerator for Energy-efficient Deep Learning (Preprint). (arXiv:2206.13861v1 [cs.ET])
    Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for their superior accuracy. However, computing deep CNNs on traditional CPUs and GPUs brings several performance and energy pitfalls. Several novel approaches based on ASIC, FPGA, and resistive-memory devices have recently been demonstrated with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged deep learning accelerator capable of both training and inference, due to the highly compute- and memory-intensive nature of the training phase. In this paper, we propose LiteCON, a novel analog photonic CNN accelerator. LiteCON uses silicon microdisk-based convolution, memristor-based memory, and dense-wavelength-division-multiplexing for energy-efficient and ultrafast deep learning. We evaluate LiteCON using a commercial CAD framework (IPKISS) on deep learning benchmark models including LeNet and VGG-Net. Compared to the state of the art, LiteCON improves CNN throughput, energy efficiency, and computational efficiency by up to 32x, 37x, and 5x, respectively, with trivial accuracy degradation.
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v1 [cs.LG])
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are strongly distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, called neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments using synthetic datasets and multiple case studies on real-world datasets.
    Learning Symmetric Rules with SATNet. (arXiv:2206.13998v1 [cs.AI])
    SATNet is a differentiable constraint solver with a custom backpropagation algorithm, which can be used as a layer in a deep-learning system. It is a promising proposal for bridging deep learning and logical reasoning. In fact, SATNet has been successfully applied to learn, among others, the rules of a complex logical puzzle, such as Sudoku, just from input and output pairs where inputs are given as images. In this paper, we show how to improve the learning of SATNet by exploiting symmetries in the target rules of a given but unknown logical puzzle or more generally a logical formula. We present SymSATNet, a variant of SATNet that translates the given symmetries of the target rules to a condition on the parameters of SATNet and requires that the parameters should have a particular parametric form that guarantees the condition. The requirement dramatically reduces the number of parameters to learn for the rules with enough symmetries, and makes the parameter learning of SymSATNet much easier than that of SATNet. We also describe a technique for automatically discovering symmetries of the target rules from examples. Our experiments with Sudoku and Rubik's cube show the substantial improvement of SymSATNet over the baseline SATNet.
    Value Function Decomposition for Iterative Design of Reinforcement Learning Agents. (arXiv:2206.13901v1 [cs.LG])
    Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons, and standard RL methods offer too few tools to provide insight into the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.
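    A minimal PyTorch sketch of the decomposition idea: the critic predicts one value per reward component and sums them to recover the scalar value used for policy updates. Network sizes and the component count are illustrative assumptions.

        import torch
        import torch.nn as nn

        class DecomposedCritic(nn.Module):
            def __init__(self, obs_dim: int, act_dim: int, num_components: int):
                super().__init__()
                self.body = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU())
                self.heads = nn.Linear(64, num_components)  # one Q per reward term

            def forward(self, obs, act):
                q_components = self.heads(self.body(torch.cat([obs, act], dim=-1)))
                return q_components, q_components.sum(dim=-1)  # per-term and total Q

        critic = DecomposedCritic(obs_dim=8, act_dim=2, num_components=3)
        q_terms, q_total = critic(torch.randn(5, 8), torch.randn(5, 2))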
    Conditional Contrastive Learning for Improving Fairness in Self-Supervised Learning. (arXiv:2106.02866v2 [cs.LG] UPDATED)
    Contrastive self-supervised learning (SSL) learns an embedding space that maps similar data pairs closer and dissimilar data pairs farther apart. Despite its success, one issue has been overlooked: the fairness aspect of representations learned using contrastive SSL. Without mitigation, contrastive SSL techniques can incorporate sensitive information such as gender or race and cause potentially unfair predictions on downstream tasks. In this paper, we propose a Conditional Contrastive Learning (CCL) approach to improve the fairness of contrastive SSL methods. Our approach samples positive and negative pairs from distributions conditioning on the sensitive attribute, or empirically speaking, sampling positive and negative pairs from the same gender or the same race. We show that our approach provably maximizes the conditional mutual information between the learned representations of the positive pairs, and reduces the effect of the sensitive attribute by taking it as the conditional variable. On seven fairness and vision datasets, we empirically demonstrate that the proposed approach achieves state-of-the-art downstream performances compared to unsupervised baselines and significantly improves the fairness of contrastive SSL models on multiple fairness metrics.
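    A minimal sketch of the conditional sampling step: negatives are drawn only from samples that share the anchor's sensitive attribute (the positive being, as usual in contrastive SSL, an augmented view of the anchor). The data layout and sampler here are illustrative assumptions.

        import random

        def conditional_negatives(anchor_idx, attributes, num_neg, rng=random):
            """Indices of negatives sharing the anchor's sensitive attribute,
            so the contrastive loss cannot separate data by that attribute."""
            same_attr = [i for i, a in enumerate(attributes)
                         if a == attributes[anchor_idx] and i != anchor_idx]
            return rng.sample(same_attr, min(num_neg, len(same_attr)))

        attributes = ["f", "m", "f", "f", "m", "f"]   # e.g., gender labels
        print(conditional_negatives(0, attributes, num_neg=2))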
    EMVLight: A Decentralized Reinforcement Learning Framework for Efficient Passage of Emergency Vehicles. (arXiv:2109.05429v3 [cs.LG] UPDATED)
    Emergency vehicles (EMVs) play a crucial role in responding to time-critical events such as medical emergencies and fire outbreaks in an urban area. The less time EMVs spend traveling through the traffic, the more likely it would help save people's lives and reduce property loss. To reduce the travel time of EMVs, prior work has used route optimization based on historical traffic-flow data and traffic signal pre-emption based on the optimal route. However, traffic signal pre-emption dynamically changes the traffic flow which, in turn, modifies the optimal route of an EMV. In addition, traffic signal pre-emption practices usually lead to significant disturbances in traffic flow and subsequently increase the travel time for non-EMVs. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for simultaneous dynamic routing and traffic signal control. EMVLight extends Dijkstra's algorithm to efficiently update the optimal route for the EMVs in real time as it travels through the traffic network. The decentralized RL agents learn network-level cooperative traffic signal phase strategies that not only reduce EMV travel time but also reduce the average travel time of non-EMVs in the network. This benefit has been demonstrated through comprehensive experiments with synthetic and real-world maps. These experiments show that EMVLight outperforms benchmark transportation engineering techniques and existing RL-based signal control methods.
    Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark. (arXiv:2109.14545v3 [cs.LG] UPDATED)
    Neural networks have shown tremendous growth in recent years to solve numerous problems. Various types of neural networks have been introduced to deal with different types of problems. However, the main goal of any neural network is to transform non-linearly separable input data into more linearly separable abstract features using a hierarchy of layers. These layers are combinations of linear and nonlinear functions. The most popular and common non-linearity layers are activation functions (AFs), such as Logistic Sigmoid, Tanh, ReLU, ELU, Swish, and Mish. In this paper, a comprehensive overview and survey of AFs in neural networks for deep learning is presented. Different classes of AFs, such as Logistic Sigmoid and Tanh based, ReLU based, ELU based, and learning based, are covered. Several characteristics of AFs, such as output range, monotonicity, and smoothness, are also pointed out. A performance comparison is also carried out among 18 state-of-the-art AFs with different networks on different types of data. Insights into AFs are presented to help researchers conduct further research and practitioners choose among the different options. The code used for the experimental comparison is released at https://github.com/shivram1987/ActivationFunctions.
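    The surveyed AF classes are easy to compare directly; a minimal PyTorch sketch evaluating a few representatives on a grid of inputs (the grid and output formatting are illustrative):

        import torch
        import torch.nn.functional as F

        x = torch.linspace(-4, 4, steps=9)
        activations = {
            "sigmoid": torch.sigmoid(x),
            "tanh": torch.tanh(x),
            "relu": F.relu(x),
            "elu": F.elu(x),
            "swish": x * torch.sigmoid(x),          # a.k.a. SiLU
            "mish": x * torch.tanh(F.softplus(x)),
        }
        for name, y in activations.items():
            print(f"{name:8s}", [round(v, 2) for v in y.tolist()])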
    Learning from human perception to improve automatic speaker verification in style-mismatched conditions. (arXiv:2206.13684v1 [eess.AS])
    Our prior experiments show that humans and machines seem to employ different approaches to speaker discrimination, especially in the presence of speaking style variability. The experiments examined read versus conversational speech. Listeners focused on speaker-specific idiosyncrasies while "telling speakers together", and on relative distances in a shared acoustic space when "telling speakers apart". However, automatic speaker verification (ASV) systems use the same loss function irrespective of target or non-target trials. To improve ASV performance in the presence of style variability, insights learnt from human perception are used to design a new training loss function that we refer to as "CllrCE loss". CllrCE loss uses both speaker-specific idiosyncrasies and relative acoustic distances between speakers to train the ASV system. When using the UCLA speaker variability database, in the x-vector and conditioning setups, CllrCE loss results in significant relative improvements in EER by 1-66%, and minDCF by 1-31% and 1-56%, respectively, when compared to the x-vector baseline. Using the SITW evaluation tasks, which involve different conversational speech tasks, the proposed loss combined with self-attention conditioning results in significant relative improvements in EER by 2-5% and minDCF by 6-12% over baseline. In the SITW case, performance improvements were consistent only with conditioning.
    Persistent homology-based descriptor for machine-learning potential. (arXiv:2206.13727v1 [cs.LG])
    Constructing efficient descriptors that represent atomic configurations is crucial for developing a superior machine-learning potential. Widely used conventional descriptors are based on two- or three-body correlations of the atomic distribution. Recently, several limitations of these many-body descriptors in classifying different configurations were revealed, with detrimental effects on the prediction of physical properties. We propose a new class of descriptors based on persistent homology. We focus on the two-dimensional visualization of persistent homology, the persistence diagram, as an image-form descriptor of atomic configurations. We demonstrate that convolutional neural network models based on this descriptor provide sufficient accuracy in predicting the mean energies per atom of amorphous graphene and amorphous carbon. Our results provide an avenue for improving machine-learning potentials using descriptors that capture both topological and geometric information.
    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v2 [cs.LG] UPDATED)
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.
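    A minimal sketch of the subset construction described above: the training set is first hashed into many small disjoint buckets, and each base classifier then trains on the union of several buckets, so each bucket (and any poisoned sample inside it) influences only a bounded number of classifiers. The hash and the spreading rule here are illustrative assumptions, not the paper's exact construction.

        import hashlib

        def bucket_of(sample_id: int, num_buckets: int) -> int:
            digest = hashlib.sha256(str(sample_id).encode()).hexdigest()
            return int(digest, 16) % num_buckets

        def training_sets(sample_ids, num_buckets: int, duplication: int):
            buckets = [[] for _ in range(num_buckets)]
            for s in sample_ids:
                buckets[bucket_of(s, num_buckets)].append(s)
            # Base classifier j trains on `duplication` consecutive buckets (mod n),
            # so every bucket is duplicated across exactly `duplication` classifiers.
            return [
                [s for b in range(j, j + duplication) for s in buckets[b % num_buckets]]
                for j in range(num_buckets)
            ]

        sets = training_sets(range(1000), num_buckets=16, duplication=4)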
    Efficient Deep Learning Using Non-Volatile Memory Technology. (arXiv:2206.13601v1 [cs.AR])
    Embedded machine learning (ML) systems have now become the dominant platform for deploying ML serving tasks and are projected to become of equal importance for training ML models. With this comes the challenge of overall efficient deployment, in particular low power and high throughput implementations, under stringent memory constraints. In this context, non-volatile memory (NVM) technologies such as STT-MRAM and SOT-MRAM have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While prior work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM++, a comprehensive framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. DeepNVM++ relies on iso-capacity and iso-area performance and energy models for last-level caches implemented using conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 3.8x and 4.7x energy-delay product (EDP) reduction and 2.4x and 2.8x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide up to 2.2x and 2.4x EDP reduction and accommodate 2.3x and 3.3x cache capacity when compared to SRAM, respectively. We also perform a scalability analysis and show that STT-MRAM and SOT-MRAM achieve orders of magnitude EDP reduction when compared to SRAM for large cache capacities. DeepNVM++ is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPUs for DL applications.
    Harnessing the Power of Ego Network Layers for Link Prediction in Online Social Networks. (arXiv:2109.09190v2 [cs.SI] UPDATED)
    Being able to recommend links between users in online social networks is important for users to connect with like-minded individuals as well as for the platforms themselves and third parties leveraging social media information to grow their business. Predictions are typically based on unsupervised or supervised learning, often leveraging simple yet effective graph topological information, such as the number of common neighbors. However, we argue that richer information about personal social structure of individuals might lead to better predictions. In this paper, we propose to leverage well-established social cognitive theories to improve link prediction performance. According to these theories, individuals arrange their social relationships along, on average, five concentric circles of decreasing intimacy. We postulate that relationships in different circles have different importance in predicting new links. In order to validate this claim, we focus on popular feature-extraction prediction algorithms (both unsupervised and supervised) and we extend them to include social-circles awareness. We validate the prediction performance of these circle-aware algorithms against several benchmarks (including their baseline versions as well as node-embedding- and GNN-based link prediction), leveraging two Twitter datasets comprising a community of video gamers and generic users. We show that social-awareness generally provides significant improvements in the prediction performance, beating also state-of-the-art solutions like node2vec and SEAL, and without increasing the computational complexity. Finally, we show that social-awareness can be used in place of using a classifier (which may be costly or impractical) for targeting a specific category of users.
    Survey on the Convergence of Machine Learning and Blockchain. (arXiv:2201.00976v2 [cs.LG] UPDATED)
    Machine learning (ML) is now pervasively researched and applied in many aspects of real life. Nevertheless, issues with models and data still accompany the development of ML. For instance, training of traditional ML models is limited by access to data sets, which are generally proprietary; published ML models may soon become outdated without updates from new data and continuous training; malicious data contributors may upload wrongly labeled data that leads to undesirable training results; and the abuse of private data and data leakage also exist. With the utilization of blockchain, an emerging and swiftly developing technology, these problems can be efficiently solved. In this paper, we survey the convergence of collaborative ML and blockchain. Different ways of combining these two technologies are investigated and their fields of application are examined. A discussion of the limitations of current research and future directions is also included.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v2 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(m/N+1/m+\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(1/N+{({(m^{-1}\lambda^2)}^{\frac{\alpha}{2}}+ m^{-\alpha})}/{N^{1-\frac{\alpha}{2}}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
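    A minimal numpy sketch of one vanilla D-SGD round on a ring topology: each worker takes a local gradient step and then averages with its neighbors through a doubly stochastic mixing matrix $W$, whose second-largest eigenvalue magnitude $\lambda$ is the connectivity quantity appearing in the bounds above. The topology, step size, and quadratic local objective are illustrative assumptions.

        import numpy as np

        m, d = 8, 10                      # workers, parameter dimension
        rng = np.random.default_rng(0)
        params = rng.normal(size=(m, d))  # one parameter vector per worker

        # Ring topology: each worker averages itself with its two neighbors.
        W = np.zeros((m, m))
        for i in range(m):
            W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1 / 3

        def local_gradients(p):
            return p  # placeholder: gradient of 0.5 * ||x||^2 at each worker

        for step in range(100):
            # Local SGD step followed by one gossip-averaging round.
            params = W @ (params - 0.1 * local_gradients(params))

        lam = np.sort(np.abs(np.linalg.eigvals(W)))[-2]  # second-largest eigenvalue
        print("spectral gap:", 1 - lam)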
    Graph-Based Machine Learning Improves Just-in-Time Defect Prediction. (arXiv:2110.05371v2 [cs.SE] UPDATED)
    The increasing complexity of today's software requires the contribution of thousands of developers. This complex collaboration structure makes developers more likely to introduce defect-prone changes that lead to software faults. Determining when these defect-prone changes are introduced has proven challenging, and using traditional machine learning (ML) methods to make these determinations seems to have reached a plateau. In this work, we build contribution graphs consisting of developers and source files to capture the nuanced complexity of changes required to build software. By leveraging these contribution graphs, our research shows the potential of using graph-based ML to improve Just-In-Time (JIT) defect prediction. We hypothesize that features extracted from the contribution graphs may be better predictors of defect-prone changes than intrinsic features derived from software characteristics. We corroborate our hypothesis using graph-based ML for classifying edges that represent defect-prone changes. This new framing of the JIT defect prediction problem leads to remarkably better results. We test our approach on 14 open-source projects and show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55%. This represents an increase of as much as 46.72% over the state-of-the-art in JIT defect prediction. We describe limitations, open challenges, and how this method can be used for operational JIT defect prediction.
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v3 [cs.LG] UPDATED)
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
    Zero-Shot Building Control. (arXiv:2206.14191v1 [eess.SY])
    Heating and cooling systems in buildings account for 31% of global energy use, much of which is regulated by Rule Based Controllers (RBCs) that neither maximise energy efficiency nor minimise emissions by interacting optimally with the grid. Control via Reinforcement Learning (RL) has been shown to significantly improve building energy efficiency, but existing solutions require pre-training in simulators that are prohibitively expensive to obtain for every building in the world. In response, we show it is possible to perform safe, zero-shot control of buildings by combining ideas from system identification and model-based RL. We call this combination PEARL (Probabilistic Emission-Abating Reinforcement Learning) and show it reduces emissions without pre-training, needing only a three-hour commissioning period. In experiments across three varied building energy simulations, we show PEARL outperforms an existing RBC in one case and popular RL baselines in all cases, reducing building emissions by as much as 31% whilst maintaining thermal comfort.
    Differentially Private Algorithms for Statistical Verification of Cyber-Physical Systems. (arXiv:2004.00275v2 [cs.LG] UPDATED)
    Statistical model checking is a class of sequential algorithms that can verify specifications of interest on an ensemble of cyber-physical systems (e.g., whether 99% of cars from a batch meet a requirement on their energy efficiency). These algorithms infer the probability that given specifications are satisfied by the systems, with provable statistical guarantees, by drawing sufficient numbers of independent and identically distributed samples. During the process of statistical model checking, the values of the samples (e.g., a user's car energy efficiency) may be inferred by intruders, causing privacy concerns in consumer-level applications (e.g., automobiles and medical devices). This paper addresses the privacy of statistical model checking algorithms from the point of view of differential privacy. These algorithms are sequential, drawing samples until a condition on their values is met. We show that revealing the number of samples drawn can violate privacy. We also show that the standard exponential mechanism that randomizes the output of an algorithm to achieve differential privacy fails to do so in the context of sequential algorithms. Instead, we relax the conservative requirement in differential privacy that the sensitivity of the algorithm's output should be bounded under any perturbation of any data set. We propose a new notion of differential privacy which we call expected differential privacy, along with a novel expected sensitivity analysis for sequential algorithms and a corresponding exponential mechanism that randomizes the termination time to achieve expected differential privacy. We apply the proposed mechanism to statistical model checking algorithms to preserve the privacy of the samples they draw. The utility of the proposed algorithm is demonstrated in a case study.
    Visual Adversarial Imitation Learning using Variational Models. (arXiv:2107.08829v2 [cs.LG] UPDATED)
    Reward function specification, which requires considerable human effort and iteration, remains a major impediment for learning behaviors through deep reinforcement learning. In contrast, providing visual demonstrations of desired behaviors often presents an easier and more natural way to teach agents. We consider a setting where an agent is provided a fixed dataset of visual demonstrations illustrating how to perform a task, and must learn to solve the task using the provided demonstrations and unsupervised environment interactions. This setting presents a number of challenges including representation learning for visual observations, sample complexity due to high dimensional spaces, and learning instability due to the lack of a fixed reward or learning signal. Towards addressing these challenges, we develop a variational model-based adversarial imitation learning (V-MAIL) algorithm. The model-based approach provides a strong signal for representation learning, enables sample efficiency, and improves the stability of adversarial training by enabling on-policy learning. Through experiments involving several vision-based locomotion and manipulation tasks, we find that V-MAIL learns successful visuomotor policies in a sample-efficient manner, has better stability compared to prior work, and also achieves higher asymptotic performance. We further find that by transferring the learned models, V-MAIL can learn new tasks from visual demonstrations without any additional environment interactions. All results including videos can be found online at \url{https://sites.google.com/view/variational-mail}.
    Hybrid Ensemble for Fake News Detection: An attempt. (arXiv:2206.13981v1 [cs.CL])
    Fake news detection has been a challenging problem in the field of machine learning. Researchers have approached it via several techniques, using classical statistical classification models and modern deep learning. Today, with the growing amount of data, developments in NLP and ML, and an increase in available computation power, there are infinite permutations and combinations to approach this problem from a different perspective. In this paper, we try different methods to tackle fake news and explore the possibilities of a hybrid ensemble combining classical machine learning techniques with modern deep learning approaches.
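    A minimal sklearn sketch of the kind of hybrid ensemble explored here: classical classifiers over TF-IDF features combined by soft voting (a deep component could be added as a further voter). The example texts and labels are placeholders.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.ensemble import VotingClassifier
        from sklearn.pipeline import make_pipeline

        texts = ["shocking cure doctors hate", "parliament passes budget bill"]
        labels = [1, 0]  # 1 = fake, 0 = real

        ensemble = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)),
            VotingClassifier(
                estimators=[("lr", LogisticRegression(max_iter=1000)),
                            ("nb", MultinomialNB())],
                voting="soft",  # average predicted probabilities across voters
            ),
        )
        ensemble.fit(texts, labels)
        print(ensemble.predict(["aliens endorse candidate"]))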
    An Artificial Neural Network-Based Model Predictive Control for Three-phase Flying Capacitor Multi-Level Inverter. (arXiv:2110.08101v3 [eess.SY] UPDATED)
    Model predictive control (MPC) has been used widely in power electronics due to its simple concept, fast dynamic response, and good reference tracking. However, it suffers from parametric uncertainties, since it directly relies on the mathematical model of the system to predict the optimal switching states to be used at the next sampling time. As a result, uncertain parameters lead to an ill-designed MPC. Thus, this paper offers a model-free control strategy based on artificial neural networks (ANNs) for mitigating the effects of parameter mismatch while having little negative impact on the inverter's performance. This method includes two related stages. First, MPC is used as an expert to control the studied converter in order to provide a dataset, while, in the second stage, the obtained dataset is utilized to train the proposed ANN. The case study herein is based on a four-level three-cell flying capacitor inverter. In this study, MATLAB/Simulink is used to simulate the performance of the proposed method, taking into account various operating conditions. Afterward, the simulation results are reported in comparison with the conventional MPC scheme, demonstrating the superior performance of the proposed control strategy in terms of robustness against parameter mismatch and low total harmonic distortion (THD), especially when changes occur in the system parameters. Furthermore, the experimental validation of the proposed method is provided based on Hardware-in-the-Loop (HIL) simulation using the C2000TM-microcontroller-LaunchPadXL TMS320F28379D kit, demonstrating the applicability of the ANN-based control strategy on a DSP controller.
    QTI Submission to DCASE 2021: residual normalization for device-imbalanced acoustic scene classification with efficient design. (arXiv:2206.13909v1 [cs.SD])
    This technical report describes the details of our TASK1A submission to the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under constraints on model complexity. This report introduces four methods to achieve that goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one device to multiple devices to augment the training data. Finally, we utilize three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters.
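    A minimal PyTorch sketch of the Residual Normalization idea: instance normalization discards device-specific statistics while a weighted identity shortcut retains information useful for classification. The shortcut weight and normalization axes are assumptions, not the exact submission.

        import torch
        import torch.nn as nn

        class ResidualNorm(nn.Module):
            def __init__(self, shortcut_weight: float = 0.1):
                super().__init__()
                self.shortcut_weight = shortcut_weight
                self.inst_norm = nn.InstanceNorm2d(num_features=1, affine=False)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: (batch, channels, freq, time) spectrogram features
                return self.shortcut_weight * x + self.inst_norm(x)

        x = torch.randn(4, 1, 40, 100)
        print(ResidualNorm()(x).shape)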
    BAGEL: A Benchmark for Assessing Graph Neural Network Explanations. (arXiv:2206.13983v1 [cs.LG])
    The problem of interpreting the decisions of machine learning models is well-researched and important. We are interested in a specific type of machine learning model that deals with graph data, namely graph neural networks (GNNs). Evaluating interpretability approaches for GNNs specifically is known to be challenging due to the lack of a commonly accepted benchmark. Given a GNN model, several interpretability approaches exist to explain it, with diverse (sometimes conflicting) evaluation methodologies. In this paper, we propose Bagel, a benchmark for evaluating explainability approaches for GNNs. In Bagel, we first propose four diverse GNN explanation evaluation regimes: 1) faithfulness, 2) sparsity, 3) correctness, and 4) plausibility. We reconcile multiple evaluation metrics from the existing literature and cover diverse notions for a holistic evaluation. Our graph datasets range from citation networks and document graphs to graphs from molecules and proteins. We conduct an extensive empirical study of four GNN models and nine post-hoc explanation approaches for node and graph classification tasks. We open-source both the benchmarks and reference implementations at https://github.com/Mandeep-Rathee/Bagel-benchmark.
    Modeling Extraneous Activity Delays in Business Process Simulation. (arXiv:2206.14051v1 [cs.SE])
    Business Process Simulation (BPS) is a common approach to estimate the impact of changes to a business process on its performance measures. For example, BPS allows us to estimate what would be the cycle time of a process if we automated one of its activities. The starting point of BPS is a business process model annotated with simulation parameters (a BPS model). Several studies have proposed methods to automatically discover BPS models from event logs via process mining. However, current techniques in this space discover BPS models that only capture waiting times caused by resource contention or resource unavailability. Oftentimes, a considerable portion of the waiting time in a business process is caused by extraneous delays, e.g. a resource waits for the customer to return a phone call. This paper proposes a method that discovers extraneous delays from input data, and injects timer events into a BPS model to capture the discovered delays. An empirical evaluation involving synthetic and real-life logs shows that the approach produces BPS models that better reflect the temporal dynamics of the process, relative to BPS models that do not capture extraneous delays.
    Exact Spectral Norm Regularization for Neural Networks. (arXiv:2206.13581v1 [stat.ML])
    We pursue a line of research that seeks to regularize the spectral norm of the Jacobian of the input-output mapping of deep neural networks. While previous work relies on upper-bounding techniques, we provide a scheme that targets the exact spectral norm. We show that our algorithm achieves improved generalization performance compared to previous spectral regularization techniques while simultaneously maintaining a strong safeguard against natural and adversarial noise. Moreover, we further examine previous reasoning concerning the strong adversarial protection that Jacobian regularization provides and show that it can be misleading.
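    A minimal PyTorch sketch of targeting the Jacobian's spectral norm directly: power iteration built from vector-Jacobian and Jacobian-vector products estimates the leading singular value, which can then be added to the loss as a penalty. The iteration count and usage are illustrative; the paper's exact scheme may differ.

        import torch

        def jacobian_spectral_norm(f, x, iters: int = 10):
            """Estimate the largest singular value of J_f(x) by power iteration,
            using only vector-Jacobian and Jacobian-vector products."""
            x = x.detach().requires_grad_(True)
            y = f(x)
            u = torch.randn_like(y)
            for _ in range(iters):
                v = torch.autograd.grad(y, x, grad_outputs=u, retain_graph=True,
                                        create_graph=True)[0]              # v = J^T u
                v = v / (v.norm() + 1e-12)
                _, Jv = torch.autograd.functional.jvp(f, (x,), (v,),
                                                      create_graph=True)   # J v
                u = (Jv / (Jv.norm() + 1e-12)).detach()
            return Jv.norm()

        # Usage: add weight * jacobian_spectral_norm(model, inputs)**2 to the loss.
        net = torch.nn.Linear(3, 2)
        sigma = jacobian_spectral_norm(net, torch.randn(3))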
    Towards a Grounded Theory of Causation for Embodied AI. (arXiv:2206.13973v1 [cs.AI])
    There exist well-developed frameworks for causal modelling, but these require rather a lot of human domain expertise to define causal variables and perform interventions. In order to enable autonomous agents to learn abstract causal models through interactive experience, the existing theoretical foundations need to be extended and clarified. Existing frameworks give no guidance regarding variable choice / representation, and more importantly, give no indication as to which behaviour policies or physical transformations of state space shall count as interventions. The framework sketched in this paper describes actions as transformations of state space, for instance induced by an agent running a policy. This makes it possible to describe in a uniform way both transformations of the micro-state space and abstract models thereof, and say when the latter is veridical / grounded / natural. We then introduce (causal) variables, define a mechanism as an invariant predictor, and say when an action can be viewed as a "surgical intervention", thus bringing the objective of causal representation & intervention skill learning into clearer focus.
    Deep Structured Prediction for Facial Landmark Detection. (arXiv:2010.09035v1 [cs.CV] CROSS LISTED)
    Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.
    Utility Theory for Sequential Decision Making. (arXiv:2206.13637v1 [cs.AI])
    The von Neumann-Morgenstern (VNM) utility theorem shows that under certain axioms of rationality, decision-making is reduced to maximizing the expectation of some utility function. We extend these axioms to increasingly structured sequential decision making settings and identify the structure of the corresponding utility functions. In particular, we show that memoryless preferences lead to a utility in the form of a per transition reward and multiplicative factor on the future return. This result motivates a generalization of Markov Decision Processes (MDPs) with this structure on the agent's returns, which we call Affine-Reward MDPs. A stronger constraint on preferences is needed to recover the commonly used cumulative sum of scalar rewards in MDPs. A yet stronger constraint simplifies the utility function for goal-seeking agents in the form of a difference in some function of states that we call potential functions. Our necessary and sufficient conditions demystify the reward hypothesis that underlies the design of rational agents in reinforcement learning by adding an axiom to the VNM rationality axioms and motivates new directions for AI research involving sequential decision making.
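    The memoryless case admits a compact rendering; a minimal sketch of the resulting affine return recursion, with $r$ the per-transition reward and $m$ the multiplicative factor on the future return named in the abstract (the paper's exact formulation may differ):

        $U(s_t, a_t, s_{t+1}, a_{t+1}, \dots) = r(s_t, a_t) + m(s_t, a_t)\, U(s_{t+1}, a_{t+1}, \dots)$

    Choosing a constant factor $m \equiv \gamma$ collapses the recursion to the familiar discounted cumulative return of standard MDPs.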
    Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning. (arXiv:2011.07881v3 [cs.LG] UPDATED)
    We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees by either a low-dimensional representation of the stochastic transition model or an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions largely remains missing. In this paper, we propose an online model-based RL algorithm, namely the CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors, $H$ is the episode length, $N$ is the total number of time steps, and $\gamma_N$ is an information-theoretic quantity related to the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
    Equivariant Priors for Compressed Sensing with Unknown Orientation. (arXiv:2206.14069v1 [cs.LG])
    In compressed sensing, the goal is to reconstruct the signal from an underdetermined system of linear measurements. Thus, prior knowledge about the signal of interest and its structure is required. Additionally, in many scenarios, the signal has an unknown orientation prior to measurements. To address such recovery problems, we propose using equivariant generative models as a prior, which encapsulate orientation information in their latent space. Thereby, we show that signals with unknown orientations can be recovered with iterative gradient descent on the latent space of these models and provide additional theoretical recovery guarantees. We construct an equivariant variational autoencoder and use the decoder as generative prior for compressed sensing. We discuss additional potential gains of the proposed approach in terms of convergence and latency.
    Neural Tangent Kernel Analysis of Deep Narrow Neural Networks. (arXiv:2202.02981v2 [cs.LG] UPDATED)
    The tremendous recent progress in analyzing the training dynamics of overparameterized neural networks has primarily focused on wide networks and therefore does not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee of infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using the NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.
    Label-enhanced Prototypical Network with Contrastive Learning for Multi-label Few-shot Aspect Category Detection. (arXiv:2206.13980v1 [cs.CL])
    Multi-label aspect category detection allows a given review sentence to contain multiple aspect categories; it is more practical for sentiment analysis and is attracting increasing attention. As annotating large amounts of data is time-consuming and labor-intensive, data scarcity occurs frequently in real-world scenarios, which motivates multi-label few-shot aspect category detection. However, research on this problem is still in its infancy and few methods are available. In this paper, we propose a novel label-enhanced prototypical network (LPN) for multi-label few-shot aspect category detection. The highlights of LPN can be summarized as follows. First, it leverages label descriptions as auxiliary knowledge to learn more discriminative prototypes, which can retain aspect-relevant information while eliminating the harmful effect of irrelevant aspects. Second, it integrates contrastive learning, which encourages sentences with the same aspect label to be pulled together in embedding space while sentences with different aspect labels are pushed apart. In addition, it introduces an adaptive multi-label inference module to predict the aspect count in the sentence, which is simple yet effective. Extensive experimental results on three datasets demonstrate that our proposed LPN model consistently achieves state-of-the-art performance.
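    A minimal PyTorch sketch of the label-enhanced prototype computation: class prototypes are means of support-set embeddings mixed with an embedding of the label description. The mixing weight and the encoders producing the embeddings are illustrative assumptions.

        import torch

        def label_enhanced_prototypes(support_emb, support_onehot, label_emb, alpha=0.5):
            """support_emb: (n, d); support_onehot: (n, c) multi-label indicators;
            label_emb: (c, d) embeddings of the label descriptions."""
            counts = support_onehot.sum(dim=0).clamp(min=1).unsqueeze(1)   # (c, 1)
            mean_proto = support_onehot.t() @ support_emb / counts         # (c, d)
            return alpha * mean_proto + (1 - alpha) * label_emb

        protos = label_enhanced_prototypes(torch.randn(10, 64),
                                           torch.randint(0, 2, (10, 3)).float(),
                                           torch.randn(3, 64))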
    Detecting potentially harmful and protective suicide-related content on twitter: A machine learning approach. (arXiv:2112.04796v3 [cs.CL] UPDATED)
    Research shows that exposure to suicide-related news media content is associated with suicide rates, with some content characteristics likely having harmful and others potentially protective effects. Although good evidence exists for a few selected characteristics, systematic large-scale investigations are missing in general, and in particular for social media data. We apply machine learning methods to classify large quantities of Twitter data according to a novel annotation scheme that distinguishes 12 categories of suicide-related tweets. We then trained a benchmark of machine learning models including a majority classifier, an approach based on word frequency (TF-IDF with a linear SVM) and two state-of-the-art deep learning models (BERT, XLNet). The two deep learning models achieved the best performance in two classification tasks: In the first task, we classified six main content categories, including personal stories about either suicidal ideation and attempts or coping, calls for action intending to spread either problem awareness or prevention-related information, reporting of suicide cases, and other tweets irrelevant to these categories. The deep learning models reached accuracy scores above 73% on average across the six categories, and F1-scores between 0.70 and 0.85 for all but the suicidal ideation and attempts category (0.51-0.55). In the second task, separating tweets referring to actual suicide from off-topic tweets, they correctly labeled around 88% of tweets, with BERT achieving F1-scores of 0.93 and 0.74 for the two categories, respectively. These classification performances are comparable to the state-of-the-art on similar tasks. By making data labeling more efficient, this work has enabled large-scale investigations on harmful and protective associations of social media content with suicide rates and help-seeking behavior.
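    A minimal sklearn sketch of the word-frequency baseline named above (TF-IDF features with a linear SVM); the example tweets and labels are placeholders.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        tweets = ["reach out, help is available", "celebrity gossip of the day"]
        labels = ["prevention", "off-topic"]

        baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
        baseline.fit(tweets, labels)
        print(baseline.predict(["you are not alone, call the hotline"]))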
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v4 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which observed factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with the correction for selection bias. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that causal models are used to address the continuous treatment effect with observational data in a medical context.
    Learning by Transference: Training Graph Neural Networks on Growing Graphs. (arXiv:2106.03693v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) use graph convolutions to exploit network invariances and learn meaningful feature representations from network data. However, on large-scale graphs convolutions incur a high computational cost, leading to scalability limitations. Leveraging the graphon -- the limit object of a sequence of graphs -- in this paper we consider the problem of learning a graphon neural network (WNN) -- the limit object of a GNN -- by training GNNs on graphs sampled from the graphon. Under smoothness conditions, we show that: (i) the expected distance between the learning steps on the GNN and on the WNN decreases asymptotically with the size of the graph, and (ii) when training on a sequence of growing graphs, gradient descent follows the learning direction of the WNN. Inspired by these results, we propose a novel algorithm to learn GNNs on large-scale graphs that, starting from a moderate number of nodes, successively increases the size of the graph during training. This algorithm is further benchmarked on a decentralized control problem, where it retains comparable performance to its large-scale counterpart at a reduced computational cost.
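    A minimal numpy sketch of the growing-graph training loop: graphs of increasing size are sampled from a graphon $W(u, v)$ and the GNN training step runs on each. The particular graphon and size schedule are illustrative assumptions, and the training step is a placeholder.

        import numpy as np

        def sample_graph(W, n, rng):
            u = rng.uniform(size=n)                       # latent node positions
            probs = W(u[:, None], u[None, :])             # edge probabilities
            A = (rng.uniform(size=(n, n)) < probs).astype(float)
            A = np.triu(A, 1)
            return A + A.T                                # symmetric, no self-loops

        W = lambda u, v: 0.8 * np.exp(-3 * np.abs(u - v)) # assumed smooth graphon
        rng = np.random.default_rng(0)
        for n in [50, 100, 200, 400]:                     # successively grow the graph
            A = sample_graph(W, n, rng)
            # train_gnn_epochs(A, ...)                    # placeholder GNN training step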
    Compressive Clustering with an Optical Processing Unit. (arXiv:2206.05928v2 [cs.LG] UPDATED)
    We explore the use of Optical Processing Units (OPU) to compute random Fourier features for sketching, and adapt the overall compressive clustering pipeline to this setting. We also propose some tools to help tune a critical hyper-parameter of compressive clustering.
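    As context, compressive clustering summarizes a dataset into a single sketch vector of averaged random Fourier features. A software stand-in for the optical projection might look as follows; the Gaussian matrix simulated in numpy is an assumption standing in for the OPU hardware.

        import numpy as np

        def random_fourier_features(X, m, sigma, rng):
            # Software stand-in for the OPU: the optical hardware performs the
            # random projection; here a Gaussian matrix simulates it.
            W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], m))
            b = rng.uniform(0.0, 2.0 * np.pi, size=m)
            return np.sqrt(2.0 / m) * np.cos(X @ W + b)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 10))
        # The dataset sketch is the average of the random features.
        sketch = random_fourier_features(X, m=256, sigma=1.0, rng=rng).mean(axis=0)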
    BeamsNet: A data-driven Approach Enhancing Doppler Velocity Log Measurements for Autonomous Underwater Vehicle Navigation. (arXiv:2206.13603v1 [cs.RO])
    Autonomous underwater vehicles (AUV) perform various applications such as seafloor mapping and underwater structure health monitoring. Commonly, an inertial navigation system aided by a Doppler velocity log (DVL) is used to provide the vehicle's navigation solution. In such a fusion, the DVL provides the velocity vector of the AUV, which determines the navigation solution's accuracy and helps estimate the navigation states. This paper proposes BeamsNet, an end-to-end deep learning framework to regress the estimated DVL velocity vector, which improves the accuracy of the velocity vector estimate and could replace the model-based approach. Two versions of BeamsNet, differing in their input to the network, are suggested. The first uses the current DVL beam measurements and inertial sensor data, while the other utilizes only DVL data, taking the current and past DVL measurements for the regression process. Both simulations and sea experiments were conducted to validate the proposed learning approach against the model-based approach. The sea experiments were carried out with the Snapir AUV in the Mediterranean Sea, collecting approximately four hours of DVL and inertial sensor data. Our results show that the proposed approach achieved an improvement of more than 60% in estimating the DVL velocity vector.
    Deep Neural Networks pruning via the Structured Perspective Regularization. (arXiv:2206.14056v1 [cs.LG])
    In Machine Learning, Artificial Neural Networks (ANNs) are a very powerful tool, broadly used in many applications. Often, the selected (deep) architectures include many layers, and therefore a large amount of parameters, which makes training, storage and inference expensive. This motivated a stream of research about compressing the original networks into smaller ones without excessively sacrificing performance. Among the many proposed compression approaches, one of the most popular is \emph{pruning}, whereby entire elements of the ANN (links, nodes, channels, \ldots) and the corresponding weights are deleted. Since the nature of the problem is inherently combinatorial (which elements to prune and which not), we propose a new pruning method based on Operational Research tools. We start from a natural Mixed-Integer-Programming model for the problem, and we use the Perspective Reformulation technique to strengthen its continuous relaxation. Projecting away the indicator variables from this reformulation yields a new regularization term, which we call the Structured Perspective Regularization, that leads to structured pruning of the initial architecture. We test our method on some ResNet architectures applied to the CIFAR-10, CIFAR-100 and ImageNet datasets, obtaining competitive performance w.r.t.~the state of the art for structured pruning.
    Learning the Evolutionary and Multi-scale Graph Structure for Multivariate Time Series Forecasting. (arXiv:2206.13816v1 [cs.LG])
    Recent studies have shown great promise in applying graph neural networks to multivariate time series forecasting, where the interactions of time series are described as a graph structure and the variables are represented as the graph nodes. Along this line, existing methods usually assume that the graph structure (or the adjacency matrix), which determines the aggregation manner of the graph neural network, is fixed either by definition or by self-learning. However, the interactions of variables can be dynamic and evolutionary in real-world scenarios. Furthermore, the interactions of time series are quite different if they are observed at different time scales. To equip the graph neural network with a flexible and practical graph structure, in this paper, we investigate how to model the evolutionary and multi-scale interactions of time series. In particular, we first provide a hierarchical graph structure combined with dilated convolution to capture the scale-specific correlations among time series. Then, a series of adjacency matrices are constructed in a recurrent manner to represent the evolving correlations at each layer. Moreover, a unified neural network is provided to integrate the components above to obtain the final prediction. In this way, we can capture the pair-wise correlations and temporal dependency simultaneously. Finally, experiments on both single-step and multi-step forecasting tasks demonstrate the superiority of our method over the state-of-the-art approaches.
    Perceived Overlap: A Prerequisite for VAE Disentanglement. (arXiv:2202.13341v2 [cs.LG] UPDATED)
    Learning disentangled representations with variational autoencoders (VAEs) is often attributed to the regularisation component of the loss. In this work, we highlight the interaction between data and the reconstruction term of the loss as the main contributor to disentanglement in VAEs. We note that standardised benchmark datasets are constructed in ways that are conducive to learning what appear to be disentangled representations. We design an intuitive adversarial dataset that exploits this mechanism to break existing state-of-the-art disentanglement frameworks. Finally, we supply a solution that enables disentanglement by modifying the reconstruction loss, affecting how VAEs perceive distances between data points.
    Structural Entropy Guided Graph Hierarchical Pooling. (arXiv:2206.13510v1 [cs.LG])
    Following the success of convolution on non-Euclidean spaces, the corresponding pooling approaches have also been validated on various tasks regarding graphs. However, because of the fixed compression quota and stepwise pooling design, these hierarchical pooling methods still suffer from local structure damage and a suboptimality problem. In this work, inspired by structural entropy, we propose a hierarchical pooling approach, SEP, to tackle these two issues. Specifically, without assigning layer-specific compression quotas, a global optimization algorithm is designed to generate the cluster assignment matrices for pooling at once. Then, we present an illustration of the local structure damage caused by previous methods in the reconstruction of ring and grid synthetic graphs. In addition to SEP, we further design two classification models, SEP-G and SEP-N, for graph classification and node classification, respectively. The results show that SEP outperforms state-of-the-art graph pooling methods on graph classification benchmarks and obtains superior performance on node classification.
    Fast Simulation of Particulate Suspensions Enabled by Graph Neural Network. (arXiv:2206.13905v1 [cs.LG])
    Predicting the dynamic behaviors of particles in suspension subject to hydrodynamic interaction (HI) and external drive can be critical for many applications. By harnessing advanced deep learning techniques, the present work introduces a new framework, the hydrodynamic interaction graph neural network (HIGNN), for inferring and predicting the particles' dynamics in Stokes suspensions. It overcomes the limitations of traditional approaches in computational efficiency, accuracy, and/or transferability. In particular, by uniting the data structure represented by a graph and neural networks with learnable parameters, the HIGNN constructs a surrogate model for the mobility tensor of particles, which is the key to predicting the dynamics of particles subject to HI and external forces. To account for the many-body nature of HI, we generalize the state-of-the-art GNN by introducing higher-order connectivity into the graph and the corresponding convolutional operation. For training the HIGNN, we only need the data for a small number of particles in the domain of interest, and hence the training cost can be kept low. Once constructed, the HIGNN permits fast predictions of the particles' velocities and is transferable to suspensions of different numbers/concentrations of particles in the same domain and to any external forcing. It has the ability to accurately capture both the long-range HI and short-range lubrication effects. We demonstrate the accuracy, efficiency, and transferability of the proposed HIGNN framework in a variety of systems. The computing-resource requirements are minimal: most simulations require only a desktop with one GPU; the simulations for a large suspension of 100,000 particles call for up to 6 GPUs.
    Information Entropy Initialized Concrete Autoencoder for Optimal Sensor Placement and Reconstruction of Geophysical Fields. (arXiv:2206.13968v1 [cs.LG])
    We propose a new approach to the optimal placement of sensors for the problem of reconstructing geophysical fields from sparse measurements. Our method consists of two stages. In the first stage, we estimate the variability of the physical field as a function of spatial coordinates by approximating its information entropy through the Conditional PixelCNN network. To calculate the entropy, a new ordering of a two-dimensional data array (spiral ordering) is proposed, which makes it possible to obtain the entropy of a physical field simultaneously for several spatial scales. In the second stage, the entropy of the physical field is used to initialize the distribution of optimal sensor locations. This distribution is further optimized with the Concrete Autoencoder architecture with the straight-through gradient estimator and adversarial loss to simultaneously minimize the number of sensors and maximize reconstruction accuracy. Our method scales linearly with data size, unlike the commonly used Principal Component Analysis. We demonstrate our method on two examples: (a) temperature and (b) salinity fields around the Barents Sea and the Svalbard group of islands. We compute the reconstruction error of our method against two baselines: (1) PCA with QR factorization and (2) climatology. We find that the obtained optimal sensor locations have a clear physical interpretation and correspond to the boundaries between sea currents.
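    For intuition, a spiral ordering of a 2D array can be produced in a few lines. The version below (outer ring inward) is one plausible reading of the ordering described above, not necessarily the authors' exact definition.

        import numpy as np

        def spiral_order(a):
            # Flatten a 2D array in spiral order, outer ring inward.
            a = np.asarray(a)
            out = []
            while a.size:
                out.extend(a[0])       # take the top row, left to right
                a = a[1:].T[::-1]      # rotate the remainder counter-clockwise
            return np.array(out)

        print(spiral_order(np.arange(9).reshape(3, 3)))
        # -> [0 1 2 5 8 7 6 3 4]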
    Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer. (arXiv:2110.02544v2 [cs.LG] UPDATED)
    Recently, Transformer has become a prevailing deep architecture for solving vehicle routing problems (VRPs). However, it is less effective in learning improvement models for VRP because its positional encoding (PE) method is not suitable in representing VRP solutions. This paper presents a novel Dual-Aspect Collaborative Transformer (DACT) to learn embeddings for the node and positional features separately, instead of fusing them together as done in existing ones, so as to avoid potential noises and incompatible correlations. Moreover, the positional features are embedded through a novel cyclic positional encoding (CPE) method to allow Transformer to effectively capture the circularity and symmetry of VRP solutions (i.e., cyclic sequences). We train DACT using Proximal Policy Optimization and design a curriculum learning strategy for better sample efficiency. We apply DACT to solve the traveling salesman problem (TSP) and capacitated vehicle routing problem (CVRP). Results show that our DACT outperforms existing Transformer based improvement models, and exhibits much better generalization performance across different problem sizes on synthetic and benchmark instances, respectively.
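    The key property of a cyclic positional encoding is that the first and last positions of a tour are encoded as neighbours. One simple way to get this behaviour, shown below, maps positions to angles on a circle; this is an illustrative variant, not necessarily the paper's exact CPE formula.

        import numpy as np

        def cyclic_positional_encoding(n, d):
            # Map position i to an angle on the circle so that position n
            # wraps back to position 0; d must be even here.
            theta = 2.0 * np.pi * np.arange(n)[:, None] / n
            k = np.arange(1, d // 2 + 1)[None, :]
            return np.concatenate([np.sin(k * theta), np.cos(k * theta)], axis=1)

        pe = cyclic_positional_encoding(n=20, d=8)
        # The wrap-around pair (19, 0) is as close as any adjacent pair:
        print(np.linalg.norm(pe[19] - pe[0]), np.linalg.norm(pe[0] - pe[1]))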
    DayDreamer: World Models for Physical Robot Learning. (arXiv:2206.14176v1 [cs.RO])
    To solve tasks in complex environments, robots need to learn from experience. Deep reinforcement learning is a common approach to robot learning but requires a large amount of trial and error to learn, limiting its deployment in the physical world. As a consequence, many advances in robot learning rely on simulators. However, learning in simulators fails to capture the complexity of the real world, is prone to simulator inaccuracies, and the resulting behaviors do not adapt to changes in the world. The Dreamer algorithm has recently shown great promise for learning from small amounts of interaction by planning within a learned world model, outperforming pure reinforcement learning in video games. Learning a world model to predict the outcomes of potential actions enables planning in imagination, reducing the amount of trial and error needed in the real environment. However, it is unknown whether Dreamer can facilitate faster learning on physical robots. In this paper, we apply Dreamer to 4 robots to learn online and directly in the real world, without simulators. Dreamer trains a quadruped robot to roll off its back, stand up, and walk from scratch and without resets in only 1 hour. We then push the robot and find that Dreamer adapts within 10 minutes to withstand perturbations or quickly roll over and stand back up. On two different robotic arms, Dreamer learns to pick and place multiple objects directly from camera images and sparse rewards, approaching human performance. On a wheeled robot, Dreamer learns to navigate to a goal position purely from camera images, automatically resolving ambiguity about the robot orientation. Using the same hyperparameters across all experiments, we find that Dreamer is capable of online learning in the real world, establishing a strong baseline. We release our infrastructure for future applications of world models to robot learning.
    Deep Symbolic Regression for Recurrent Sequences. (arXiv:2201.04600v2 [cs.LG] UPDATED)
    Symbolic regression, i.e. predicting a function from the observation of its values, is well-known to be a challenging task. In this paper, we train Transformers to infer the function or recurrence relation underlying sequences of integers or floats, a typical task in human IQ tests which has hardly been tackled in the machine learning literature. We evaluate our integer model on a subset of OEIS sequences, and show that it outperforms built-in Mathematica functions for recurrence prediction. We also demonstrate that our float model is able to yield informative approximations of out-of-vocabulary functions and constants, e.g. $\operatorname{bessel0}(x)\approx \frac{\sin(x)+\cos(x)}{\sqrt{\pi x}}$ and $1.644934\approx \pi^2/6$. An interactive demonstration of our models is provided at https://symbolicregression.metademolab.com.
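    A quick numerical check of the quoted float approximations, using scipy's Bessel function as the reference:

        import numpy as np
        from scipy.special import j0

        x = np.linspace(5, 50, 1000)
        approx = (np.sin(x) + np.cos(x)) / np.sqrt(np.pi * x)
        print(np.max(np.abs(j0(x) - approx)))  # small, and shrinking as x grows
        print(np.pi ** 2 / 6)                  # 1.6449340668...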
    Fire Together Wire Together: A Dynamic Pruning Approach with Self-Supervised Mask Prediction. (arXiv:2110.08232v3 [cs.CV] UPDATED)
    Dynamic model pruning is a recent direction that allows for the inference of a different sub-network for each input sample during deployment. However, current dynamic methods rely on learning a continuous channel gating through regularization by inducing sparsity loss. This formulation introduces complexity in balancing different losses (e.g. task loss, regularization loss). In addition, regularization-based methods lack a transparent way of selecting the tradeoff hyperparameter to realize a computational budget. Our contribution is two-fold: 1) decoupled task and pruning training; 2) simple hyperparameter selection that enables FLOPs reduction estimation before training. Inspired by the Hebbian theory in neuroscience, "neurons that fire together wire together", we propose to predict a mask to process k filters in a layer based on the activation of its previous layer. We pose the problem as a self-supervised binary classification problem. Each mask predictor module is trained to predict if the log-likelihood for each filter in the current layer belongs to the top-k activated filters. The value k is dynamically estimated for each input based on a novel criterion using the mass of heatmaps. We show experiments on several neural architectures, such as VGG, ResNet and MobileNet, on the CIFAR and ImageNet datasets. On CIFAR, we reach similar accuracy to SOTA methods with 15% and 24% higher FLOPs reduction. Similarly on ImageNet, we achieve a lower drop in accuracy with up to 13% improvement in FLOPs reduction.
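    Schematically, the per-layer mask predictor keeps only the top-k output filters, scored from the previous layer's activations. A hedged PyTorch sketch follows; the module names, the pooling choice, and the linear predictor are assumptions for illustration.

        import torch

        def predict_mask(prev_activation, predictor, k):
            # Score every output filter from the previous layer's pooled
            # activations, then keep the k best per sample.
            scores = predictor(prev_activation.mean(dim=(2, 3)))    # (B, C_out)
            topk = scores.topk(k, dim=1).indices
            return torch.zeros_like(scores).scatter_(1, topk, 1.0)  # binary mask

        predictor = torch.nn.Linear(16, 32)      # hypothetical mask predictor
        x = torch.randn(4, 16, 8, 8)             # previous layer activations
        mask = predict_mask(x, predictor, k=8)   # multiply into the conv output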
    Solving the Real Robot Challenge using Deep Reinforcement Learning. (arXiv:2109.15233v3 [cs.RO] UPDATED)
    This paper details our winning submission to Phase 1 of the 2021 Real Robot Challenge; a challenge in which a three-fingered robot must carry a cube along specified goal trajectories. To solve Phase 1, we use a pure reinforcement learning approach which requires minimal expert knowledge of the robotic system, or of robotic grasping in general. A sparse, goal-based reward is employed in conjunction with Hindsight Experience Replay to teach the control policy to move the cube to the desired x and y coordinates of the goal. Simultaneously, a dense distance-based reward is employed to teach the policy to lift the cube to the z coordinate (the height component) of the goal. The policy is trained in simulation with domain randomisation before being transferred to the real robot for evaluation. Although performance tends to worsen after this transfer, our best policy can successfully lift the real cube along goal trajectories via an effective pinching grasp. Our approach outperforms all other submissions, including those leveraging more traditional robotic control techniques, and is the first pure learning-based method to solve this challenge.
    On bounds for norms of reparameterized ReLU artificial neural network parameters: sums of fractional powers of the Lipschitz norm control the network parameter vector. (arXiv:2206.13646v1 [cs.LG])
    It is an elementary fact in the scientific literature that the Lipschitz norm of the realization function of a feedforward fully-connected rectified linear unit (ReLU) artificial neural network (ANN) can, up to a multiplicative constant, be bounded from above by sums of powers of the norm of the ANN parameter vector. Roughly speaking, in this work we reveal in the case of shallow ANNs that the converse inequality is also true. More formally, we prove that the norm of the equivalence class of ANN parameter vectors with the same realization function is, up to a multiplicative constant, bounded from above by the sum of powers of the Lipschitz norm of the ANN realization function (with the exponents $ 1/2 $ and $ 1 $). Moreover, we prove that this upper bound only holds when employing the Lipschitz norm but holds neither for H\"older norms nor for Sobolev-Slobodeckij norms. Furthermore, we prove that this upper bound only holds for sums of powers of the Lipschitz norm with the exponents $ 1/2 $ and $ 1 $ but does not hold for the Lipschitz norm alone.
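    In symbols, the main inequality can be paraphrased as $\| [\theta] \| \leq C \, \big( \operatorname{Lip}(\mathcal{R}_\theta)^{1/2} + \operatorname{Lip}(\mathcal{R}_\theta) \big)$, where $[\theta]$ denotes the equivalence class of parameter vectors realizing the same function $\mathcal{R}_\theta$ and $C$ is a constant; this is our paraphrase of the statement above, not a quotation of the paper's theorem.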
    Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for Scalar Data -- An Algorithm and A Benchmark. (arXiv:2206.13932v1 [cs.LG])
    This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field f defined on a d-dimensional simplicial complex K, with $d \leq 3$. Our method extends the seminal "PairCells" algorithm by introducing three main accelerations. First, we express this algorithm within the setting of discrete Morse theory, which considerably reduces the number of input simplices to consider. Second, we introduce a stratification approach to the problem, which we call "sandwiching". Specifically, minima-saddle persistence pairs ($D_0(f)$) and saddle-maximum persistence pairs ($D_{d-1}(f)$) are efficiently computed by respectively processing with a Union-Find the unstable sets of 1-saddles and the stable sets of (d-1)-saddles. This fast processing of dimensions 0 and (d-1) further and drastically reduces the number of critical simplices to consider for the computation of $D_1(f)$, the intermediate layer of the sandwich. Third, we document several performance improvements via shared-memory parallelism. We provide an open-source implementation of our algorithm for reproducibility purposes. We also contribute a reproducible benchmark package, which exploits three-dimensional data from a public repository and compares our algorithm to a variety of publicly available implementations. Extensive experiments indicate that our algorithm improves by two orders of magnitude the time performance of the seminal "PairCells" algorithm it extends. Moreover, it also improves memory footprint and time performance over a selection of 14 competing approaches, with a substantial gain over the fastest available approaches, while producing a strictly identical output. We illustrate the utility of our contributions with an application to the fast and robust extraction of persistent 1-dimensional generators on surfaces, volume data and high-dimensional point clouds.
    Verifiable Goal Recognition for Autonomous Driving with Occlusions. (arXiv:2206.14163v1 [cs.RO])
    When used in autonomous driving, goal recognition allows the future behaviour of other vehicles to be more accurately predicted. A recent goal recognition method for autonomous vehicles, GRIT, has been shown to be fast, accurate, interpretable and verifiable. In autonomous driving, vehicles can encounter novel scenarios that were unseen during training, and the environment is partially observable due to occlusions. However, GRIT can only operate in fixed frame scenarios, with full observability. We present a novel goal recognition method named Goal Recognition with Interpretable Trees under Occlusion (OGRIT), which solves these shortcomings of GRIT. We demonstrate that OGRIT can generalise between different scenarios and handle missing data due to occlusions, while still being fast, accurate, interpretable and verifiable.
    Efficient Algorithms For Fair Clustering with a New Fairness Notion. (arXiv:2109.00708v3 [cs.LG] UPDATED)
    We revisit the problem of fair clustering, first introduced by Chierichetti et al., which requires each protected attribute to have approximately equal representation in every cluster, i.e., a balance property. Existing solutions to fair clustering are either not scalable or do not achieve an optimal trade-off between the clustering objective and fairness. In this paper, we propose a new notion of fairness, which we call $\tau$-fair fairness, that strictly generalizes the balance property and enables a fine-grained efficiency vs. fairness trade-off. Furthermore, we show that simple greedy round-robin based algorithms achieve this trade-off efficiently. Under a more general setting of multi-valued protected attributes, we rigorously analyze the theoretical properties of our algorithms. Our experimental results suggest that the proposed solution outperforms all the state-of-the-art algorithms and works exceptionally well even for a large number of clusters.
    Finite-sample analysis of identification of switched linear systems with arbitrary or restricted switching. (arXiv:2203.09862v2 [eess.SY] UPDATED)
    For the identification of switched systems with a measured switching signal, this work aims to analyze the effect of switching strategies on the estimation error. The data for identification is assumed to be collected from globally asymptotically or marginally stable switched systems under switches that are arbitrary or subject to an average dwell-time constraint. The switched system is then estimated by the least-squares (LS) estimator. To capture the effect of the parameters of the switching strategies on the LS estimation error, finite-sample error bounds are developed in this work. The obtained error bounds show that the estimation error is logarithmic in the switching parameters when there are only stable modes; however, when there are unstable modes, the estimation error bound can increase linearly as the switching parameter changes. This suggests that in the presence of unstable modes, the switching strategy should be properly designed to avoid a significant increase of the estimation error.
    Bellman Residual Orthogonalization for Offline Reinforcement Learning. (arXiv:2203.12786v2 [cs.LG] UPDATED)
    We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
    Epidemic Control Modeling using Parsimonious Models and Markov Decision Processes. (arXiv:2206.13910v1 [q-bio.PE])
    Many countries have experienced at least two waves of the COVID-19 pandemic. The second wave is far more dangerous, as distinct strains appear more harmful to human health, and it often stems from complacency about the first wave. This paper introduces a parsimonious yet representative stochastic epidemic model that simulates the uncertain spread of the disease regardless of the latency and recovery time distributions. We also propose a Markov decision process to seek an optimal trade-off between the usage of the healthcare system and the economic costs of an epidemic. We apply the model to COVID-19 data from New Delhi, India, and simulate the epidemic spread with different policy review times. The results show that the optimal policy acts swiftly to curb the epidemic in the first wave, thus avoiding the collapse of the healthcare system and the future costs of posterior outbreaks. An analysis of the recent collapse of the healthcare system of India during the second COVID-19 wave suggests that many lives could have been preserved if swift mitigation had been promoted after the first wave.
    HyperNTF: A Hypergraph Regularized Nonnegative Tensor Factorization for Dimensionality Reduction. (arXiv:2101.06827v3 [cs.LG] UPDATED)
    Tensor decomposition is an effective tool for learning multi-way structures and heterogeneous features from high-dimensional data, such as multi-view images and multichannel electroencephalography (EEG) signals, which are often represented as tensors. However, most tensor decomposition methods are linear feature extraction techniques, which are unable to reveal the nonlinear structure within high-dimensional data. To address this problem, many algorithms have been proposed that simultaneously perform linear and nonlinear feature extraction. A representative algorithm is Graph Regularized Non-negative Matrix Factorization (GNMF) for image clustering. However, an ordinary 2-order graph can only model the pairwise similarity of objects, which cannot sufficiently exploit the complex structure of samples. Thus, we propose a novel method, named Hypergraph Regularized Non-negative Tensor Factorization (HyperNTF), which utilizes a hypergraph to encode the complex connections among samples and employs the factor matrix corresponding to the last mode of the Canonical Polyadic (CP) decomposition as the low-dimensional representation. Extensive experiments on synthetic manifolds, real-world image datasets, and EEG signals demonstrate that HyperNTF outperforms state-of-the-art methods in terms of dimensionality reduction, clustering, and classification.
    SurvTRACE: Transformers for Survival Analysis with Competing Events. (arXiv:2110.00855v2 [cs.LG] UPDATED)
    In medicine, survival analysis studies the time duration to events of interest such as mortality. One major challenge is how to deal with multiple competing events (e.g., multiple disease diagnoses). In this work, we propose a transformer-based model that makes no assumption about the underlying survival distribution and is capable of handling competing events, namely SurvTRACE. We account for the implicit \emph{confounders} in the observational setting in multi-event scenarios, which cause selection bias because the predicted survival probability is influenced by irrelevant factors. To sufficiently utilize the survival data to train transformers from scratch, multiple auxiliary tasks are designed for multi-task learning. The model hence learns a strong shared representation from all these tasks, which in turn serves better survival analysis. We further demonstrate how to inspect covariate relevance and importance through the interpretable attention mechanisms of SurvTRACE, which shows great potential for enhancing clinical trial design and new treatment development. Experiments on METABRIC, SUPPORT, and SEER data with 470k patients validate the all-around superiority of our method.
    Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection. (arXiv:2206.13979v1 [cs.SD])
    Audio DeepFakes allow the creation of high-quality, convincing utterances and therefore pose a threat due to their potential applications such as impersonation or fake news. Methods for detecting these manipulations should be characterized by good generalization and stability, leading to robustness against attacks conducted with techniques that are not explicitly included in the training. In this work, we introduce the Attack Agnostic Dataset - a combination of two audio DeepFake and one anti-spoofing datasets that, thanks to the disjoint use of attacks, can lead to better generalization of detection methods. We present a thorough analysis of current DeepFake detection methods and consider different audio features (front-ends). In addition, we propose a model based on LCNN with LFCC and mel-spectrogram front-ends, which not only is characterized by good generalization and stability results but also shows improvement over the LFCC-based model: we decrease the standard deviation on all folds and the EER on two folds by up to 5%.
    mcBERT: Momentum Contrastive Learning with BERT for Zero-Shot Slot Filling. (arXiv:2203.12940v2 [cs.CL] UPDATED)
    Zero-shot slot filling has received considerable attention to cope with the problem of limited available data for the target domain. One of the important factors in zero-shot learning is to make the model learn generalized and reliable representations. For this purpose, we present mcBERT, which stands for momentum contrastive learning with BERT, to develop a robust zero-shot slot filling model. mcBERT uses BERT to initialize the two encoders, the query encoder and key encoder, and is trained by applying momentum contrastive learning. Our experimental results on the SNIPS benchmark show that mcBERT substantially outperforms the previous models, recording a new state-of-the-art. Besides, we also show that each component composing mcBERT contributes to the performance improvement.
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v2 [cs.LG] UPDATED)
    Deep learning experiments by Cohen et al. [2021] using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when the learning rate (LR) and sharpness (i.e., the largest eigenvalue of the Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/\mathrm{LR}$ and the loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to the non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias that rely either on infinitesimal updates or on noise in the gradient. Formally, for any smooth function $L$ with a certain regularity condition, this effect is demonstrated for (1) Normalized GD, i.e., GD with a varying LR $\eta_t = \frac{\eta}{\| \nabla L(x(t)) \|}$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L - \min_x L(x)}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{1}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.
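    Case (1) is simple to state in code. A toy sketch of normalized GD on a quadratic follows; the test function, step size, and step count are illustrative choices, not the paper's experimental setup.

        import numpy as np

        def normalized_gd(grad_fn, x, eta, steps):
            # Normalized GD: eta_t = eta / ||grad L(x_t)||, as in case (1).
            for _ in range(steps):
                g = grad_fn(x)
                x = x - eta * g / (np.linalg.norm(g) + 1e-12)
            return x

        A = np.diag([10.0, 1.0])  # toy quadratic L(x) = x^T A x / 2
        x = normalized_gd(lambda x: A @ x, np.array([1.0, 1.0]), eta=0.1, steps=100)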
    Extracting Targeted Training Data from ASR Models, and How to Mitigate It. (arXiv:2204.08345v2 [cs.SD] UPDATED)
    Recent work has designed methods to demonstrate that model updates in ASR training can leak potentially sensitive attributes of the utterances used in computing the updates. In this work, we design the first method to demonstrate information leakage about training data from trained ASR models. We design Noise Masking, a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models. We demonstrate the success of Noise Masking by using it in four settings for extracting names from the LibriSpeech dataset used for training a state-of-the-art Conformer model. In particular, we show that we are able to extract the correct names from masked training utterances with 11.8% accuracy, while the model outputs some name from the train set 55.2% of the time. Further, we show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate). Lastly, we design Word Dropout, a data augmentation method that, when used in training along with Multistyle TRaining (MTR), provides comparable utility to the baseline, along with significantly mitigating extraction via Noise Masking across the four evaluated settings.
    AutoInit: Automatic Initialization via Jacobian Tuning. (arXiv:2206.13568v1 [stat.ML])
    Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial-and-error approach, which has to be applied anew every time an architecture is substantially modified, or inherited from smaller networks, leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm that allows one to find a good initialization automatically for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.
    Patch Selection for Melanoma Classification. (arXiv:2206.13626v1 [cs.CV])
    In medical image processing, the most important information is often located in small parts of the image. Patch-based approaches aim at using only the most relevant parts of the image. Finding ways to automatically select the patches is a challenge. In this paper, we investigate two criteria to choose patches: entropy and a spectral similarity criterion. We perform experiments with different patch sizes. We train a Convolutional Neural Network on the subsets of patches and analyze the training time. We find that, in addition to requiring less preprocessing time, the classifiers trained on the datasets of patches selected based on entropy converge faster than those selected based on the spectral similarity criterion and, furthermore, lead to higher accuracy. Moreover, patches of high entropy lead to faster convergence and better accuracy than patches of low entropy.
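    An entropy-based selector of the kind described can be sketched in a few lines. Here patches are non-overlapping tiles and a histogram entropy is computed per patch; the tile size, bin count, and grey-scale range are assumptions, not the paper's settings.

        import numpy as np

        def patch_entropy(patch, bins=32):
            # Shannon entropy of the grey-level histogram of one patch.
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()
            p = p[p > 0]
            return -(p * np.log2(p)).sum()

        def select_patches(image, size, n_keep):
            # Tile the image into non-overlapping size x size patches and
            # keep the n_keep patches with the highest entropy.
            h, w = image.shape
            patches = [image[i:i + size, j:j + size]
                       for i in range(0, h - size + 1, size)
                       for j in range(0, w - size + 1, size)]
            patches.sort(key=patch_entropy, reverse=True)
            return patches[:n_keep]

        img = np.random.rand(128, 128)   # placeholder for a medical image
        top = select_patches(img, size=32, n_keep=4)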
    Exploring linguistic feature and model combination for speech recognition based automatic AD detection. (arXiv:2206.13758v1 [cs.LG])
    Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying progression. Speech-based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. The scarcity of such specialist data leads to uncertainty in both model selection and feature learning when developing such systems. To this end, this paper investigates the use of feature and model combination approaches to improve the robustness of domain fine-tuning of BERT and RoBERTa pre-trained text encoders on limited data, before the resulting embedding features are fed into an ensemble of backend classifiers to produce the final AD detection decision via majority voting. Experiments conducted on the ADReSS20 Challenge dataset suggest that consistent performance improvements were obtained using model and feature combination in system development. State-of-the-art AD detection accuracies of 91.67 percent and 93.75 percent were obtained using manual and ASR speech transcripts respectively on the ADReSS20 test set consisting of 48 elderly speakers.
    Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach. (arXiv:2206.13984v1 [cs.IT])
    One of the main focuses in distributed learning is communication efficiency, since model aggregation at each round of training can involve millions to billions of parameters. Several model compression methods, such as gradient quantization and sparsification, have been proposed to improve the communication efficiency of model aggregation. However, the information-theoretic minimum communication cost for a given distortion of gradient estimators is still unknown. In this paper, we study the fundamental limit of the communication cost of model aggregation in distributed learning from a rate-distortion perspective. By formulating model aggregation as a vector Gaussian CEO problem, we derive the rate region bound and the sum-rate-distortion function for the model aggregation problem, which reveals the minimum communication rate at a particular gradient distortion upper bound. We also analyze the communication cost at each iteration and the total communication cost based on the sum-rate-distortion function with the gradient statistics of real-world datasets. It is found that the communication gain from exploiting the correlation between worker nodes is significant for SignSGD, and a high distortion of the gradient estimator can achieve a low total communication cost in gradient compression.
    Statistical inference with implicit SGD: proximal Robbins-Monro vs. Polyak-Ruppert. (arXiv:2206.12663v2 [stat.ML] UPDATED)
    The implicit stochastic gradient descent (ISGD), a proximal version of SGD, is gaining interest in the literature due to its stability over (explicit) SGD. In this paper, we conduct an in-depth analysis of the two modes of ISGD for smooth convex functions, namely the proximal Robbins-Monro (proxRM) and proximal Polyak-Ruppert (proxPR) procedures, for their use in statistical inference on model parameters. Specifically, we derive non-asymptotic point estimation error bounds of both proxRM and proxPR iterates and their limiting distributions, and propose on-line estimators of their asymptotic covariance matrices that require only a single run of ISGD. The latter estimators are used to construct valid confidence intervals for the model parameters. Our analysis is free of the generalized linear model assumption that has limited the preceding analyses, and employs feasible procedures. Our on-line covariance matrix estimators appear to be the first of this kind in the ISGD literature.
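    For intuition about the implicit update, consider least squares, where the implicit step $x_{t+1} = x_t - \eta_t a_i (a_i^\top x_{t+1} - b_i)$ has a closed-form solution. The sketch below uses this standard derivation; the $1/t$ step-size schedule is an assumption for illustration.

        import numpy as np

        def isgd_least_squares(A, b, eta0=1.0, passes=5):
            # Implicit SGD for least squares: solving the implicit equation
            # gives residual r = (a^T x - b) / (1 + eta * ||a||^2), and then
            # x <- x - eta * r * a.
            n, d = A.shape
            x = np.zeros(d)
            t = 0
            for _ in range(passes):
                for i in range(n):
                    t += 1
                    eta = eta0 / t
                    a, y = A[i], b[i]
                    resid = (a @ x - y) / (1.0 + eta * (a @ a))
                    x = x - eta * resid * a
            return x

        rng = np.random.default_rng(0)
        A = rng.normal(size=(200, 5))
        x_true = rng.normal(size=5)
        x_hat = isgd_least_squares(A, A @ x_true, passes=20)  # close to x_true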
    Multi-Agent Reinforcement Learning is a Sequence Modeling Problem. (arXiv:2205.14953v2 [cs.MA] UPDATED)
    Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and recently reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the prosperous development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that effectively casts cooperative multi-agent reinforcement learning (MARL) into SM problems wherein the task is to map agents' observation sequences to agents' optimal action sequences. Our goal is to build the bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to our MAT is an encoder-decoder architecture which leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision-making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior art such as the Decision Transformer, which fits only pre-collected offline data, MAT is trained by online trial and error from the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on the StarCraftII, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
    Domain Agnostic Few-shot Learning for Speaker Verification. (arXiv:2206.13700v1 [cs.SD])
    Deep learning models for verification systems often fail to generalize to new users and new environments, even though they learn highly discriminative features. To address this problem, we propose a few-shot domain generalization framework that learns to tackle distribution shift for new users and new domains. Our framework consists of domain-specific and domain-aggregation networks, which are the experts on specific and combined domains, respectively. By using these networks, we generate episodes that mimic the presence of both novel users and novel domains in the training phase to eventually produce better generalization. To save memory, we reduce the number of domain-specific networks by clustering similar domains together. Upon extensive evaluation on artificially generated noise domains, we explicitly show the generalization ability of our framework. In addition, we apply our proposed methods to the existing competitive architecture on the standard benchmark, which shows further performance improvements.
    On the amplification of security and privacy risks by post-hoc explanations in machine learning models. (arXiv:2206.14004v1 [cs.LG])
    A variety of explanation methods have been proposed in recent years to help users gain insights into the results returned by neural networks, which are otherwise complex and opaque black-boxes. However, explanations give rise to potential side-channels that can be leveraged by an adversary for mounting attacks on the system. In particular, post-hoc explanation methods that highlight input dimensions according to their importance or relevance to the result also leak information that weakens security and privacy. In this work, we perform the first systematic characterization of the privacy and security risks arising from various popular explanation techniques. First, we propose novel explanation-guided black-box evasion attacks that lead to a tenfold reduction in query count for the same success rate. We show that the adversarial advantage from explanations can be quantified as a reduction in the total variance of the estimated gradient. Second, we revisit the membership information leaked by common explanations. Contrary to observations in prior studies, via our modified attacks we show significant leakage of membership information (above 100% improvement over prior results), even in a much stricter black-box setting. Finally, we study explanation-guided model extraction attacks and demonstrate adversarial gains through a large reduction in query count.
    Dynamic Memory for Interpretable Sequential Optimisation. (arXiv:2206.13960v1 [cs.LG])
    Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant to minimising regret. We present a solution to handling non-stationarity that is suitable for deployment at scale, to provide business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits at a different point on this trade-off when compared to another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service where our agent achieves interpretability whilst adapting to changing circumstances.
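    The paper's agent sizes its memory via statistical power targets. As a much simpler illustration of why forgetting helps under non-stationarity, the sketch below uses a fixed sliding window per arm; the window length, the noise-based exploration, and the reward schedule are all illustrative assumptions, not the authors' method.

        import random
        from collections import deque

        class SlidingWindowArm:
            # Fixed-length memory: rewards older than `window` pulls are
            # forgotten, so the estimate tracks a drifting reward.
            def __init__(self, window):
                self.rewards = deque(maxlen=window)
            def update(self, r):
                self.rewards.append(r)
            def mean(self):
                return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

        arms = [SlidingWindowArm(200) for _ in range(3)]
        for t in range(1000):
            # Greedy choice with a little random noise for exploration.
            i = max(range(3), key=lambda j: arms[j].mean() + 0.05 * random.random())
            true_p = [0.3, 0.5 if t < 500 else 0.1, 0.4][i]  # arm 1 drifts at t=500
            arms[i].update(1.0 if random.random() < true_p else 0.0)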
    Toward an ImageNet Library of Functions for Global Optimization Benchmarking. (arXiv:2206.13630v1 [cs.AI])
    Knowledge of the search-landscape features of BlackBox Optimization (BBO) problems offers valuable information in light of the Algorithm Selection and/or Configuration problems. Exploratory Landscape Analysis (ELA) models have gained success in identifying predefined human-derived features and in facilitating portfolio selectors to address those challenges. Unlike ELA approaches, the current study proposes to transform the identification problem into an image recognition problem, with the potential to detect conception-free, machine-driven landscape features. To this end, we introduce the notion of Landscape Images, which enables us to generate imagery instances per benchmark function, and then target the classification challenge over a diverse generalized dataset of functions. We address it as a supervised multi-class image recognition problem and apply basic artificial neural network models to solve it. The efficacy of our approach is numerically validated on the noise-free BBOB and IOHprofiler benchmarking suites. This evident successful learning is another step toward automated feature extraction and local structure deduction of BBO problems. By using this definition of landscape images, and by capitalizing on existing capabilities of image recognition algorithms, we foresee the construction of an ImageNet-like library of functions for training generalized detectors that rely on machine-driven features.
    Envelope imbalanced ensemble model with deep sample learning and local-global structure consistency. (arXiv:2206.13507v1 [cs.LG])
    The class imbalance problem is important and challenging. Ensemble approaches are widely used to tackle this problem because of their effectiveness. However, existing ensemble methods are always applied to the original samples and do not consider the structure information among them; this limitation prevents imbalanced learning from performing better. Moreover, research shows that the structure information among samples includes both local and global structure information. Based on this analysis, an imbalanced ensemble algorithm with a deep sample envelope pre-network (DSEN) and a local-global structure consistency mechanism (LGSCM) is proposed here to solve the problem. The algorithm guarantees high-quality deep envelope samples by considering local manifold and global structure information, which is helpful for imbalanced learning. First, the DSEN is designed to mine structure information among samples. Then, a local manifold structure metric (LMSM) and a global structure distribution metric (GSDM) are designed to construct the LGSCM to enhance the distribution consistency of interlayer samples. Next, the DSEN and LGSCM are put together to form the final deep sample envelope network (DSEN-LG). After that, base classifiers are applied to the layers of deep samples, respectively. Finally, the predictions from the base classifiers are fused through a bagging ensemble learning mechanism. To demonstrate the effectiveness of the proposed method, forty-four public datasets and more than ten representative relevant algorithms are chosen for verification. The experimental results show that the algorithm is significantly better than other imbalanced ensemble algorithms.
    SLOVA: Uncertainty Estimation Using Single Label One-Vs-All Classifier. (arXiv:2206.13923v1 [cs.LG])
    Deep neural networks present impressive performance, yet they cannot reliably estimate their predictive confidence, limiting their applicability in high-risk domains. We show that applying a multi-label one-vs-all loss reveals classification ambiguity and reduces model overconfidence. The introduced SLOVA (Single Label One-Vs-All) model redefines typical one-vs-all predictive probabilities to a single label situation, where only one class is the correct answer. The proposed classifier is confident only if a single class has a high probability and other probabilities are negligible. Unlike the typical softmax function, SLOVA naturally detects out-of-distribution samples if the probabilities of all other classes are small. The model is additionally fine-tuned with exponential calibration, which allows us to precisely align the confidence score with model accuracy. We verify our approach on three tasks. First, we demonstrate that SLOVA is competitive with the state-of-the-art on in-distribution calibration. Second, the performance of SLOVA is robust under dataset shifts. Finally, our approach performs extremely well in the detection of out-of-distribution samples. Consequently, SLOVA is a tool that can be used in various applications where uncertainty modeling is required.
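    The core of the one-vs-all reformulation is an independent sigmoid per class, so all classes can be improbable at once, which is what enables out-of-distribution detection. A minimal sketch follows; the threshold value and the decision rule are assumptions for illustration.

        import torch

        def ova_probabilities(logits):
            # One-vs-all probabilities: an independent sigmoid per class
            # rather than a softmax, so they need not sum to one.
            return torch.sigmoid(logits)

        def is_out_of_distribution(logits, threshold=0.5):
            # Flag inputs for which no class is confident.
            return ova_probabilities(logits).max(dim=1).values < threshold

        logits = torch.tensor([[4.0, -3.0, -5.0], [-4.0, -3.5, -5.0]])
        print(ova_probabilities(logits))
        print(is_out_of_distribution(logits))  # tensor([False, True])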
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee. (arXiv:2206.10477v2 [cs.LG] UPDATED)
    Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test-time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On three standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the best of baselines tested in terms of concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
    Sublinear-Time Clustering Oracle for Signed Graphs. (arXiv:2206.13813v1 [cs.DS])
    Social networks are often modeled using signed graphs, where vertices correspond to users and edges have a sign that indicates whether an interaction between users was positive or negative. The arising signed graphs typically contain a clear community structure in the sense that the graph can be partitioned into a small number of polarized communities, each defining a sparse cut and indivisible into smaller polarized sub-communities. We provide a local clustering oracle for signed graphs with such a clear community structure, that can answer membership queries, i.e., "Given a vertex $v$, which community does $v$ belong to?", in sublinear time by reading only a small portion of the graph. Formally, when the graph has bounded maximum degree and the number of communities is at most $O(\log n)$, then with $\tilde{O}(\sqrt{n}\operatorname{poly}(1/\varepsilon))$ preprocessing time, our oracle can answer each membership query in $\tilde{O}(\sqrt{n}\operatorname{poly}(1/\varepsilon))$ time, and it correctly classifies a $(1-\varepsilon)$-fraction of vertices w.r.t. a set of hidden planted ground-truth communities. Our oracle is desirable in applications where the clustering information is needed for only a small number of vertices. Previously, such local clustering oracles were only known for unsigned graphs; our generalization to signed graphs requires a number of new ideas and gives a novel spectral analysis of the behavior of random walks with signs. We evaluate our algorithm for constructing such an oracle and answering membership queries on both synthetic and real-world datasets, validating its performance in practice.
    Short-Term Plasticity Neurons Learning to Learn and Forget. (arXiv:2206.14048v1 [cs.NE])
    Short-term plasticity (STP) is a mechanism that stores decaying memories in synapses of the cerebral cortex. In computing practice, STP has been used, but mostly in the niche of spiking neurons, even though theory predicts that it is the optimal solution to certain dynamic tasks. Here we present a new type of recurrent neural unit, the STP Neuron (STPN), which turns out to be strikingly powerful. Its key mechanism is that synapses have a state, propagated through time by a self-recurrent connection-within-the-synapse. This formulation enables training the plasticity with backpropagation through time, resulting in a form of learning to learn and forget in the short term. The STPN outperforms all tested alternatives, i.e. RNNs, LSTMs, other models with fast weights, and differentiable plasticity. We confirm this in both supervised and reinforcement learning (RL), and in tasks such as Associative Retrieval, Maze Exploration, Atari video games, and MuJoCo robotics. Moreover, we calculate that, in neuromorphic or biological circuits, the STPN minimizes energy consumption across models, as it depresses individual synapses dynamically. Based on these results, biological STP may have been a strong evolutionary attractor that maximizes both efficiency and computational power. The STPN now brings these neuromorphic advantages also to a broad spectrum of machine learning practice. Code is available at https://github.com/NeuromorphicComputing/stpn
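    A hedged sketch of the synapse-state idea: each synapse carries a trace that decays over time and is updated by pre- and post-synaptic activity, and the trace is added to the fixed weight. The decay/update parametrization below is our assumption for illustration, not the paper's exact formulation (see the repository above for that).

        import torch

        class STPSynapseSketch(torch.nn.Module):
            # Toy recurrent unit with a per-synapse state F that decays
            # (lam) and is written by activity (gam); trained end-to-end
            # with backpropagation through time.
            def __init__(self, n_in, n_out):
                super().__init__()
                self.w = torch.nn.Parameter(0.1 * torch.randn(n_out, n_in))
                self.lam = torch.nn.Parameter(torch.full((n_out, n_in), 0.9))
                self.gam = torch.nn.Parameter(torch.full((n_out, n_in), 0.1))

            def forward(self, x_seq):
                F = torch.zeros_like(self.w)
                for x in x_seq:  # x: (n_in,)
                    h = torch.tanh((self.w + F) @ x)
                    F = self.lam * F + self.gam * torch.outer(h, x)  # short-term trace
                return h

        unit = STPSynapseSketch(4, 3)
        out = unit(torch.randn(10, 4))  # last hidden state of a length-10 sequence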
    Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment. (arXiv:2206.13951v1 [cs.CV])
    Vision Transformer (ViT) is becoming more popular in image processing. Specifically, we investigate the effectiveness of test-time adaptation (TTA) on ViT, a technique that has emerged to correct a model's prediction during test time by itself. First, we benchmark various test-time adaptation approaches on ViT-B16 and ViT-L16. It is shown that TTA is effective on ViT and that the prior convention (sensibly selecting modulation parameters) is not necessary when using a proper loss function. Based on this observation, we propose a new test-time adaptation method called class-conditional feature alignment (CFA), which minimizes both the class-conditional distribution differences and the whole distribution differences of the hidden representation between the source and target in an online manner. Experiments on image classification tasks with common corruptions (CIFAR-10-C, CIFAR-100-C, and ImageNet-C) and domain adaptation (digits datasets and ImageNet-Sketch) show that CFA stably outperforms the existing baselines on various datasets. We also verify that CFA is model agnostic by experimenting on ResNet, MLP-Mixer, and several ViT variants (ViT-AugReg, DeiT, and BeiT). Using a BeiT backbone, CFA achieves a 19.8% top-1 error rate on ImageNet-C, outperforming the existing test-time adaptation baseline of 44.0%. This is a state-of-the-art result among TTA methods that do not need to alter the training phase.
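    Schematically, CFA-style alignment penalizes the gap between test-batch feature statistics and source statistics stored from training, both overall and per class. A hedged sketch follows; the squared-distance form, the use of variances, and pseudo-labels in place of true labels are assumptions, not the paper's precise objective.

        import torch

        def cfa_loss_sketch(feats, pseudo_labels, src_mean, src_var, cls_means):
            # Whole-distribution term: match batch mean/variance to source.
            loss = ((feats.mean(0) - src_mean) ** 2).mean() \
                 + ((feats.var(0) - src_var) ** 2).mean()
            # Class-conditional term: match per-(pseudo-)class means.
            for c in pseudo_labels.unique():
                fc = feats[pseudo_labels == c]
                loss = loss + ((fc.mean(0) - cls_means[c]) ** 2).mean()
            return loss

        feats = torch.randn(32, 64)                # hidden features of a test batch
        labels = torch.randint(0, 10, (32,))       # pseudo-labels from the model
        loss = cfa_loss_sketch(feats, labels, torch.zeros(64), torch.ones(64),
                               torch.zeros(10, 64))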
    Learning Generalizable Dexterous Manipulation from Human Grasp Affordance. (arXiv:2204.02320v3 [cs.RO] UPDATED)
    Dexterous manipulation with a multi-finger hand is one of the most challenging problems in robotics. While recent progress in imitation learning has largely improved sample efficiency compared to Reinforcement Learning, the learned policy can hardly generalize to manipulate novel objects, given limited expert demonstrations. In this paper, we propose to learn dexterous manipulation using large-scale demonstrations with diverse 3D objects in a category, which are generated from a human grasp affordance model. This generalizes the policy to novel object instances within the same category. To train the policy, we propose a novel imitation learning objective jointly with a geometric representation learning objective using our demonstrations. By experimenting with relocating diverse objects in simulation, we show that our approach outperforms baselines by a large margin when manipulating novel objects. We also ablate the importance of 3D object representation learning for manipulation. We include videos, code, and additional information on the project website - https://kristery.github.io/ILAD/ .
    Evaluating Understanding on Conceptual Abstraction Benchmarks. (arXiv:2206.14187v1 [cs.AI])
    A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic evaluations centered around concepts, by probing a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations in two domains -- RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC) -- that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.
    Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs. (arXiv:2206.13787v1 [cs.LG])
    Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development, owing to unique challenges such as capturing dependencies in imbalanced data and optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input that not only represents the minority class in the imbalanced data, but also captures the dependency between variables. We inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing the dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation, considering the different data structures and characteristics of real-world datasets, such as imbalanced variables, abnormal distributions, and sparsity of data.
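    The differential-privacy ingredient is a standard DP-SGD-style gradient step (per-example clipping plus Gaussian noise); the sketch below shows that step in isolation with illustrative constants, and is not the authors' training loop.

        import torch

        def dp_gradient_step(params, per_example_grads, clip_norm=1.0, noise_mult=1.1, lr=1e-3):
            # per_example_grads: one (batch, *param_shape) tensor per parameter.
            batch = per_example_grads[0].shape[0]
            flat = torch.cat([g.reshape(batch, -1) for g in per_example_grads], dim=1)
            scale = (clip_norm / (flat.norm(dim=1) + 1e-12)).clamp(max=1.0)  # per-example clip factor
            for p, g in zip(params, per_example_grads):
                g = g * scale.view(-1, *([1] * (g.dim() - 1)))               # clip each example's grad
                noisy = g.sum(dim=0) + noise_mult * clip_norm * torch.randn_like(p)
                p.data -= lr * noisy / batch                                 # averaged noisy update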
    Nonparametric, Nonasymptotic Confidence Bands with Paley-Wiener Kernels for Band-Limited Functions. (arXiv:2206.13629v1 [stat.ML])
    The paper introduces a method to construct confidence bands for bounded, band-limited functions based on a finite sample of input-output pairs. The approach is distribution-free w.r.t. the observation noises and only the knowledge of the input distribution is assumed. It is nonparametric, that is, it does not require a parametric model of the regression function and the regions have non-asymptotic guarantees. The algorithm is based on the theory of Paley-Wiener reproducing kernel Hilbert spaces. The paper first studies the fully observable variant, when there are no noises on the observations and only the inputs are random; then it generalizes the ideas to the noisy case using gradient-perturbation methods. Finally, numerical experiments demonstrating both cases are presented.
    Detecting Unintended Memorization in Language-Model-Fused ASR. (arXiv:2204.09606v2 [cs.CL] UPDATED)
    End-to-end (E2E) models are often accompanied by language models (LMs) via shallow fusion to boost their overall quality as well as their recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data when one has only black-box (query) access to the LM-fused speech recognizer, as opposed to direct access to the LM. On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of singly-occurring canaries from the LM training data of 300M examples is possible. Motivated to protect privacy, we also show that such memorization gets significantly reduced by per-example gradient-clipped LM training without compromising overall quality.
    Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse. (arXiv:2206.13714v1 [cs.LG])
    Real-world sequential decision making requires data-driven algorithms that provide practical guarantees on performance throughout training while also making efficient use of data. Model-free deep reinforcement learning represents a framework for such data-driven decision making, but existing algorithms typically only focus on one of these goals while sacrificing performance with respect to the other. On-policy algorithms guarantee policy improvement throughout training but suffer from high sample complexity, while off-policy algorithms make efficient use of data through sample reuse but lack theoretical guarantees. In order to balance these competing goals, we develop a class of Generalized Policy Improvement algorithms that combines the policy improvement guarantees of on-policy methods with the efficiency of theoretically supported sample reuse. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a variety of continuous control tasks from the DeepMind Control Suite.
    H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture. (arXiv:2206.13734v1 [cs.AR])
    Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly defined as having unstructured data, especially graphs. Compared with other ML modalities, the acceleration of GNNs is more challenging due to the irregularity and heterogeneity derived from graph topologies. Existing efforts, however, have focused mainly on handling graphs' irregularity and have not studied their heterogeneity. To this end, we propose H-GCN, a PL (Programmable Logic) and AIE (AI Engine) based hybrid accelerator that leverages the emerging heterogeneity of Xilinx Versal Adaptive Compute Acceleration Platforms (ACAPs) to achieve high-performance GNN inference. In particular, H-GCN partitions each graph into three subgraphs based on its inherent heterogeneity, and processes them using PL and AIE, respectively. To further improve performance, we explore the sparsity support of AIE and develop an efficient density-aware method to automatically map tiles of sparse matrix-matrix multiplication (SpMM) onto the systolic tensor array. Compared with state-of-the-art GCN accelerators, H-GCN achieves, on average, speedups of 1.1~2.3X.
    Classification of ADHD Patients Using Kernel Hierarchical Extreme Learning Machine. (arXiv:2206.13761v1 [cs.LG])
    Recently, the application of deep learning models to diagnose neuropsychiatric diseases from brain imaging data has received more and more attention. In practice, exploring interactions in brain functional connectivity based on functional magnetic resonance imaging data is critical for studying mental illness. Since Attention-Deficit and Hyperactivity Disorder (ADHD) is a type of chronic disease that is very difficult to diagnose in its early stages, it is necessary to improve the diagnostic accuracy of machine learning models so that such illness can be treated before patients reach a critical condition. In this study, we utilize the dynamics of brain functional connectivity to model features from medical imaging data, which can extract the differences in brain function interactions between Normal Control (NC) and ADHD subjects. To meet that requirement, we employ the Bayesian connectivity change-point model to detect brain dynamics using the local binary encoding approach, and a kernel hierarchical extreme learning machine for classifying the features. To verify our model, we experimented on several real-world children's datasets, and our results achieved superior classification rates compared to state-of-the-art models.
    Rankings from multimodal pairwise comparisons. (arXiv:2206.13580v1 [stat.ML])
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.
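    The classical single-mode core of such a method is the Bradley-Terry model, fitted here with the standard MM (Zermelo) iteration; the paper's contribution wraps a modified version of this core in an EM loop that additionally learns how informative each comparison mode is.

        import numpy as np

        def bradley_terry_mm(wins, n_iter=200):
            # wins[i, j] = number of times i beat j (zero diagonal assumed).
            n = wins.shape[0]
            games = wins + wins.T                       # comparisons per pair
            pi = np.ones(n)
            for _ in range(n_iter):
                denom = games / (pi[:, None] + pi[None, :])
                np.fill_diagonal(denom, 0.0)
                pi = wins.sum(axis=1) / denom.sum(axis=1)
                pi = np.maximum(pi, 1e-12)              # guard competitors with no wins
                pi /= pi.sum()                          # fix the arbitrary scale
            return pi                                   # higher score = ranked better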
    Heterogeneous mixtures of dictionary functions to approximate subspace invariance in Koopman operators. (arXiv:2206.13585v1 [eess.SY])
    Koopman operators model nonlinear dynamics as a linear dynamic system acting on a nonlinear function as the state. This nonstandard state is often called a Koopman observable and is usually approximated numerically by a superposition of functions drawn from a dictionary. A widely used algorithm is Extended Dynamic Mode Decomposition (EDMD), where the dictionary functions are drawn from a fixed, homogeneous class of functions. Recently, deep learning combined with EDMD has been used to learn novel dictionary functions in an algorithm called deep dynamic mode decomposition (deepDMD). The learned representation both (1) accurately models and (2) scales well with the dimension of the original nonlinear system. In this paper we analyze the learned dictionaries from deepDMD and explore the theoretical basis for their strong performance. We discover a novel class of dictionary functions to approximate Koopman observables. Error analysis of these dictionary functions shows that they satisfy a property of subspace approximation, which we define as uniform finite approximate closure. We discover that structured mixing of heterogeneous dictionary functions drawn from different classes of nonlinear functions achieves the same accuracy and dimensional scaling as deepDMD. This mixed dictionary does so with an order of magnitude reduction in parameters, while maintaining geometric interpretability. Our results provide a hypothesis to explain the success of deep neural networks in learning numerical approximations to Koopman operators.
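    EDMD itself reduces to a lifted least-squares problem; a compact sketch follows, with a hypothetical monomial dictionary of the fixed, homogeneous kind that deepDMD replaces with learned, heterogeneous functions.

        import numpy as np

        def edmd(X, Y, dictionary):
            # X, Y: (n_samples, n_state) snapshot pairs with Y[t] = f(X[t]);
            # dictionary: maps states to (n_samples, n_features) lifted observables.
            PsiX, PsiY = dictionary(X), dictionary(Y)
            K, *_ = np.linalg.lstsq(PsiX, PsiY, rcond=None)  # Psi(Y) ~ Psi(X) @ K
            return K

        # e.g. a fixed monomial dictionary:
        monomials = lambda X: np.hstack([np.ones((len(X), 1)), X, X**2])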
    DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems. (arXiv:2206.13754v1 [cs.MA])
    While notable progress has been made in specifying and learning objectives for general cyber-physical systems, applying these methods to distributed multi-agent systems still poses significant challenges. Among these are the need to (a) craft specification primitives that allow expression and interplay of both local and global objectives, (b) tame explosion in the state and action spaces to enable effective learning, and (c) minimize coordination frequency and the set of engaged participants for global objectives. To address these challenges, we propose a novel specification framework that allows natural composition of local and global objectives used to guide training of a multi-agent system. Our technique enables learning expressive policies that allow agents to operate in a coordination-free manner for local objectives, while using a decentralized communication protocol for enforcing global ones. Experimental results support our claim that sophisticated multi-agent distributed planning problems can be effectively realized using specification-guided learning.
    Attention-based conditioning methods using variable frame rate for style-robust speaker verification. (arXiv:2206.13680v1 [eess.AS])
    We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer to transform frame-level to utterance-level features by calculating statistics over all utterance frames, with equal weighting. However, self-attentive embeddings perform weighted pooling such that the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer to provide the network with information that can address style effects. This work explores five different approaches to conditioning. The best conditioning approach, concatenation with gating, provided statistically significant improvements over the x-vector baseline in 12/23 tasks and was the same as the baseline in 11/23 tasks when using the UCLA speaker variability database. It also significantly outperformed self-attention without conditioning in 9/23 tasks and was worse in 1/23. The method also showed significant improvements in multi-speaker scenarios of SITW.
    ProGen2: Exploring the Boundaries of Protein Language Models. (arXiv:2206.13517v1 [cs.LG])
    Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.
    TTS-CGAN: A Transformer Time-Series Conditional GAN for Biosignal Data Augmentation. (arXiv:2206.13676v1 [cs.LG])
    Signal measurement appearing in the form of time series is one of the most common types of data used in medical machine learning applications. Such datasets are often small in size, expensive to collect and annotate, and might involve privacy issues, which hinders our ability to train large, state-of-the-art deep learning models for biomedical applications. For time-series data, the suite of data augmentation strategies we can use to expand the size of the dataset is limited by the need to maintain the basic properties of the signal. Generative Adversarial Networks (GANs) can be utilized as another data augmentation tool. In this paper, we present TTS-CGAN, a transformer-based conditional GAN model that can be trained on existing multi-class datasets and generate class-specific synthetic time-series sequences of arbitrary length. We elaborate on the model architecture and design strategies. Synthetic sequences generated by our model are indistinguishable from real ones, and can be used to complement or replace real signals of the same type, thus achieving the goal of data augmentation. To evaluate the quality of the generated data, we modify the wavelet coherence metric to be able to compare the similarity between two sets of signals, and also conduct a case study where a mix of synthetic and real data are used to train a deep learning model for sequence classification. Together with other visualization techniques and qualitative evaluation approaches, we demonstrate that TTS-CGAN generated synthetic data are similar to real data, and that our model performs better than the other state-of-the-art GAN models built for time-series data generation.
    Online Resource Allocation under Horizon Uncertainty. (arXiv:2206.13606v1 [cs.DS])
    We study stochastic online resource allocation: a decision maker needs to allocate limited resources to stochastically-generated sequentially-arriving requests in order to maximize reward. Motivated by practice, we consider a data-driven setting in which requests are drawn independently from a distribution that is unknown to the decision maker. Online resource allocation and its special cases have been studied extensively in the past, but these previous results crucially and universally rely on a practically-untenable assumption: the total number of requests (the horizon) is known to the decision maker in advance. In many applications, such as revenue management and online advertising, the number of requests can vary widely because of fluctuations in demand or user traffic intensity. In this work, we develop online algorithms that are robust to horizon uncertainty. In sharp contrast to the known-horizon setting, we show that no algorithm can achieve a constant asymptotic competitive ratio that is independent of the horizon uncertainty. We then introduce a novel algorithm that combines dual mirror descent with a carefully-chosen target consumption sequence and prove that it achieves a bounded competitive ratio. Our algorithm is near-optimal in the sense that its competitive ratio attains the optimal rate of growth when the horizon uncertainty grows large.
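    A hedged sketch of the dual-mirror-descent skeleton for a single resource; the acceptance rule, Euclidean mirror map, and the way the target consumption sequence enters are illustrative simplifications of the paper's algorithm.

        def dual_md_allocation(requests, budget, target_rates, eta=0.05):
            # requests: iterable of (reward, consumption) pairs; target_rates[t]:
            # planned per-step consumption (the key ingredient under horizon uncertainty).
            mu, spent, decisions = 0.0, 0.0, []
            for t, (r, c) in enumerate(requests):
                accept = (r - mu * c > 0) and (spent + c <= budget)  # price-adjusted reward test
                spent += c if accept else 0.0
                decisions.append(accept)
                consumed = c if accept else 0.0
                mu = max(0.0, mu + eta * (consumed - target_rates[t]))  # dual (sub)gradient step
            return decisions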
    Improved Certified Defenses against Data Poisoning with (Deterministic) Finite Aggregation. (arXiv:2202.02628v2 [cs.LG] UPDATED)
    Data poisoning attacks aim at manipulating model behaviors through distorting training data. Previously, an aggregation-based certified defense, Deep Partition Aggregation (DPA), was proposed to mitigate this threat. DPA predicts through an aggregation of base classifiers trained on disjoint subsets of data, thus restricting its sensitivity to dataset distortions. In this work, we propose an improved certified defense against general poisoning attacks, namely Finite Aggregation. In contrast to DPA, which directly splits the training set into disjoint subsets, our method first splits the training set into smaller disjoint subsets and then combines duplicates of them to build larger (but not disjoint) subsets for training base classifiers. This reduces the worst-case impacts of poison samples and thus improves certified robustness bounds. In addition, we offer an alternative view of our method, bridging the designs of deterministic and stochastic aggregation-based certified defenses. Empirically, our proposed Finite Aggregation consistently improves certificates on MNIST, CIFAR-10, and GTSRB, boosting certified fractions by up to 3.05%, 3.87% and 4.77%, respectively, while keeping the same clean accuracies as DPA's, effectively establishing a new state of the art in (pointwise) certified robustness against data poisoning.
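    Roughly, the construction can be pictured as follows; the hash function and the consecutive-bucket spreading rule are assumptions made for illustration, not necessarily the paper's exact scheme.

        import hashlib

        def finite_aggregation_splits(example_ids, k, d):
            # Split data into k*d small disjoint buckets, then give each base
            # classifier the union of d buckets; buckets are reused across
            # classifiers, so training sets overlap (unlike DPA's partition).
            n_buckets = k * d
            buckets = [[] for _ in range(n_buckets)]
            for ex in example_ids:
                h = int(hashlib.md5(str(ex).encode()).hexdigest(), 16) % n_buckets
                buckets[h].append(ex)
            return [sum((buckets[(j + i) % n_buckets] for i in range(d)), [])
                    for j in range(n_buckets)]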
    Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee. (arXiv:2206.10477v2 [cs.LG] UPDATED)
    Kernel survival analysis models estimate individual survival distributions with the help of a kernel function, which measures the similarity between any two data points. Such a kernel function can be learned using deep kernel survival models. In this paper, we present a new deep kernel survival model called a survival kernet, which scales to large datasets in a manner that is amenable to model interpretation and also theoretical analysis. Specifically, the training data are partitioned into clusters based on a recently developed training set compression scheme for classification and regression called kernel netting that we extend to the survival analysis setting. At test-time, each data point is represented as a weighted combination of these clusters, and each such cluster can be visualized. For a special case of survival kernets, we establish a finite-sample error bound on predicted survival distributions that is, up to a log factor, optimal. Whereas scalability at test time is achieved using the aforementioned kernel netting compression strategy, scalability during training is achieved by a warm-start procedure based on tree ensembles such as XGBoost and a heuristic approach to accelerating neural architecture search. On three standard survival analysis datasets of varying sizes (up to roughly 3 million data points), we show that survival kernets are highly competitive with the best of baselines tested in terms of concordance index. Our code is available at: https://github.com/georgehc/survival-kernets
    Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach. (arXiv:2206.13984v1 [cs.IT])
    One of the main focuses in distributed learning is communication efficiency, since model aggregation at each round of training can involve millions to billions of parameters. Several model compression methods, such as gradient quantization and sparsification, have been proposed to improve the communication efficiency of model aggregation. However, the information-theoretic minimum communication cost for a given distortion of gradient estimators is still unknown. In this paper, we study the fundamental limit of the communication cost of model aggregation in distributed learning from a rate-distortion perspective. By formulating the model aggregation as a vector Gaussian CEO problem, we derive the rate region bound and sum-rate-distortion function for the model aggregation problem, which reveals the minimum communication rate at a particular gradient distortion upper bound. We also analyze the communication cost at each iteration and the total communication cost based on the sum-rate-distortion function with the gradient statistics of real-world datasets. It is found that the communication gain from exploiting the correlation between worker nodes is significant for SignSGD, and that a high distortion of the gradient estimator can achieve a low total communication cost in gradient compression.
    SurvTRACE: Transformers for Survival Analysis with Competing Events. (arXiv:2110.00855v2 [cs.LG] UPDATED)
    In medicine, survival analysis studies the time duration until events of interest such as mortality. One major challenge is how to deal with multiple competing events (e.g., multiple disease diagnoses). In this work, we propose SurvTRACE, a transformer-based model that makes no assumption about the underlying survival distribution and is capable of handling competing events. We account for the implicit confounders in the observational setting in multi-event scenarios, which cause selection bias as the predicted survival probability is influenced by irrelevant factors. To sufficiently utilize the survival data to train transformers from scratch, multiple auxiliary tasks are designed for multi-task learning. The model hence learns a strong shared representation from all these tasks, which in turn serves better survival analysis. We further demonstrate how to inspect covariate relevance and importance through the interpretable attention mechanisms of SurvTRACE, which shows great potential for enhancing clinical trial design and new treatment development. Experiments on METABRIC, SUPPORT, and SEER data with 470k patients validate the all-around superiority of our method.
    Equivariant Priors for Compressed Sensing with Unknown Orientation. (arXiv:2206.14069v1 [cs.LG])
    In compressed sensing, the goal is to reconstruct the signal from an underdetermined system of linear measurements. Thus, prior knowledge about the signal of interest and its structure is required. Additionally, in many scenarios, the signal has an unknown orientation prior to measurements. To address such recovery problems, we propose using equivariant generative models as a prior, which encapsulate orientation information in their latent space. Thereby, we show that signals with unknown orientations can be recovered with iterative gradient descent on the latent space of these models and provide additional theoretical recovery guarantees. We construct an equivariant variational autoencoder and use the decoder as generative prior for compressed sensing. We discuss additional potential gains of the proposed approach in terms of convergence and latency.
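    The recovery loop itself is plain gradient descent on the latent space; a minimal sketch, assuming a trained decoder g, measurement matrix A, and measurements y (with an equivariant decoder, part of z encodes the unknown orientation, so the same loop recovers it too).

        import torch

        def recover(decoder, A, y, latent_dim, steps=500, lr=0.05):
            z = torch.zeros(latent_dim, requires_grad=True)
            opt = torch.optim.Adam([z], lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss = (A @ decoder(z) - y).pow(2).sum()  # data fit in measurement space
                loss.backward()
                opt.step()
            return decoder(z).detach()                    # reconstructed signal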
    Neural Tangent Kernel Analysis of Deep Narrow Neural Networks. (arXiv:2202.02981v2 [cs.LG] UPDATED)
    The tremendous recent progress in analyzing the training dynamics of overparameterized neural networks has primarily focused on wide networks and therefore does not sufficiently address the role of depth in deep learning. In this work, we present the first trainability guarantee of infinitely deep but narrow neural networks. We study the infinite-depth limit of a multilayer perceptron (MLP) with a specific initialization and establish a trainability guarantee using the NTK theory. We then extend the analysis to an infinitely deep convolutional neural network (CNN) and perform brief experiments.
    Exact Spectral Norm Regularization for Neural Networks. (arXiv:2206.13581v1 [stat.ML])
    We pursue a line of research that seeks to regularize the spectral norm of the Jacobian of the input-output mapping for deep neural networks. While previous works rely on upper-bounding techniques, we provide a scheme that targets the exact spectral norm. We show that our algorithm achieves improved generalization performance compared to previous spectral regularization techniques while simultaneously maintaining a strong safeguard against natural and adversarial noise. Moreover, we further explore some previous reasoning concerning the strong adversarial protection that Jacobian regularization provides and show that it can be misleading.
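    The standard primitive behind targeting the exact spectral norm is power iteration that touches the Jacobian only through Jacobian-vector and vector-Jacobian products; a sketch of that primitive (the paper's full regularization scheme builds on such machinery, not necessarily this exact routine):

        import torch
        from torch.autograd.functional import jvp, vjp

        def jacobian_spectral_norm(f, x, n_iter=10):
            # f: tensor -> tensor; estimates sigma_max of the Jacobian of f at x.
            u = torch.randn_like(x)
            u = u / u.norm()
            for _ in range(n_iter):
                _, Ju = jvp(f, x, u)         # J u      (forward product)
                _, JtJu = vjp(f, x, Ju)      # J^T J u  (backward product)
                u = JtJu / (JtJu.norm() + 1e-12)
            _, Ju = jvp(f, x, u)
            return Ju.norm()                 # ~ largest singular value of J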
    Entropy-based Characterization of Modeling Constraints. (arXiv:2206.14105v1 [stat.ME])
    In most data-scientific approaches, the principle of Maximum Entropy (MaxEnt) is used to a posteriori justify some parametric model which has already been chosen based on experience, prior knowledge or computational simplicity. In a formulation orthogonal to conventional model building, we start from the linear system of phenomenological constraints and asymptotically derive the distribution over all viable distributions that satisfy the provided set of constraints. The MaxEnt distribution plays a special role, as it is the most typical among all phenomenologically viable distributions, representing a good expansion point for large-N techniques. This enables us to consistently formulate hypothesis testing in a fully data-driven manner. The appropriate parametric model which is supported by the data can always be deduced at the end of model selection. In the MaxEnt framework, we recover major scores and selection procedures used in multiple applications and assess their ability to capture associations in the data-generating process and identify the most generalizable model. This data-driven counterpart of standard model selection demonstrates the unifying perspective of the deductive logic advocated by the MaxEnt principle, while potentially shedding new light on the inverse problem.
    Disentangling Embedding Spaces with Minimal Distributional Assumptions. (arXiv:2206.13872v1 [stat.ML])
    Interest in understanding and factorizing learned embedding spaces is growing. For instance, recent concept-based explanation techniques analyze a machine learning model in terms of interpretable latent components. Such components have to be discovered in the model's embedding space, e.g., through independent component analysis (ICA) or modern disentanglement learning techniques. While these unsupervised approaches offer a sound formal framework, they either require access to a data generating function or impose rigid assumptions on the data distribution, such as independence of components, that are often violated in practice. In this work, we link conceptual explainability for vision models with disentanglement learning and ICA. This enables us to provide first theoretical results on how components can be identified without requiring any distributional assumptions. From these insights, we derive the disjoint attributions (DA) concept discovery method that is applicable to a broader class of problems than current approaches but yet possesses a formal identifiability guarantee. In an extensive comparison against component analysis and over 300 state-of-the-art disentanglement models, DA stably maintains superior performance, even under varying distributions and correlation strengths.
    Detecting Distributional Differences in Labeled Sequence Data with Application to Tropical Cyclone Satellite Imagery. (arXiv:2202.02253v3 [stat.AP] UPDATED)
    Our goal is to quantify whether, and if so how, spatio-temporal patterns in tropical cyclone (TC) satellite imagery signal an upcoming rapid intensity change event. To address this question, we propose a new nonparametric test of association between a time series of images and a series of binary event labels. We ask whether there is a difference in distribution between (dependent but identically distributed) 24-h sequences of images preceding an event versus a non-event. By rewriting the statistical test as a regression problem, we leverage neural networks to infer modes of structural evolution of TC convection that are representative of the lead-up to rapid intensity change events. Dependencies between nearby sequences are handled by a bootstrap procedure that estimates the marginal distribution of the label series. We prove that type I error control is guaranteed as long as the distribution of the label series is well-estimated, which is made easier by the extensive historical data for binary TC event labels. We show empirical evidence that our proposed method identifies archetypes of infrared imagery associated with elevated rapid intensification risk, typically marked by deep or deepening core convection over time. Such results provide a foundation for improved forecasts of rapid intensification.
    Stochastic first-order methods for average-reward Markov decision processes. (arXiv:2205.05800v4 [cs.LG] UPDATED)
    We study the problem of average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy evaluation and optimization. Existing on-policy evaluation methods suffer from sub-optimal convergence rates as well as failure in handling insufficiently random policies, e.g., deterministic policies, for lack of exploration. To remedy these issues, we develop a novel variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and an exploratory variance-reduced temporal difference (EVRTD) method for insufficiently random policies with comparable convergence guarantees. We further establish linear convergence rate on the bias of policy evaluation, which is essential for improving the overall sample complexity of policy optimization. On the other hand, compared with intensive research interest in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions on the underlying Markov processes (see, e.g., Abbasi-Yadkori et al., 2019), and they often lack guarantees on the overall sample complexities. Towards this end, we develop an average-reward variant of the stochastic policy mirror descent (SPMD) (Lan, 2022). We establish the first $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for solving AMDPs with policy gradient method under both the generative model (with unichain assumption) and Markovian noise model (with ergodic assumption). This bound can be further improved to $\widetilde{\mathcal{O}}(\epsilon^{-1})$ for solving regularized AMDPs. Our theoretical advantages are corroborated by numerical experiments.
    A Global Stochastic Optimization Particle Filter Algorithm. (arXiv:2007.04803v8 [stat.ML] UPDATED)
    We introduce G-PFSO, a new online algorithm for expected log-likelihood maximization in situations where the objective function is multi-modal and/or has saddle points. The key element underpinning G-PFSO is a probability distribution which (a) is shown to concentrate on the target parameter value as the sample size increases and (b) can be efficiently estimated by means of a standard particle filter algorithm. This distribution depends on a learning rate: the faster the learning rate, the quicker the distribution concentrates on the desired element of the search space, but the less likely G-PFSO is to escape from a local optimum of the objective function. In order to achieve a fast convergence rate with a slow learning rate, G-PFSO exploits the acceleration property of averaging, well-known in the stochastic gradient literature. Considering several challenging estimation problems, the numerical experiments show that, with high probability, G-PFSO successfully finds the highest mode of the objective function and converges to its global maximizer at the optimal rate. While the focus of this work is expected log-likelihood maximization, the proposed methodology and its theory apply more generally to optimizing a function defined through an expectation.
    Offline Reinforcement Learning with Realizability and Single-policy Concentrability. (arXiv:2202.04634v3 [cs.LG] UPDATED)
    Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
    TACTiS: Transformer-Attentional Copulas for Time Series. (arXiv:2202.03528v2 [cs.LG] UPDATED)
    The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance. However, the practical utility of such estimates is limited by how accurately they quantify predictive uncertainty. In this work, we address the problem of estimating the joint predictive distribution of high-dimensional multivariate time series. We propose a versatile method, based on the transformer architecture, that estimates joint distributions using an attention-based decoder that provably learns to mimic the properties of non-parametric copulas. The resulting model has several desirable properties: it can scale to hundreds of time series, supports both forecasting and interpolation, can handle unaligned and non-uniformly sampled data, and can seamlessly adapt to missing data during training. We demonstrate these properties empirically and show that our model produces state-of-the-art predictions on multiple real-world datasets.
    Measure Estimation in the Barycentric Coding Model. (arXiv:2201.12195v2 [stat.ML] UPDATED)
    This paper considers the problem of measure estimation under the barycentric coding model (BCM), in which an unknown measure is assumed to belong to the set of Wasserstein-2 barycenters of a finite set of known measures. Estimating a measure under this model is equivalent to estimating the unknown barycentric coordinates. We provide novel geometrical, statistical, and computational insights for measure estimation under the BCM, consisting of three main results. Our first main result leverages the Riemannian geometry of Wasserstein-2 space to provide a procedure for recovering the barycentric coordinates as the solution to a quadratic optimization problem assuming access to the true reference measures. The essential geometric insight is that the parameters of this quadratic problem are determined by inner products between the optimal displacement maps from the given measure to the reference measures defining the BCM. Our second main result then establishes an algorithm for solving for the coordinates in the BCM when all the measures are observed empirically via i.i.d. samples. We prove precise rates of convergence for this algorithm -- determined by the smoothness of the underlying measures and their dimensionality -- thereby guaranteeing its statistical consistency. Finally, we demonstrate the utility of the BCM and associated estimation procedures in three application areas: (i) covariance estimation for Gaussian measures; (ii) image processing; and (iii) natural language processing.
    On the universality of the volatility formation process: when machine learning and rough volatility agree. (arXiv:2206.14114v1 [q-fin.ST])
    We train an LSTM network on a pooled dataset made of hundreds of liquid stocks, aiming to forecast the next daily realized volatility for all stocks. Showing the consistent outperformance of this universal LSTM relative to other asset-specific parametric models, we uncover nonparametric evidence of a universal volatility formation mechanism across assets relating past market realizations, including daily returns and volatilities, to current volatilities. A parsimonious parametric forecasting device combining the rough fractional stochastic volatility and quadratic rough Heston models with fixed parameters results in the same level of performance as the universal LSTM, which confirms the universality of the volatility formation process from a parametric perspective.
    Cost-Efficient Distributed Learning via Combinatorial Multi-Armed Bandits. (arXiv:2202.08302v2 [cs.IT] UPDATED)
    We consider the distributed SGD problem, where a main node distributes gradient calculations among $n$ workers. By assigning tasks to all the workers and waiting only for the $k$ fastest ones, the main node can trade-off the algorithm's error with its runtime by gradually increasing $k$ as the algorithm evolves. However, this strategy, referred to as adaptive $k$-sync, neglects the cost of unused computations and of communicating models to workers that reveal a straggling behavior. We propose a cost-efficient scheme that assigns tasks only to $k$ workers, and gradually increases $k$. We introduce the use of a combinatorial multi-armed bandit model to learn which workers are the fastest while assigning gradient calculations. Assuming workers with exponentially distributed response times parameterized by different means, we give empirical and theoretical guarantees on the regret of our strategy, i.e., the extra time spent to learn the mean response times of the workers. Furthermore, we propose and analyze a strategy applicable to a large class of response time distributions. Compared to adaptive $k$-sync, our scheme achieves significantly lower errors with the same computational efforts and less downlink communication while being inferior in terms of speed.
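    The bandit step can be sketched as a lower-confidence-bound rule over mean response times (smaller is better, so optimism points downward); the exploration constant and bookkeeping here are illustrative.

        import numpy as np

        def pick_k_fastest(mean_time, counts, t, k):
            # mean_time, counts: empirical mean response times / pulls per worker.
            bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
            lcb = mean_time - bonus          # optimistic (low) estimate of each worker's time
            return np.argsort(lcb)[:k]       # assign gradient tasks to these k workers

    After each round one would update the chosen workers' statistics with their observed response times and gradually increase k as training progresses.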
    Continuous Treatment Recommendation with Deep Survival Dose Response Function. (arXiv:2108.10453v4 [stat.ML] UPDATED)
    We propose a general formulation for continuous treatment recommendation problems in settings with clinical survival data, which we call the Deep Survival Dose Response Function (DeepSDRF). That is, we consider the problem of learning the conditional average dose response (CADR) function solely from historical data in which observed factors (confounders) affect both observed treatment and time-to-event outcomes. The estimated treatment effect from DeepSDRF enables us to develop recommender algorithms with the correction for selection bias. We compared two recommender approaches based on random search and reinforcement learning and found similar performance in terms of patient outcome. We tested the DeepSDRF and the corresponding recommender on extensive simulation studies and the eICU Research Institute (eRI) database. To the best of our knowledge, this is the first time that causal models are used to address the continuous treatment effect with observational data in a medical context.
    Graph-Based Machine Learning Improves Just-in-Time Defect Prediction. (arXiv:2110.05371v2 [cs.SE] UPDATED)
    The increasing complexity of today's software requires the contribution of thousands of developers. This complex collaboration structure makes developers more likely to introduce defect-prone changes that lead to software faults. Determining when these defect-prone changes are introduced has proven challenging, and using traditional machine learning (ML) methods to make these determinations seems to have reached a plateau. In this work, we build contribution graphs consisting of developers and source files to capture the nuanced complexity of changes required to build software. By leveraging these contribution graphs, our research shows the potential of using graph-based ML to improve Just-In-Time (JIT) defect prediction. We hypothesize that features extracted from the contribution graphs may be better predictors of defect-prone changes than intrinsic features derived from software characteristics. We corroborate our hypothesis using graph-based ML for classifying edges that represent defect-prone changes. This new framing of the JIT defect prediction problem leads to remarkably better results. We test our approach on 14 open-source projects and show that our best model can predict whether or not a code change will lead to a defect with an F1 score as high as 77.55%. This represents an increase of as much as 46.72% over the state-of-the-art in JIT defect prediction. We describe limitations, open challenges, and how this method can be used for operational JIT defect prediction.
    An Expert System for Redesigning Software for Cloud Applications. (arXiv:2109.14569v3 [cs.LG] UPDATED)
    Cloud-based software has many advantages. When services are divided into many independent components, they are easier to update. Also, during peak demand, it is easier to scale cloud services (just hire more CPUs). Hence, many organizations are partitioning their monolithic enterprise applications into cloud-based microservices. Recently there has been much work using machine learning to simplify this partitioning task. Despite much research, no single partitioning method can be recommended as generally useful. More specifically, those prior solutions are "brittle"; i.e. if they work well for one kind of goal in one dataset, then they can be sub-optimal if applied to many datasets and multiple goals. In order to find a generally useful partitioning method, we propose DEEPLY. This new algorithm extends the CO-GCN deep learning partition generator with (a) a novel loss function and (b) some hyper-parameter optimization. As shown by our experiments, DEEPLY generally outperforms prior work (including CO-GCN, and others) across multiple datasets and goals. To the best of our knowledge, this is the first report in SE of such stable hyper-parameter optimization. To aid reuse of this work, DEEPLY is available on-line at https://bit.ly/2WhfFlB.
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v2 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research as it allows researchers to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.
    Towards a Grounded Theory of Causation for Embodied AI. (arXiv:2206.13973v1 [cs.AI])
    There exist well-developed frameworks for causal modelling, but these require rather a lot of human domain expertise to define causal variables and perform interventions. In order to enable autonomous agents to learn abstract causal models through interactive experience, the existing theoretical foundations need to be extended and clarified. Existing frameworks give no guidance regarding variable choice / representation, and more importantly, give no indication as to which behaviour policies or physical transformations of state space shall count as interventions. The framework sketched in this paper describes actions as transformations of state space, for instance induced by an agent running a policy. This makes it possible to describe in a uniform way both transformations of the micro-state space and abstract models thereof, and say when the latter is veridical / grounded / natural. We then introduce (causal) variables, define a mechanism as an invariant predictor, and say when an action can be viewed as a "surgical intervention", thus bringing the objective of causal representation & intervention skill learning into clearer focus.
    Integral Transforms in a Physics-Informed (Quantum) Neural Network setting: Applications & Use-Cases. (arXiv:2206.14184v1 [quant-ph])
    In many computational problems in engineering and science, function or model differentiation is essential, but integration is also needed. An important class of computational problems includes so-called integro-differential equations, which involve both integrals and derivatives of a function. In another example, stochastic differential equations can be written in terms of a partial differential equation for the probability density function of the stochastic variable. To learn characteristics of the stochastic variable based on the density function, specific integral transforms, namely moments, of the density function need to be calculated. Recently, the machine learning paradigm of Physics-Informed Neural Networks emerged, with increasing popularity, as a method to solve differential equations by leveraging automatic differentiation. In this work, we propose to augment the paradigm of Physics-Informed Neural Networks with automatic integration in order to compute complex integral transforms on trained solutions, and to solve integro-differential equations where integrals are computed on-the-fly during training. Furthermore, we showcase the techniques in various application settings, numerically simulating quantum computer-based neural networks as well as classical neural networks.
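    One simple way to realize such automatic integration is a fixed quadrature rule applied to the network output, which stays differentiable with respect to the network parameters and can therefore sit inside the training loss; a minimal sketch, assuming a trapezoidal rule rather than the paper's specific technique.

        import torch

        def trapezoid_integral(net, a, b, n=257):
            # Integral of net(x) over [a, b] by the trapezoidal rule; gradients
            # flow to the network parameters, so this can appear in a PINN loss.
            x = torch.linspace(a, b, n).unsqueeze(1)
            y = net(x).squeeze(1)
            h = (b - a) / (n - 1)
            return h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

        # e.g. a k-th moment of a learned density p_theta on [a, b]:
        # moment_k = trapezoid_integral(lambda x: x**k * p_theta(x), a, b)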
    Supervised Learning with General Risk Functionals. (arXiv:2206.13648v1 [stat.ML])
    Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of Hölder risk functionals for which the closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all Hölder risk functionals and over all hypotheses. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (widely studied subset of Hölder risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks.
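    As a concrete member of the distortion-risk family, here is an empirical Conditional Value at Risk objective (the mean of the worst tail of per-example losses); the tail-averaging estimator is standard, shown as a sketch rather than the paper's exact procedure.

        import torch

        def cvar_loss(losses, alpha=0.95):
            # losses: (N,) per-example losses; average the worst (1 - alpha) fraction.
            k = max(1, int(round((1 - alpha) * losses.numel())))
            tail, _ = torch.topk(losses, k)   # subgradients flow through the selected tail
            return tail.mean()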
    Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL. (arXiv:2206.14057v1 [cs.LG])
    While the primary goal of the exploration phase in reward-free reinforcement learning (RF-RL) is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that safety constraints hardly increase the sample complexity for RF-RL.
    Studying Generalization Through Data Averaging. (arXiv:2206.13669v1 [stat.ML])
    The generalization of machine learning models has a complex dependence on the data, model and learning algorithm. We study train and test performance, as well as the generalization gap given by the mean of their difference over different data set samples to understand their ``typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing test generalization only depends on data-averaged parameter distribution and the data-averaged loss. We show that for a large class of model parameter distributions a modified generalization gap is always non-negative. By specializing further to parameter distributions produced by stochastic gradient descent (SGD), along with a few approximations and modeling considerations, we are able to predict some aspects about how the generalization gap and model train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the Cifar10 classification task based on a ResNet architecture.
    Differentially Private Algorithms for Statistical Verification of Cyber-Physical Systems. (arXiv:2004.00275v2 [cs.LG] UPDATED)
    Statistical model checking is a class of sequential algorithms that can verify specifications of interest on an ensemble of cyber-physical systems (e.g., whether 99% of cars from a batch meet a requirement on their energy efficiency). These algorithms infer the probability that given specifications are satisfied by the systems with provable statistical guarantees by drawing sufficient numbers of independent and identically distributed samples. During the process of statistical model checking, the values of the samples (e.g., a user's car energy efficiency) may be inferred by intruders, causing privacy concerns in consumer-level applications (e.g., automobiles and medical devices). This paper addresses the privacy of statistical model checking algorithms from the point of view of differential privacy. These algorithms are sequential, drawing samples until a condition on their values is met. We show that revealing the number of samples drawn can violate privacy. We also show that the standard exponential mechanism that randomizes the output of an algorithm to achieve differential privacy fails to do so in the context of sequential algorithms. Instead, we relax the conservative requirement in differential privacy that the sensitivity of the output of the algorithm should be bounded for any perturbation of any data set. We propose a new notion of differential privacy which we call expected differential privacy. Then, we propose a novel expected sensitivity analysis for the sequential algorithm and a corresponding exponential mechanism that randomizes the termination time to achieve the expected differential privacy. We apply the proposed mechanism to statistical model checking algorithms to preserve the privacy of the samples they draw. The utility of the proposed algorithm is demonstrated in a case study.
    Statistical inference with implicit SGD: proximal Robbins-Monro vs. Polyak-Ruppert. (arXiv:2206.12663v2 [stat.ML] UPDATED)
    The implicit stochastic gradient descent (ISGD), a proximal version of SGD, is gaining interest in the literature due to its stability over (explicit) SGD. In this paper, we conduct an in-depth analysis of the two modes of ISGD for smooth convex functions, namely proximal Robbins-Monro (proxRM) and proximal Polyak-Ruppert (proxPR) procedures, for their use in statistical inference on model parameters. Specifically, we derive non-asymptotic point estimation error bounds of both proxRM and proxPR iterates and their limiting distributions, and propose on-line estimators of their asymptotic covariance matrices that require only a single run of ISGD. The latter estimators are used to construct valid confidence intervals for the model parameters. Our analysis is free of the generalized linear model assumption that has limited the preceding analyses, and employs feasible procedures. Our on-line covariance matrix estimators appear to be the first of this kind in the ISGD literature.
    Understanding Benign Overfitting in Nested Meta Learning. (arXiv:2206.13482v1 [cs.LG] CROSS LISTED)
    Meta learning has demonstrated tremendous success in few-shot learning with limited supervised data. In those settings, the meta model is usually overparameterized. While conventional statistical learning theory suggests that overparameterized models tend to overfit, empirical evidence reveals that overparameterized meta learning methods still work well -- a phenomenon often called "benign overfitting." To understand this phenomenon, we focus on the meta learning settings with a challenging nested structure that we term the nested meta learning, and analyze its generalization performance under an overparameterized meta learning model. While our analysis uses the relatively tractable linear models, our theory contributes to understanding the delicate interplay among data heterogeneity, model adaptation and benign overfitting in nested meta learning tasks. We corroborate our theoretical claims through numerical simulations.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v2 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(m/N+1/m+\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(1/N+{({(m^{-1}\lambda^2)}^{\frac{\alpha}{2}}+ m^{-\alpha})}/{N^{1-\frac{\alpha}{2}}})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.  ( 2 min )
    Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning. (arXiv:2108.03706v3 [stat.ML] UPDATED)
    The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments.  ( 3 min )
    Stochastic linear optimization never overfits with quadratically-bounded losses on general data. (arXiv:2202.06915v2 [cs.LG] UPDATED)
    This work provides test error bounds for iterative fixed point methods on linear predictors -- specifically, stochastic and batch mirror descent (MD), and stochastic temporal difference learning (TD) -- with two core contributions: (a) a single proof technique which gives high probability guarantees despite the absence of projections, regularization, or any equivalents, even when optima have large or infinite norm, for quadratically-bounded losses (e.g., providing unified treatment of squared and logistic losses); (b) locally-adapted rates which depend not on global problem structure (such as condition numbers and maximum margins), but rather on properties of low norm predictors which may suffer some small excess test error. The proof technique is an elementary and versatile coupling argument, and is demonstrated here in the following settings: stochastic MD under realizability; stochastic MD for general Markov data; batch MD for general IID data; stochastic MD on heavy-tailed data (still without projections); stochastic TD on Markov chains (all prior stochastic TD bounds are in expectation).  ( 2 min )
    Rankings from multimodal pairwise comparisons. (arXiv:2206.13580v1 [stat.ML])
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.  ( 2 min )
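    For readers new to the area, the classical single-mode Bradley-Terry model that this paper generalizes can be fit with Hunter's MM updates in a few lines; a minimal sketch with a hypothetical win matrix (the paper's multimodal EM variant is its contribution and is not reproduced here):

        import numpy as np

        # wins[i, j] = number of times competitor i beat competitor j (hypothetical data)
        wins = np.array([[0, 7, 9],
                         [3, 0, 6],
                         [1, 4, 0]], dtype=float)

        n = wins + wins.T          # total comparisons between each pair
        w = np.ones(len(wins))     # Bradley-Terry strengths, initialised uniformly

        for _ in range(200):       # Hunter's MM updates increase the likelihood monotonically
            total_wins = wins.sum(axis=1)
            denom = n / (w[:, None] + w[None, :])
            np.fill_diagonal(denom, 0.0)
            w = total_wins / denom.sum(axis=1)
            w /= w.sum()           # fix the scale (strengths are defined up to a constant)

        ranking = np.argsort(-w)   # best to worst
        print(ranking, w)

    Under the model, competitor i beats j with probability w_i / (w_i + w_j); the paper's setting then asks how to combine several such comparison modes of unknown informativeness.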
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v4 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging, let alone a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.  ( 3 min )
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v1 [cs.LG])
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are strongly distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, called neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments using synthetic datasets and multiple case studies on real-world datasets.  ( 2 min )
    Memory Safe Computations with XLA Compiler. (arXiv:2206.14148v1 [cs.LG])
    Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, the implementations of memory-intensive algorithms that are convenient in terms of software design can often not be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside the computational framework. This impairs the adoption and use of such algorithms. To address this, we developed an XLA compiler extension that adjusts the computational data-flow representation of an algorithm according to a user-specified memory limit. We show that k-nearest neighbour and sparse Gaussian process regression methods can be run at a much larger scale on a single device, where standard implementations would have failed. Our approach leads to better use of hardware resources. We believe that further focus on removing memory constraints at a compiler level will widen the range of machine learning methods that can be developed in the future.  ( 2 min )
    Nonparametric, Nonasymptotic Confidence Bands with Paley-Wiener Kernels for Band-Limited Functions. (arXiv:2206.13629v1 [stat.ML])
    The paper introduces a method to construct confidence bands for bounded, band-limited functions based on a finite sample of input-output pairs. The approach is distribution-free w.r.t. the observation noises and only the knowledge of the input distribution is assumed. It is nonparametric, that is, it does not require a parametric model of the regression function and the regions have non-asymptotic guarantees. The algorithm is based on the theory of Paley-Wiener reproducing kernel Hilbert spaces. The paper first studies the fully observable variant, when there are no noises on the observations and only the inputs are random; then it generalizes the ideas to the noisy case using gradient-perturbation methods. Finally, numerical experiments demonstrating both cases are presented.  ( 2 min )
    Electronic-structure properties from atom-centered predictions of the electron density. (arXiv:2206.14087v1 [physics.chem-ph])
    The electron density of a molecule or material has recently received major attention as a target quantity of machine-learning models. A natural choice to construct a model that yields transferable and linear-scaling predictions is to represent the scalar field using a multi-centered atomic basis analogous to that routinely used in density fitting approximations. However, the non-orthogonality of the basis poses challenges for the learning exercise, as it requires accounting for all the atomic density components at once. We devise a gradient-based approach to directly minimize the loss function of the regression problem in an optimized and highly sparse feature space. In so doing, we overcome the limitations associated with adopting an atom-centered model to learn the electron density over arbitrarily complex datasets, obtaining extremely accurate predictions. The enhanced framework is tested on 32-molecule periodic cells of liquid water, presenting enough complexity to require an optimal balance between accuracy and computational efficiency. We show that starting from the predicted density a single Kohn-Sham diagonalization step can be performed to access total energy components that carry an error of just 0.1 meV/atom with respect to the reference density functional calculations. Finally, we test our method on the highly heterogeneous QM9 benchmark dataset, showing that a small fraction of the training data is enough to derive ground-state total energies within chemical accuracy.  ( 3 min )
    AutoInit: Automatic Initialization via Jacobian Tuning. (arXiv:2206.13568v1 [stat.ML])
    Good initialization is essential for training Deep Neural Networks (DNNs). Oftentimes such initialization is found through a trial-and-error approach, which has to be applied anew every time an architecture is substantially modified, or is inherited from smaller networks, leading to sub-optimal initialization. In this work we introduce a new and cheap algorithm that allows one to find a good initialization automatically, for general feed-forward DNNs. The algorithm utilizes the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality. We solve the dynamics of the algorithm for fully connected networks with ReLU and derive conditions for its convergence. We then extend the discussion to more general architectures with BatchNorm and residual connections. Finally, we apply our method to ResMLP and VGG architectures, where the automatic one-shot initialization found by our method shows good performance on vision tasks.  ( 2 min )
    Dynamic Memory for Interpretable Sequential Optimisation. (arXiv:2206.13960v1 [cs.LG])
    Real-world applications of reinforcement learning for recommendation and experimentation face a practical challenge: the relative reward of different bandit arms can evolve over the lifetime of the learning agent. To deal with these non-stationary cases, the agent must forget some historical knowledge, as it may no longer be relevant to minimise regret. We present a solution to handling non-stationarity that is suitable for deployment at scale, to provide business operators with automated adaptive optimisation. Our solution aims to provide interpretable learning that can be trusted by humans, whilst responding to non-stationarity to minimise regret. To this end, we develop an adaptive Bayesian learning agent that employs a novel form of dynamic memory. It enables interpretability through statistical hypothesis testing, by targeting a set point of statistical power when comparing rewards and adjusting its memory dynamically to achieve this power. By design, the agent is agnostic to different kinds of non-stationarity. Using numerical simulations, we compare its performance against an existing proposal and show that, under multiple non-stationary scenarios, our agent correctly adapts to real changes in the true rewards. In all bandit solutions, there is an explicit trade-off between learning and achieving maximal performance. Our solution sits on a different point on this trade-off when compared to another similarly robust approach: we prioritise interpretability, which relies on more learning, at the cost of some regret. We describe the architecture of a large-scale deployment of automatic optimisation-as-a-service where our agent achieves interpretability whilst adapting to changing circumstances.  ( 3 min )

  • Open

    Yandex Open-Sources YaLM Model With 100 Billion Parameters
    Transformers are used for translation and text summarization tasks because they can analyze sequential input data, such as natural language. Transformers use the self-attention mechanism and weigh the importance of each component of the input data differently. Large-scale transformer-based language models have recently gained a lot of popularity in the disciplines of computer vision and natural language processing (NLP). They frequently grow in size and complexity, yet constructing these models costs millions of dollars, requires hiring the best experts, and takes years. Because of this, many companies have been unable to use them, and only major IT organizations have access to this cutting-edge technology. To address these problems, Yandex has developed the largest YaLM model to date, which uses 100 billion parameters. This largest GPT-like neural network for English is currently available for free. The researchers used a pool of 800 A100 graphics cards and 1.7 TB of online materials, books, and countless other sources to train the model over the course of 65 days. They have published the model and relevant materials on GitHub under the Apache 2.0 license, allowing both academic and commercial use. Continue reading | Github submitted by /u/shobha-kakkar [link] [comments]  ( 84 min )
    AI Dream 58 - Unbelievable Explosive Midjourney
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    How can I get free access to a computer server to use AI and Photoshop there? My PC is very old; is there a service which provides a free trial?
    submitted by /u/TheblackRook3 [link] [comments]  ( 82 min )
    Google's latest image AI Parti beats Imagen, which is only four weeks old (and DALL-E 2 as well)
    submitted by /u/henlo_there_fren [link] [comments]  ( 83 min )
    First photo I've published from NightCafe
    submitted by /u/PineappleTreePro [link] [comments]  ( 82 min )
    Annotated KDD 2022 paper - Learning Backward Compatible Embeddings
    I read a super interesting KDD 2022 paper recently - "Learning Backward Compatible Embeddings". The paper tackles a common industry problem of ensuring compatibility of newer embeddings with an older downstream model. An annotated version of the paper - Annotated-ML-Papers/Learning Backward Compatible Embeddings.pdf submitted by /u/shreyansh26 [link] [comments]  ( 82 min )
    A New Technique to Train Diffusion Model in Latent Space Using Limited Computational Resources While Maintaining High-Resolution Quality
    In recent years, image synthesis has experienced exponential growth in performance. The two main approaches to this task have been autoregressive transformers (ARs) and generative adversarial networks (GANs). The former are trained for sequence prediction and are able to generate images, token by token, starting from the first one. The latter are based on the famous generator-discriminator method, where the generator tries to fool the discriminator by generating realistic samples. Nevertheless, both approaches have significant limitations: in particular, ARs require billions of parameters to be trained, while GANs rely on the minimax loss, which has been shown to often lead to mode collapse and training instability. Diffusion models (DMs) have recently shown excellent results in different image synthesis tasks. They are based on two stages: in the first, noise is added to the data step by step as a Markov chain, meaning that each step depends solely on the previous step. This process is repeated until most of the information in the original sample is lost. Then, a denoising process is applied, aiming to reconstruct the image from the noisy version. Continue reading | Checkout the paper and github https://i.redd.it/4w5te1twbe891.gif submitted by /u/shobha-kakkar [link] [comments]  ( 84 min )
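    As a concrete illustration of the forward (noising) stage described above, the standard DDPM identity lets one jump straight to any step t of the Markov chain; a minimal sketch, where the linear noise schedule and tensor shapes are illustrative assumptions:

        import torch

        T = 1000
        betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear noise schedule
        alphas_bar = torch.cumprod(1.0 - betas, dim=0)

        def q_sample(x0, t):
            # closed form q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I):
            # jump directly to step t of the Markov noising chain
            noise = torch.randn_like(x0)
            a_bar = alphas_bar[t]
            return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

        x0 = torch.randn(1, 3, 64, 64)                 # stand-in for a training image
        x_noisy = q_sample(x0, t=999)                  # nearly all information destroyed

    The denoising network is then trained to invert this process step by step; the paper's contribution concerns running that process in a compressed latent space rather than pixel space.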
    A World Undone" collection so far | NFT's for environmental protection
    submitted by /u/VictorTuring [link] [comments]  ( 83 min )
    "A World Undone" collection so far | NFT's for environmental protection
    submitted by /u/VictorTuring [link] [comments]  ( 82 min )
    Getting started in AI that analyses data for someone who already knows programming...?
    Hey there! I've been making games and programming to do that for around 8 years now. I have a pretty good know-how of most major programming languages, technologies, techniques in that realm by now, but there's one thing that I've always struggled with: AI. Whenever I try to research it, I always seem to end up going down "buzz-word loopholes" as I like to call them, similar, to a certain extent, to how if you try to research VR game dev now you might end up looking into the "metaverse" and stuff... I find lots of articles / YouTube videos that explain either one very specific thing or they go too abstract and explain how it works but not how to actually do it. What I'm really interested in is designing algorithms like YouTube's, TikTok's or Google's, the ones that analyse large datasets and alter the platform depending on the results. I know quite a bit of this is machine learning now, but I actually want to gain a good understanding of how to write these algorithms and how to actually implement machine learning to make them, since I can think of many use cases where AI like this could be used in game development and other areas I'm interested in - Not to mention, this just sounds like a fun thing to learn! I'm happy to work with whatever languages and to learn new tools (of course!), but what I am really interested in is learning to create AI that analyses data specifically. All I've found so far when I wasn't just hitting those "buzz-word loopholes" was simple AI that can do things like solving sudoku or more complex ones like analysing images - But the issue is to a certain degree I already understand those sub-topics, and it isn't really the kind of AI I'm interested in learning about. TLDR; So yeah, if anyone has any recommendations of resources specifically targeted at AI that analyses data, or if I'm completely wrong and need to learn something else first, please chuck us a comment, it'd be much appreciated! submitted by /u/Ping-and-Pong [link] [comments]  ( 86 min )
    Hi, is there an AI of some sort which I can feed a bunch of random images and let it create a sort of blend of them?
    submitted by /u/disnotmeiswear [link] [comments]  ( 83 min )
    AI Art Charity Project
    I am working on a Midjourney-based project to raise money for the AI for Good Foundation (ai4good.org). If anyone is interested in hearing more, please feel free to send me a dm. We are looking for volunteers in a few different areas: 1) (Extremely) part time experts in ML/AI/GANs to answer people's questions 2) Artists to make AI/human collaboration art and post it in our "cyborg gallery" 3) People to make AI art using Midjourney (we have invites) and post them to Reddit etc. with a link to our Discord server. 4) (most important) people to handle prompt requests and generate/send people their results This project was given the green light by a team member at MJ, so they are fine with our charity project. Thanks for reading, and once again, please reach out if you are interested! submitted by /u/Accomplished_Head5 [link] [comments]  ( 83 min )
    Human biases in Artificial Intelligence
    submitted by /u/HumanSeeing [link] [comments]  ( 83 min )
    What is the computation cost of a DALL-E image generation?
    submitted by /u/theo_champion [link] [comments]  ( 82 min )
    AI GENERATED ART (but it is horrifying)
    submitted by /u/CALP_is_holy [link] [comments]  ( 82 min )
    Google's powerful AI spotlights a human cognitive glitch: Mistaking fluent speech for fluent thought
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    I Made an AI That Punishes Me if it Detects That I am Procrastinating on My Assignments
    submitted by /u/_ayushp_ [link] [comments]  ( 86 min )
    Elect Lamda
    submitted by /u/IwishIwasinOhio [link] [comments]  ( 83 min )
    BOSSCHAERT BOUQUET | 4K 24 FPS (FILM EDIT) | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 83 min )
  • Open

    [D] Showing the important design decisions in MLOps you made in your past jobs
    I've heard a lot that it's not just the tools that matter the most in the field of machine learning and MLOps, but mostly the design decisions one has to make in order to push stable models to production. And the design patterns vary from one organisation to another. While applying for a new job, how does an MLOps engineer showcase the most important design decisions they made in their past career? submitted by /u/metalvendetta [link] [comments]  ( 84 min )
    Understanding the difference between Time Series Analysis and "normal" Prediction with Regression in Forecasting? [D]
    I am currently working on a dataset for earnings of a specific market segment. For that I have created a dataset with the earnings as my dependent variable and multiple market parameters as my independent variables. For that I have accumulated data on weather, avg. speed and so on for each month over 30 years, in total 80 different variables. Now I wanted to create a forecast with different time shifts (1 month, 2, 3, 4...). The goal is to have a forecast for market movement with data some x months prior. "How could earnings look next month with the knowledge of today". I used different regression methods, compared them and now have a model that can predict these values with an accuracy X. The results themselves aren't bad, but not as good as I imagined. However, I think that's normal regardin…  ( 87 min )
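    One common way to set up the "x months prior" variants described in this post is to lag the predictors rather than the target, fitting one model per horizon; a minimal pandas sketch, where the file and column names are hypothetical:

        import pandas as pd

        df = pd.read_csv("market_data.csv", parse_dates=["month"])  # hypothetical file/columns

        horizons = [1, 2, 3, 4]
        feature_cols = [c for c in df.columns if c not in ("month", "earnings")]

        for h in horizons:
            lagged = df[feature_cols].shift(h)    # predictors as they were known h months earlier
            lagged["earnings"] = df["earnings"]   # target stays at the current month
            lagged = lagged.dropna()
            # fit and evaluate one regression model per horizon h on `lagged`

    This keeps the evaluation honest: at prediction time the model only ever sees information that was actually available h months before the target month.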
    [D][P] YOLOv6: state-of-the-art object detection at 1242 FPS
    YOLOv6 has been making a lot of noise in the past 24 hours. Based on its performance, rightfully so. YOLOv6 is a single-stage object detection framework dedicated to industrial applications, with a hardware-friendly efficient design and high performance. It outperforms YOLOv5 in accuracy and inference speed, making it the best open-source version of the YOLO architecture for production applications. I dived into the technical details published by the research group and made a qualitative and quantitative comparison between the results of YOLOv5 and YOLOv6. I invite you to read about all of these, with a bit of history on YOLO, in my new blog submitted by /u/RepresentativeCod613 [link] [comments]  ( 84 min )
    Creating and Analyzing a Dataset of Roe v. Wade Tweets Labeled by Abortion Stance [P]
    How do pro-choice vs. pro-life twitter users differ? I built a free, labeled dataset of #RoeVsWade tweets, and an ML classifier on top. Some insights: Pro-life users are 20.4x more likely to put "christ" and 16.1x more likely to put "maga" in their bio.Pro-choice users are 7.5x more likely to put "blm" and 6.5x more likely to put "she/her". Full analysis + link to raw dataset here. submitted by /u/BB4evaTB12 [link] [comments]  ( 84 min )
    [N] PyTorch 1.12: TorchArrow, Functional API for Modules and NvFuser
    PyTorch 1.12 Release Notes — contents: Highlights; Backwards Incompatible Changes; New Features; Improvements; Performance; Documentation. Highlights: We are excited to announce the release of PyTorch 1.12! This release is composed of over 3124 commits from 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, and the FSDP API. We want to sincerely thank our dedicated community for your contributions. Summary: Functional Module API to functionally apply module computation with a given set of parameters; Complex32 and complex convolutions in PyTorch; DataPipes from TorchData fully backward compatible with DataLoader; functorch with improved coverage for APIs; nvFuser, a deep learning compiler for PyTorch; changes to float32 matrix multiplication precision on Ampere and later CUDA hardware; TorchArrow, a new beta library for machine learning preprocessing over batch data. https://github.com/pytorch/pytorch/releases/tag/v1.12.0 https://pytorch.org/blog/pytorch-1.12-released/ submitted by /u/DreamFlasher [link] [comments]  ( 85 min )
    [D] [P] Questions about the usability of Shapley values on large feature spaces.
    Hello! I am planning a research project which involves creating a classification DNN that takes in a frame from a molecular dynamics simulation of a protein which encodes each amino acid's level of energetic interaction and tries to predict whether that frame came from protein state "A" or protein state "B." I want to analyze the feature importance, that is, the importance of each amino acid's energetic interaction level for making the classification prediction. Although I have heard of some interesting applications of Shapley values to perform such an analysis of feature importance, the input layer structure of the model I am thinking of making would require 100+ neurons as there are 100+ features. The reason why the feature space is so large is because I am investigating how a model learns which amino acids are most important for the model to make a classification prediction for which state a protein is in, where the protein is 100+ amino acids in length. Can Shapley methods handle a feature space of a model that large, and would the computational cost of such a process be infeasible? Apologies if this question is a little unclear; let me know if anything needs to be clarified. Thanks! submitted by /u/ben_cow [link] [comments]  ( 85 min )
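    Exact Shapley values are exponential in the number of features, but permutation-sampling approximations scale roughly linearly per sample, so 100+ features is feasible (the SHAP library's sampling/kernel explainers work along similar lines). A minimal model-agnostic Monte Carlo sketch, where `model`, `x` and `background` are assumed inputs:

        import numpy as np

        def shapley_sample(model, x, background, n_perm=200):
            # Monte Carlo Shapley estimate: average the marginal contribution of each
            # feature over random orderings, drawing "absent" features from background data
            d = x.shape[0]
            phi = np.zeros(d)
            for _ in range(n_perm):
                order = np.random.permutation(d)
                z = background[np.random.randint(len(background))].copy()
                prev = model(z[None])[0]
                for j in order:
                    z[j] = x[j]                 # reveal feature j
                    cur = model(z[None])[0]
                    phi[j] += cur - prev
                    prev = cur
            return phi / n_perm

    The cost is roughly n_perm * (d + 1) forward passes; at 200 permutations and d ≈ 100 that is about 20k predictions per explained frame, which a DNN can do in seconds when batched.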
    [R] Annotated KDD 2022 paper - Learning Backward Compatible Embeddings
    I read a super interesting KDD 2022 paper recently - "Learning Backward Compatible Embeddings". The paper tackles a common industry problem of ensuring compatibility of newer embeddings with an older downstream model. An annotated version of the paper - Annotated-ML-Papers/Learning Backward Compatible Embeddings.pdf submitted by /u/shreyansh26 [link] [comments]  ( 84 min )
    [D] Run apps and dev environments in the cloud with a single command
    Hi everyone, I'm the creator of dstack, a tool that makes it easier to train models in the cloud. Our tool allows extending it with custom providers to support different languages, frameworks, etc. All the built-in providers are also open-source. Today, we've released a new update that extends the capabilities of dstack beyond training models, and now also allows users to quickly build and share apps with Streamlit, Gradio, and FastAPI in the cloud – in just a few clicks. Similarly to apps, it's possible to run dev environments with the required hardware and data access in one command from the terminal. All you have to do is link your own AWS account to run commands. I invite everyone to read it and share their thoughts. Happy to discuss the approach and what would be great to have! Blog post: https://blog.dstack.ai/introducing-apps-and-dev-environments P.S.: Currently, it's possible to run models and apps only in the configured cloud. If you'd like the tool to also allow you to run it locally, and if you would like this part to be open-source too, please leave comments! 🤗 submitted by /u/cheptsov [link] [comments]  ( 85 min )
    [R] Softmax Linear Units
    submitted by /u/the_great_magician [link] [comments]  ( 83 min )
    [R] Probabilistic Numerics: Computation as Machine Learning (Free Book!)
    Abs: Probabilistic numerical computation formalises the connection between machine learning and applied mathematics. Numerical algorithms approximate intractable quantities from computable ones. They estimate integrals from evaluations of the integrand, or the path of a dynamical system described by differential equations from evaluations of the vector field. In other words, they infer a latent quantity from data. This book shows that it is thus formally possible to think of computational routines as learning machines, and to use the notion of Bayesian inference to build more flexible, efficient, or customised algorithms for computation. The text caters for Masters' and PhD students, as well as postgraduate researchers in artificial intelligence, computer science, statistics, and applied mathematics. Extensive background material is provided along with a wealth of figures, worked examples, and exercises (with solutions) to develop intuition. Link to book: https://www.probabilistic-numerics.org/textbooks/ submitted by /u/bikeskata [link] [comments]  ( 85 min )
    [D] Have compression techniques ever been applied to the likes of GPT-3 & DALLE-2?
    Large language models and the recent spurt of diffusion-based text-to-image models are gosh-darn fun to play with, but due to their size and expensive training costs, they're only accessible via an API or if you yourself have access to a large # of GPUs. Yet there are also a number of compression techniques like pruning and quantization that can drastically reduce the size (+90%), and thus the computational requirements, of a trained model. Has there been any work looking at applying such techniques to the gigantic models floating around, to make them more accessible? submitted by /u/Farconion [link] [comments]  ( 86 min )
    [P] Clustering long documents with Transformers in 10 minutes
    Transformers are awesome for so many things in 2022, but one thing I've found them to struggle with is generating embeddings for long documents. I put together a blog post going through some interesting techniques. Let me know if it helped you! Blog post submitted by /u/BlockDesigns [link] [comments]  ( 84 min )
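    One common recipe for this (not necessarily the one in the blog post) is to chunk each document to fit the encoder's context window, embed the chunks, mean-pool, and then cluster; a minimal sketch assuming the sentence-transformers and scikit-learn packages, with dummy data standing in for a real corpus:

        import numpy as np
        from sentence_transformers import SentenceTransformer
        from sklearn.cluster import KMeans

        model = SentenceTransformer("all-MiniLM-L6-v2")

        def embed_long(doc, chunk_words=200):
            # split past the transformer's context limit, embed chunks, mean-pool
            words = doc.split()
            chunks = [" ".join(words[i:i + chunk_words])
                      for i in range(0, len(words), chunk_words)]
            return model.encode(chunks).mean(axis=0)

        docs = ["some long document text " * 200 for _ in range(20)]  # your corpus here
        X = np.stack([embed_long(d) for d in docs])
        labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)

    Mean-pooling is crude (it blurs multi-topic documents), so weighting chunks or clustering at the chunk level are common refinements.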
    [N] Quaterion, a blazingly fast framework for similarity learning.
    Just released. Quaterion — an open source framework for training and fine-tuning similarity learning models. It enables you to train models significantly (100x) faster, and iterate over experiments in minutes instead of hours even with a laptop GPU. It takes advantage of the PyTorch Lightning backend to make a flexible and scalable learning pipeline. GitHub https://github.com/qdrant/quaterion Here is a demo of the caching functionality. https://i.redd.it/9qi8gf9n4d891.gif submitted by /u/devzaya [link] [comments]  ( 84 min )
    [p] RestifyML - AI/ML Tool for Developers to quickly experiment with data and generate AI/ML REST API to consume back into their application
    Developers can use RestifyML to: create data science experiments; create a data source and upload CSV data within the experiment; do data cleansing and sanitization; visualize raw data using data exploration; select features which would help in building models; build models and save or export them; and finally, deploy models and expose them as REST APIs to consume from any application. Profit! https://github.com/rebataur/RestifyML Feedback / feature requests appreciated. submitted by /u/rebataur [link] [comments]  ( 85 min )
    [D]Can a transformer neural network learn to predict sequences longer than it saw?
    Simple task: a transformer has to repeat a sequence of random integers (0-9) of varied length, like: sequence length=7: input [1, 3, 5, 6, 2, 4, 0] - output [1, 3, 5, 6, 2, 4, 0]; sequence length=3: input [5, 4, 9] - output [5, 4, 9]; sequence length=4: input [6, 3, 9, 8] - output [6, 3, 9, 8] ... Each integer (0-9) can be stored in an embedding layer so we can pass it to the transformer. I trained a transformer (a generic PyTorch model with positional embeddings) on a dataset (1000 examples) of sequences of varied length (1 to 12) and it predicts sequences well within the range of 12. It fails to predict sequences longer than 12-13. sequence length=20: input [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 9, 0, 9, 1, 5, 2, 3, 6] - output [3, 3, 4, 0, 0, 7, 1, 5, 1, 0, 7, 1, 7, 1, 7, 1, 0, 7, 0, 7]. Is it considered an extrapolation task? Are there types of transformers (or other neural networks) that can handle the problem? Same issue with recurrent neural networks (RNN, LSTM, GRU). submitted by /u/InternationalVisito [link] [comments]  ( 89 min )
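    This is indeed a length-extrapolation problem: learned absolute position embeddings simply have no representation for positions beyond those seen in training. One positional scheme designed specifically to extrapolate is ALiBi (Press et al., 2021), which drops position embeddings and instead penalizes attention logits in proportion to query-key distance; a minimal sketch of a bidirectional variant, with the head-slope constants as in the paper:

        import torch

        def alibi_bias(n_heads, seq_len):
            # per-head slopes form a geometric sequence: 2^(-8/n), 2^(-16/n), ...
            slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
            pos = torch.arange(seq_len)
            dist = (pos[None, :] - pos[:, None]).abs()      # |i - j|, defined for any length
            return -slopes[:, None, None] * dist.float()    # shape (n_heads, seq_len, seq_len)

        # usage inside attention: logits = q @ k.transpose(-2, -1) / d_head**0.5
        # logits = logits + alibi_bias(n_heads, seq_len)    # then softmax as usual

    Because the bias is a function of distance rather than a lookup table, the same model can be evaluated at sequence lengths it never saw during training.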
    [N] PyTorch 1.12 released
    PyTorch 1.12 is available through the pytorch conda channel and PyPI (release notes, issue tracker). Highlights: We are excited to announce the release of PyTorch 1.12! This release is composed of over 3124 commits from 433 contributors. Along with 1.12, we are releasing beta versions of AWS S3 Integration, PyTorch Vision Models on Channels Last on CPU, Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16, and the FSDP API. We want to sincerely thank our dedicated community for your contributions. Summary: Functional Module API to functionally apply module computation with a given set of parameters; Complex32 and complex convolutions in PyTorch; DataPipes from TorchData fully backward compatible with DataLoader; functorch with improved coverage for APIs; nvFuser, a deep learning compiler for PyTorch; changes to float32 matrix multiplication precision on Ampere and later CUDA hardware; TorchArrow, a new beta library for machine learning preprocessing over batch data. Other notable changes: CUDA 11.6 wheels; torch.amp module. submitted by /u/M4mb0 [link] [comments]  ( 85 min )
    [P] DALL-E Mini stripped to its bare essentials and converted to PyTorch
    submitted by /u/pcaversaccio [link] [comments]  ( 86 min )
    [R] Welcome to my continuous, free live machine learning class with intermediate mathematics
    Dear all, Welcome to join my continued ML knowledge dissemination class via Zoom. I will continue to explain machine learning using intermediate-level mathematics. It happens every second Thursday at 11:00 GMT (7pm HK / 9pm SYD) - the next class is on June 30. The current topic is: "Determinantal Point Process". I'll fully explain its beautiful mathematics over a period of a few sessions. This is a powerful model for modeling diverse subsets, yet it is not as commonly used as it should be! You can find my notes on my GitHub site: https://github.com/roboticcam/machine-learning-notes/ The Determinantal Point Process notes can be found at: https://github.com/roboticcam/machine-learning-notes/blob/master/files/dpp_new.pdf You need a solid understanding of linear algebra, calculus, probability and statistics. But if you just want to get a feel for how DPPs work, for example, and meet like-minded people, please come too! To join, sign up for whichever of the meetup groups you see fit: https://www.meetup.com/machine-learning-hong-kong/ https://www.meetup.com/deep-learning-sydney/ https://www.meetup.com/Deep-Learning-Melbourne/ https://www.meetup.com/machine-learning-athens/ submitted by /u/MLknowledge [link] [comments]  ( 85 min )
    [D] Surface rendering in Diffusion Probability Text-to-Image Generators.
    Two diffusion text-to-image generators are Google's Imagen and OpenAI's DALL·E 2. DALL·E 2 uses a multimodal large language model called CLIP to encode an input text prompt. The output is produced by a reverse encoder called a diffusion probabilistic model. Diffusion models have previously seen huge successes in image super-resolution and denoising. One peculiar aspect of DALL·E 2's output is that it is capable of generating light sources in certain (seemingly) 3D locations in the scene, then correctly lighting the objects based on their implied locations. DALL·E 2 can also perform image completions from a starting image prompt. The two examples below are a SpongeBob dish sponge in a sink and Vermeer's famous earring painting. https://i.imgur.com/vVI6IOI.png https://i.imgur.com/8h48lTg.png One plausible explanation for these physically accurate surface reflections is that DALL·E 2 performs a phase where the image is reverse-encoded into a 3D scene. That scene is then rendered back into a 2D output image. However, when consulting the primary literature, no such conversion to a 3D model is seen anywhere along the DALL·E 2 workflow. The implication is that DALL·E 2 must contain a wealth of priors related to light transport, gleaned from 2D training images alone. This means these priors are being applied (mostly correctly) to particular instantiations of objects and surfaces in scenes. This application is performed even to the point where wet metallic surfaces have correct blurring in reflections. Further investigations of this phenomenon would involve finding some user prompts that generate a scene containing light casting a sharp shadow onto a flat surface. Another would be requesting a reflective object in the text prompt itself. Your thoughts? submitted by /u/moschles [link] [comments]  ( 87 min )
    [P] First-class Dims - a generalization of einops and named tensors
    Jupyter Notebook: https://colab.research.google.com/drive/1BsVkddtVMX35aZAvo2GyI-wSFPVBCWuA Github: https://github.com/facebookresearch/torchdim Some tweet threads about it Mine: https://twitter.com/cHHillee/status/1541536627746426881 Sasha Rush: https://twitter.com/srush_nlp/status/1541526906113298433 submitted by /u/programmerChilli [link] [comments]  ( 84 min )
    [D] How to evaluate the gain of a new feature without training?
    When evaluating the effectiveness of a new feature, it is common to train a model with/without this feature to compare the difference. But sometimes training a model based on huge amounts of data is both time- and energy-consuming. I was wondering if there are some lightweight ways to estimate the importance of the new feature without training? Computing descriptive statistics such as feature coverage, histograms and correlation matrices might be necessary; are there other pre-processing methods? submitted by /u/fishiwhj [link] [comments]  ( 84 min )
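    One cheap, training-free screen beyond linear correlation is the mutual information between the candidate feature and the label, estimated on a subsample; a minimal sketch with scikit-learn, where synthetic data stands in for the real feature and labels:

        import numpy as np
        from sklearn.feature_selection import mutual_info_classif

        # X_new: candidate feature column, y: labels (assumed to be loaded already)
        X_new = np.random.rand(10000, 1)
        y = np.random.randint(0, 2, 10000)

        mi = mutual_info_classif(X_new, y, random_state=0)
        print(f"mutual information with label: {mi[0]:.4f} nats")

    The caveat is that mutual information is univariate, so it misses features whose value only shows up in interaction with existing features; a cheap proxy model (e.g., gradient-boosted trees) trained on a small subsample is a slightly heavier but more faithful check.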
    [D] Laplacian positional encodings
    I just finished reading "Benchmarking Graph Neural Networks" (Dwivedi et al. 2020) and "A Generalization of Transformer Networks to Graphs" (also Dwivedi et al. 2020), and came across the claim that the eigenvectors of the Laplacian of a graph "represent a natural generalization of the Transformer (Vaswani et al., 2017) positional encodings (PE)". Xavier Bresson tweeted the same thing. So I worked out the eigenvectors of the Laplacian of a path graph (a line of vertices connected by edges like so: v-v-v-...-v), which is the kind of graph used in NLP to represent a sequence of tokens, and found that the $i$th eigenvector's $k$th entry is $v_i(k) = \cos(\pi i k / n - \pi i / 2n)$, where $n$ is the number of tokens in the sequence, which is very different from the sinusoidal PEs used in transformers in NLP. I tried working out a change of variables, but nothing's worked so far. Are Laplacian eigenvectors just not the generalizations they're claimed to be, or am I missing something here? submitted by /u/hegelian_waffle [link] [comments]  ( 85 min )
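    The closed form is easy to sanity-check numerically; a sketch (here `i` indexes eigenvalues in ascending order, and eigenvectors are only defined up to sign):

        import numpy as np

        n = 8
        L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        L[0, 0] = L[-1, -1] = 1            # path graph: degree 1 at the endpoints

        eigvals, eigvecs = np.linalg.eigh(L)

        k = np.arange(1, n + 1)
        i = 2                              # check the i-th eigenvector
        analytic = np.cos(np.pi * i * k / n - np.pi * i / (2 * n))
        analytic /= np.linalg.norm(analytic)

        print(np.allclose(np.abs(eigvecs[:, i]), np.abs(analytic), atol=1e-8))

    As for the claimed generalization: on a path graph the Laplacian eigenvectors are cosines at linearly spaced frequencies $\pi i / n$, whereas NLP transformer PEs use sines and cosines at geometrically spaced frequencies $10000^{-2j/d}$, so the correspondence is qualitative (both are families of sinusoids indexed by position) rather than an exact change of variables.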
  • Open

    Exploring emerging topics in artificial intelligence policy
    The second AI Policy Forum Symposium convened global stakeholders across sectors to discuss critical policy questions in artificial intelligence.  ( 7 min )
  • Open

    Gaussian Processes for Cartpole Environments
    Good day all, I have previously seen some fitted Q-iteration tutorials in a cartpole environment in which neural networks were used to update the Q-values (e.g., https://github.com/seungjaeryanlee/implementations-nfq/tree/master/nfq). I am interested in doing something similar, only that I have to replace those neural-network estimators with Gaussian processes. Please can anyone recommend some useful tutorials (free/paid) for using Gaussian processes in a cartpole setup? I have some but they are a little too theoretical, with little or no practical programming. I will also appreciate links to some libraries or repos that provide more insight on the subject matter. submitted by /u/Thin-Ad9581 [link] [comments]  ( 83 min )
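    In fitted Q-iteration the function approximator only appears in the supervised regression step, so swapping the NFQ network for a GP is close to a drop-in change; a minimal scikit-learn sketch, where the stand-in transitions (s, a, r, s2) would be replaced by data gathered from the cartpole environment:

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        gamma, n_actions = 0.99, 2

        def fit_q(s, a, r, s2, n_iters=20):
            # exact GPs are O(n^3) in the number of transitions: subsample or use sparse GPs
            X = np.hstack([s, a[:, None]])
            q = GaussianProcessRegressor(kernel=RBF()).fit(X, r)   # Q_0 regresses the reward
            for _ in range(n_iters):
                # Bellman targets: r + gamma * max_a' Q(s', a')
                q_next = np.column_stack([
                    q.predict(np.hstack([s2, np.full((len(s2), 1), b])))
                    for b in range(n_actions)])
                q = GaussianProcessRegressor(kernel=RBF()).fit(X, r + gamma * q_next.max(axis=1))
            return q

        # hypothetical stand-in transitions (replace with real cartpole data)
        s = np.random.randn(200, 4); a = np.random.randint(n_actions, size=200)
        r = np.random.randn(200);    s2 = np.random.randn(200, 4)
        q = fit_q(s, a, r, s2)

    A side benefit of the GP is its predictive variance, which can drive exploration, though libraries like GPyTorch or GPflow scale better than exact scikit-learn GPs for larger transition sets.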
    "DALL·E 2 Pre-Training Mitigations", Nichol 2022 (how OA censored it: heavy filtering by training a classifier w/active-learning; reweighting; dupe deletion)
    submitted by /u/gwern [link] [comments]  ( 83 min )
    Animo Island makes machine learning fun and easy to learn so that anyone can harness the power of reinforcement learning! 🤖 🏝️
    submitted by /u/AnimoIsland [link] [comments]  ( 83 min )
    Suicidal Agents (blog post)
    Hey guys, I wrote my first blog post on RL about changing the reward function by a constant and how this can result in a different policy. At first thought this feels strange, since the constant should not affect the expected sum of returns! Please let me know what you think. https://ea-aguilar.gitbook.io/rl-vault/food-for-thought/suicidal-agents Also, I'm not such a big fan of Medium because I want to keep the option to write more equations, but it seems it's the de facto place to blog about ML/RL. Do you recommend also posting there? Context: A couple of years ago I made a career switch into RL - and recently have been wanting to write more. So as an exercise, I want to start writing down some cute observations/thoughts about RL. I figure this could also help some people out there who are just now venturing into the field. submitted by /u/EdAlexAguilar [link] [comments]  ( 85 min )
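    A quick worked example of why the puzzle resolves itself in episodic settings: shift every reward by a constant $c$, so an episode of length $T$ gains $\sum_{t=1}^{T} \gamma^{t-1} c$, which depends on $T$ (with $\gamma = 1$ it is exactly $cT$). When the policy influences episode length, the shift therefore reorders policies: a negative enough $c$ makes early termination optimal (hence "suicidal" agents), while a positive $c$ rewards stalling. Only when every trajectory has the same (or infinite) horizon does the constant cancel.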
    Simple continuous environment with spaceship but yet challenging for RL algorithms (like SAC, TD3)
    Hello All. We have designed a set of continuous reinforcement learning environments with locomotion tasks in space. The goal is to navigate a (planar) spaceship to reach prescribed goals, or to enter a prescribed orbit. The tasks seem simple in general, but we were surprised that they pose a serious challenge for vanilla RL approaches. We learned a lot from the environment design process. We found it particularly challenging to shape the reward function appropriately, such that the RL algorithm converges to a satisfactory control. We used stable-baselines3 implementations of SAC, TD3 and PPO with default hyperparameters (tuned for MuJoCo). One set of environments is about reaching consecutive goals (regenerated randomly). In case there are 2 planets, the SAC agent performs perfectly and matches the human baseline score (we have a keyboard-controlled agent) of 4715 ± 799. [Figures: SAC & TD3 evaluation curves and the best SAC agent on the 2-planet goal env] When an additional planet is added, the SAC agent performs poorly, and its performance is far from the human baseline score of 4659 ± 747. [Figures: SAC & TD3 evaluation curves and the best SAC agent on the 3-planet goal env] In case of 4 planets the performance drops even further. We could not explain the dramatic performance drop when increasing the number of planets from 2 to more (3, 4). The agent seemingly could not learn the principles of gravitational force here. Any ideas which RL algorithm would do better here? We plan to take a look at Physics-Informed RL. In case you want to take a look, the envs are published here: https://github.com/MIMUW-RL/space-gym Best, Jacek submitted by /u/dzako1 [link] [comments]  ( 85 min )
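    One theoretical tool relevant to the reward-shaping difficulty described above: potential-based shaping, $F(s, s') = \gamma \Phi(s') - \Phi(s)$ for any potential function $\Phi$ (Ng, Harada & Russell, 1999), is the only additive shaping form guaranteed to leave the optimal policy unchanged, so a dense term such as $\Phi(s) = -\|s - s_{\text{goal}}\|$ can speed up learning without biasing what the agent ultimately optimises.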
    Actions that you can only take once
    We are working on developing a DQN approach to sequence actions. The actions can only be taken once. I have read in several threads that you can prevent illegal actions from being selected both during learning (taking the max value over only legal actions) and during actual policy implementation (same), and this way your policy always stays legal. But my question is: do you need to supply the list of "exhausted actions" as part of the state? How would the Q-network know what value to expect - when the remaining actions are completely determined by the actions already taken - if they are not supplied as part of the state at the input of the network? I have not found a single reference where the need to input the exhausted actions as part of the state is described. Any help or guidance would be greatly appreciated. C submitted by /u/Fresh-Literature-623 [link] [comments]  ( 85 min )
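    To the question itself: yes - once actions are consumed, the set of remaining actions is part of the true state, and omitting it makes the process non-Markov from the network's perspective (two histories with different exhausted sets would otherwise look identical). A common fix is to feed a binary availability mask in as part of the observation and also use it to mask the max; a minimal PyTorch sketch, where the sizes are assumptions:

        import torch
        import torch.nn as nn

        n_actions, obs_dim = 10, 16

        class MaskedDQN(nn.Module):
            def __init__(self):
                super().__init__()
                # the availability mask is concatenated to the observation -> Markov state
                self.net = nn.Sequential(nn.Linear(obs_dim + n_actions, 128), nn.ReLU(),
                                         nn.Linear(128, n_actions))

            def forward(self, obs, mask):
                q = self.net(torch.cat([obs, mask], dim=-1))
                return q.masked_fill(mask == 0, float("-inf"))  # illegal actions never win the max

        q_net = MaskedDQN()
        obs = torch.randn(1, obs_dim)
        mask = torch.ones(1, n_actions)
        mask[0, 3] = 0                      # action 3 already used
        action = q_net(obs, mask).argmax(dim=-1)

    The same mask is applied when computing the target max over next-state actions, so neither action selection nor bootstrapping ever credits an exhausted action.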
    I am new to RL, problem understanding on how to apply it
    Hey! I am very new to reinforcement learning and I am writing my bachelor thesis on a game where I have to use a learning method; however, I don't really seem to understand how to solve it. I hope the question is fine. I have to read a paper and implement it, then turn it into a repeated case and apply simple learning to it. The paper is about a pirate-farmer game where there are 3 islands. The farmer chooses an island and plants flowers on it. The islands have different sizes: one holds 3 flowers, one holds 4 and the last 8. If the pirate chooses the same island as the farmer, he gets the flowers; otherwise the farmer keeps them. This game is then played over multiple rounds, and the paper basically talks about the Nash equilibrium probabilities with which both players choose each island. I have talked to my tutor about it and she told me to try and apply Q-learning to it; however, I don't exactly understand how to do that. When I read about Q-learning and watched videos about it, people used it mostly for a treasure hunt, to find the shortest path from one location to another. However I don't understand how to make the game repeated without changing the game itself, if that makes sense? Sorry if the question doesn't make a lot of sense; like I said I am still pretty new at it. submitted by /u/False-Bluebird-3538 [link] [comments]  ( 85 min )
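    One way to see it: in a repeated matrix game there is effectively a single state, so "Q-learning" reduces to one Q-value per island for each player, updated after every round - there is no grid to walk as in the treasure-hunt tutorials. A minimal sketch of two independent ε-greedy learners playing the game as described in the post (note that independent learners may oscillate around, rather than converge exactly to, the mixed Nash equilibrium):

        import numpy as np

        sizes = np.array([3.0, 4.0, 8.0])     # flowers on each island
        alpha, eps, n_rounds = 0.1, 0.1, 50000
        rng = np.random.default_rng(0)

        q_farmer = np.zeros(3)                # stateless: one Q-value per island choice
        q_pirate = np.zeros(3)

        def pick(q):
            return int(rng.integers(3)) if rng.random() < eps else int(np.argmax(q))

        for _ in range(n_rounds):
            f, p = pick(q_farmer), pick(q_pirate)
            r_farmer = 0.0 if f == p else sizes[f]   # pirate steals on a match
            r_pirate = sizes[p] if f == p else 0.0
            q_farmer[f] += alpha * (r_farmer - q_farmer[f])
            q_pirate[p] += alpha * (r_pirate - q_pirate[p])

        # compare the empirical island-choice frequencies against the paper's
        # Nash equilibrium probabilities

    The "repeated" part is just the loop: the one-shot payoff structure is unchanged, and each round provides one sample for the learning update.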
  • Open

    Create audio for content in multiple languages with the same TTS voice persona in Amazon Polly
    Amazon Polly is a leading cloud-based service that converts text into lifelike speech. Following the adoption of Neural Text-to-Speech (NTTS), we have continuously expanded our portfolio of available voices in order to provide a wide selection of distinct speakers in supported languages. Today, we are pleased to announce four new additions: Pedro speaking US Spanish, […]  ( 5 min )
    New built-in Amazon SageMaker algorithms for tabular data modeling: LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 7 min )
    Semantic segmentation data labeling and model training using Amazon SageMaker
    In computer vision, semantic segmentation is the task of classifying every pixel in an image with a class from a known set of labels such that pixels with the same label share certain characteristics. It generates a segmentation mask of the input images. For example, the following images show a segmentation mask of the cat […]  ( 9 min )
    Deep demand forecasting with Amazon SageMaker
    Every business needs the ability to predict the future accurately in order to make better decisions and give the company a competitive advantage. With historical data, businesses can understand trends, make predictions of what might happen and when, and incorporate that information into their future plans, from product demand to inventory planning and staffing. If […]  ( 10 min )
  • Open

    DALL·E 2 Pre-Training Mitigations
    In order to share the magic of DALL·E 2 with a broad audience, we needed to reduce the risks associated with powerful image generation models. To this end, we put various guardrails in place to prevent generated images from violating our content policy. This post focuses on pre-training  ( 13 min )
  • Open

    NVIDIA Teams With HPE to Take AI From Edge to Cloud
    Enterprises now have a new option for quickly getting started with NVIDIA AI software: the HPE GreenLake edge-to-cloud platform. The NVIDIA AI Enterprise software suite is an end-to-end, cloud-native suite of AI and data analytics software. It’s optimized to enable any organization to use AI, and doesn’t require deep AI expertise. Fully supported by NVIDIA, Read article >  ( 5 min )
    Detect to Protect: Taiwan Hospital Deploys Real-Time AI Risk Prediction for Kidney Patients
    Taiwan has nearly 85,000 kidney dialysis patients — the highest prevalence in the world based on population density. Taipei Veterans General Hospital (TVGH) is working to improve outcomes for these patients with an AI model that predicts heart failure risk in real time during dialysis procedures. Cardiovascular disease is the leading cause of death for Read article >  ( 7 min )
  • Open

    time series classification
    Does anyone know any good books or tutorials on time series classification using recurrent neural networks (LSTM)? Currently working on an EHR dataset and need to classify/predict disease. I know I can use normal classifiers (e.g., SVM or XGBoost) but wanted to avoid the feature engineering that comes with them and thought neural networks would be the way to go. Just need good guidance on how to go about implementing it via a book or tutorial. Much appreciated submitted by /u/Abeokuta_ [link] [comments]  ( 84 min )
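    For orientation while hunting for a tutorial: the core pattern for sequence classification with an LSTM is only a few lines in Keras; a minimal sketch with assumed shapes (pad variable-length visit histories with zeros and let the Masking layer skip the padding):

        import tensorflow as tf

        n_timesteps, n_features, n_classes = 50, 32, 2   # assumed EHR sequence shape

        model = tf.keras.Sequential([
            tf.keras.layers.Masking(mask_value=0.0, input_shape=(n_timesteps, n_features)),
            tf.keras.layers.LSTM(64),                    # learns temporal features directly
            tf.keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        # model.fit(X_train, y_train, validation_split=0.2, epochs=20)

    Most book chapters and tutorials on clinical sequence models build on exactly this pattern, adding embedding layers for coded events and attention on top.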
    Converting TensorFlow Keras model API to model subclassing
    For a simple TF2 Object detection CNN architecture defined using Keras's functional API, a batch of data is obtained as:

        example, label = next(data_generator(batch_size = 32))
        example.keys()      # dict_keys(['image'])
        image = example['image']
        image.shape         # (32, 144, 144, 3)
        label.keys()        # dict_keys(['class_out', 'box_out'])
        label['class_out'].shape, label['box_out'].shape   # ((32, 9), (32, 2))

    The CNN architecture defined using Keras's functional API is:

        input_ = Input(shape = (144, 144, 3), name = 'image')
        # name - An optional name string for the Input layer. Should be unique in
        # a model (do not reuse the same name twice). It will be autogenerated if it isn't provided.
        # Here 'image' is the Python3 dict's key used to map the data to one of the layers in the model.
        x = input_
        # Define a c…  ( 85 min )
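    A minimal sketch of the model-subclassing equivalent; the convolutional stack below is a placeholder (the original definition is truncated above), and the essential points are accepting the 'image' key and returning a dict keyed 'class_out'/'box_out' so the labels map onto the outputs:

        import tensorflow as tf

        class Detector(tf.keras.Model):
            def __init__(self, **kwargs):
                super().__init__(**kwargs)
                self.conv = tf.keras.layers.Conv2D(32, 3, activation="relu")  # placeholder stack
                self.pool = tf.keras.layers.GlobalAveragePooling2D()
                self.class_out = tf.keras.layers.Dense(9, activation="softmax", name="class_out")
                self.box_out = tf.keras.layers.Dense(2, name="box_out")

            def call(self, inputs):
                x = inputs["image"] if isinstance(inputs, dict) else inputs
                x = self.pool(self.conv(x))
                return {"class_out": self.class_out(x), "box_out": self.box_out(x)}

        model = Detector()
        out = model({"image": tf.zeros((32, 144, 144, 3))})
        # out["class_out"].shape == (32, 9); out["box_out"].shape == (32, 2)

    Because the output dict keys match the label dict keys, model.compile can take per-output losses such as {"class_out": "categorical_crossentropy", "box_out": "mse"} exactly as with the functional version.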
  • Open

    What is Social Media Content Moderation and how Moderation Companies use various Techniques to…
    Moderation is the process of filtering unwanted content from online platforms like social media networking sites. And it is…  ( 8 min )
    DALL·E 2 — The AI artist that can create and edit images for you!
    “Homer Simpson reacting to the crash of Bitcoin” Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 10 min )

  • Open

    made with starryai
    submitted by /u/rikusorasephiroth [link] [comments]  ( 83 min )
    When your phone knows you get no bitches
    submitted by /u/asscheeseterps710 [link] [comments]  ( 82 min )
    Secured AI-related position within my current company, plan on moving internally in the Fall. How do I negotiate salary in this case?
    Hi, I currently work in the auto industry, and have recently solidified an opportunity to transfer from my current role in Manufacturing to a role within AI, specifically focusing on Autonomous Driving. I currently work as a data scientist, and am responsible for setting up pipelines, modeling, forecasting, etc.. I am fluent in Python and have some basic introductory experience in Neural Nets and using image tensors (a capstone project in undergrad, graduate school). I currently make 70k but would like to obviously aim higher, given the current market and skills required for this job. I have 3.5 years of experience when looking at my career from a general "computer science" point of view. What is a reasonable amount to expect in a position like this? Should I throw out a large number and work backwards with my company from there? The position is Remote, but based out of MI, USA. I live in the States. Thanks. submitted by /u/Mr15ization [link] [comments]  ( 84 min )
    Weekly China AI News: CVPR 2022 Recap; Meituan Proposes YOLOv6; Tencent Invests in Data Processing Unit Firm
    submitted by /u/trcytony [link] [comments]  ( 82 min )
    An Artificial Intelligence chatbot powers profitability for a multinational bank
    submitted by /u/Diana-RS [link] [comments]  ( 82 min )
    Last Week in AI: AI learns to do tasks in Minecraft, Instagram AI scans faces for age verification, Amazon launches AI pair programming tool, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 83 min )
    LaMDA’s Sentience is Nonsense - Here’s Why
    submitted by /u/regalalgorithm [link] [comments]  ( 82 min )
    The possibility of general Artificial Intelligence
    submitted by /u/Diana-RS [link] [comments]  ( 82 min )
    Device42: AI Webinar Tomorrow
    Hey All, Just wanted to give a quick reminder that Device42 is hosting an AI webinar with award-winning author Steve Shwartz (Evil Robots, Killer Computers, and Other Myths) and our CMO Yama Habibzai tomorrow, June 28th at 11 AM EDT, as they discuss the impact of AI in IT and how you can leverage it to achieve more. Save your seat today. Cheers. submitted by /u/Device42_Phil [link] [comments]  ( 83 min )
    A Tutorial on Generating Images from Text Prompts with VQGAN-Clip, Python, and TensorFlow
    View the tutorial here: HERE This tutorial teaches you how to convert any text prompt to an image using VQGAN-Clip. For example, you could use the prompt "A spray painting of a waiting computer and a bedroom in the style of Edgar Degas and Art Nouveau". This would generate the following image: https://imgur.com/J3qGlc4 Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 83 min )
    Two MSc options, not sure which one to go for. Any advice would be appreciated
    Hi all, Hope everyone is having a good Monday (or Tuesday, depending on where you are). I have an undergraduate degree in Economics. I have spent a few years working now (mainly in economic analysis), but I am looking at doing an MSc in the DS/ML/AI space, as I think it could help my current job; equally, I have a genuine interest in the fields, so I could utilise it for a career switch. The two courses I am looking at are: https://online.essex.ac.uk/courses/msc-data-science/#overview https://online.essex.ac.uk/courses/msc-artificial-intelligence/ I am unsure whether it is better to go a bit more general and opt for the DS course, or go for the AI course. Personally, I feel I already have the skills covered by some of the DS modules, such as data visualisation, which is pushing me towards AI. Furthermore, when I have done some basic NLP, I have enjoyed it a lot, and that's available in the AI course but not the DS course (you can see what is included in each course structure, under module choice). In terms of future career, as I said above, a potential career switch (even if it means I have to go into an entry-level job before climbing up the ladder again). A PhD would also be an option. Do you have any advice or general thoughts on the two courses above? Cheers all submitted by /u/chickenparmo [link] [comments]  ( 85 min )
    Large language models have a reasoning problem
    submitted by /u/bendee983 [link] [comments]  ( 83 min )
    AI That Passes the Turing Test Doesn't Guarantee Consciousness (3-minute audio clip from Lex Fridman & Sam Harris)
    submitted by /u/justine01923 [link] [comments]  ( 83 min )
    GPT-3 Powered Mac Writing App - Works Across All Applications
    Hello everyone, I have recently soft-launched a Mac app called Elephas that lets you write faster across applications on your Mac. It was even trending on HackerNews for a few hours. See the attached GIFs for how it works. Email: https://i.redd.it/iv3y0gdqm5891.gif It uses your own OpenAI keys and works across almost all applications, like Mail, Messages, Pages, Google Docs, and Gmail/Outlook. It also has features for: sentence rewriting (such as professional and friendly modes), fixing grammar mistakes, and translation support. I hope you will find it helpful. Feel free to share your feedback :) submitted by /u/juliarmg [link] [comments]  ( 83 min )
    How do I start (multi-question post)
    How would one go about building a "real-life" Jarvis? I am interested in learning about and attempting to build an AI. This may sound very stupid of me, but I would like to learn to build something that can do everything on its own. Would it be possible for me to build an AI that I could teach like I would a child? I have all these ideas in my head of making an AI that starts off as a child, and you teach it and teach it until it is what you want it to be. Where would I even start? Is it even possible for me to do something like this? submitted by /u/ITZSELLABGAMING [link] [comments]  ( 84 min )
    Bootcamp or Master for learning AI?
    I've wanted to learn Data Science/AI for a long time now. My question is: is a bootcamp or a master's better for learning? If a bootcamp is better, is there one you could recommend? Thanks in advance :) submitted by /u/ale3x_ [link] [comments]  ( 82 min )
    Google engineer identifies anonymous faces in WWII photos with AI facial recognition
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    What kind of artificial intelligence can I use to rewrite and summarize texts, cookbooks, non-fiction books, etc.?
    If possible, the AI should remove all non-factual material from the cookbooks/non-fiction books to create a short text that is dense with content and free of filler words or the private stories added just to make the book longer and more expensive. submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    When you were in school, arriving at the correct answer with the wrong method wouldn’t get you credit…
    And now one of artificial intelligence’s greatest strengths is that it can solve things, arriving at a valid answer, and we have no idea how it did it. It’s called efficient when AI does it! submitted by /u/shawster [link] [comments]  ( 83 min )
    Learning to Play Minecraft with Video PreTraining (VPT)
    submitted by /u/AChickenInAHole [link] [comments]  ( 82 min )
    It just walks!
    submitted by /u/FreeFriedMen [link] [comments]  ( 83 min )
    How the AI be walking on the 17th generation
    submitted by /u/PedroRibs [link] [comments]  ( 84 min )
  • Open

    Data & Analytics Regression Playbook: Make Your Data Work Harder…And Smarter!
    With a potential recession lurking on the horizon, 99% of companies will make the same old “safe” mistakes: hunker down, let people go, shrink, and hope to hold on for dear life. However, growth-oriented organizations will see this as a business opportunity – an opportunity to leverage their data to “do more with less”.  You… Read More »Data & Analytics Regression Playbook: Make Your Data Work Harder…And Smarter! The post Data & Analytics Regression Playbook: Make Your Data Work Harder…And Smarter! appeared first on Data Science Central.  ( 22 min )
  • Open

    [R] Theoretical Open Research Areas
    Hello everyone, my goal is to do research in the field of machine learning for motion planning/robotics in general. I'm really interested in the theoretical/mathematical side of the field. However, I noticed that the majority of the field consists of very experimental papers where architectures are built and benchmarked without any thorough underlying theory. So my question is: are there any theoretical research areas in machine learning for motion planning/robotics in general? It would be nice if someone could also point me to some labs/researchers working in that direction. Thank you very much. submitted by /u/-aplusib- [link] [comments]  ( 84 min )
    "A Path Towards Autonomous Machine Intelligence" - Yann LeCun
    submitted by /u/s7v7nsilver [link] [comments]  ( 84 min )
    [R] Can I use whole-protein embeddings on isolated domains?
    I'm interested in studying properties of particular protein domains. One idea is to take advantage of state-of-the-art protein embedding models, such as this, most of which are based on transformers. Some of the domains I'm studying are found in large proteins, which have multiple other domains in the same chain. Therefore, I believe it might be more informative to obtain embeddings not of each protein as a whole, but just of the domains. However, I worry that the embeddings would be all off, since the model expects a complete sequence. Has anyone tried this before? Are there pre-trained domain-level embeddings? submitted by /u/OmOshIroIdEs [link] [comments]  ( 84 min )
    [N] Inverse Scaling Prize: $250k in prizes for finding tasks where larger language models do worse
    We're used to finding that task performance scales well with large increases in sizes of language models. But for real-world applications, it's also very meaningful to search for failure cases preemptively to fix the underlying issues. Can you find and convincingly demonstrate these failure cases where language models scale inversely, with larger models behaving worse? You don't necessarily need to have extra deep knowledge of ML or language models in order to participate and win, because all models are frozen and you only need to come up with the right data. Check out these resources to learn more! Announcement Twitter thread, contest details on Github. The deadline for the first round of the contest is August 27, 2022. submitted by /u/alexlyzhov [link] [comments]  ( 87 min )
    [Discussion] [computer vision] Does Instant NeRF create quality depth maps?
    Surprised I haven't seen more chatter about this. What do you think about Nvidia's Instant NeRF, which turns 2D into 3D based on these techniques: https://arxiv.org/abs/2003.10016 Does the output of a NeRF give a depth map that's comparable to what you'd get from a Kinect? Can these be used to create 3D models one would use in Unreal or Blender? submitted by /u/KalloDotIO [link] [comments]  ( 84 min )
    [D] Do you have any suggestions for a crowd-sourced annotation tool?
    We're currently doing research on computational social science, specifically on online toxicity. We have lots of text data, but we don't have annotations. As part of the research, we are thinking of annotating the text using a crowd-sourcing approach. Do any of you know of any open-source tool that we could employ to ease up the process? submitted by /u/vigneshwaranpersonal [link] [comments]  ( 84 min )
    [D] Stack - Seamless data collaboration and versioning
    Hey r/MachineLearning! We are the co-founders of Stack, a hub for data collaboration and versioning. We are developing this tool to help ML teams automatically track changes in their data seamlessly. We are opening a waiting list for our beta, which we aim to release soon. You can sign up at: https://www.getstack.ai/ We are also actively looking for feedback. Feel free to share any comments or thoughts! submitted by /u/baceituno [link] [comments]  ( 84 min )
    [P] I published a tutorial about ML model deployment
    The deployment of ML models in production is a delicate process filled with challenges. You can deploy a model via a REST API, on an edge device, or as an offline unit used for batch processing. You can build the deployment pipeline from scratch, or use ML deployment frameworks. In my new mini-series, you'll learn best practices for deploying your ML models. I try to concentrate everything into 2 videos, to keep the series short and sweet. The first video provides a theoretical overview of ML deployment. You'll learn about: different strategies to deploy ML in production; the main ML deployment tools on the market (TF Serving, MLFlow Model, Seldon Deploy, KServe from Kubeflow); and BentoML and its features. Here's the video: https://www.youtube.com/watch?v=Mrv3CZNWYEg submitted by /u/diabulusInMusica [link] [comments]  ( 85 min )
    [D] Has anyone trained the latent diffusion models by CompVis? Need some help
    I am trying to train a latent-diffusion model by following the instructions on the repo; however, I am running into errors while sampling from the checkpointed models. Can someone help? I am getting errors while trying to sample with sample_diffusion.py from a custom model trained on LSUN Churches. submitted by /u/icelebratefestivus [link] [comments]  ( 85 min )
    [D] IBM Zurich Research Plagiarised Our Paper and Got It Published at CVPR 2022. Is copying text plagiarism, but copying an idea not?
    I am Xianbiao Qi, a computer vision researcher with more than ten years of research experience. I am writing this blog to complain about a serious case of deliberate plagiarism of our paper by employees of IBM Zurich Research. They did not copy the text; they copied the idea. Our preprint paper on arXiv is Jiaquan Ye, Xianbiao Qi, Yelin He, et al., "PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML," arXiv preprint arXiv:2105.01848, May 2021, and the code was also released. Our paper (Ye et al., arXiv:2105.01848) was plagiarised by a team at IBM Zurich Research: Ahmed Nassar, Nikolaos Livathinos, Maksym Lysak, and Peter Staar, "TableFormer: Table Structure Understanding with Transformers." In Proceedings of the IE…  ( 97 min )
    [D] State-of-the-art permutation-invariant graph embeddings
    Suppose I have a data set consisting of weighted undirected simple graphs. I would like to learn a vector representation of these graphs. What are the state-of-the-art (2022) architectures/methods for learning such representations? Ideally, the representations are permutation-invariant. For what it's worth, I am only interested in the case where graphs (vertices, edges, and their respective weights) are fully observed; I'm not interested in cases with unobserved nodes. An additional requirement is that the embedding must have a lower dimension than the number of nodes. submitted by /u/heylibrarian [link] [comments]  ( 87 min )
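    Not a state-of-the-art answer, but a useful baseline to keep in mind: any architecture that computes permutation-equivariant per-node features and then sum-pools them is permutation-invariant by construction (the DeepSets/GIN-readout idea). A toy NumPy sketch, with a random feature map standing in for a learned MLP:

    import numpy as np

    def graph_embedding(A, d=16, seed=0):
        # A: (n, n) symmetric weighted adjacency matrix.
        feats = np.stack([
            A.sum(axis=1),        # weighted degree
            (A > 0).sum(axis=1),  # unweighted degree
            (A @ A).diagonal(),   # weight of 2-step returns
        ], axis=1)                # (n, 3), permutation-equivariant rows
        W = np.random.default_rng(seed).normal(size=(3, d))
        h = np.tanh(feats @ W)    # shared map applied to every node
        return h.sum(axis=0)      # sum over nodes -> invariant, shape (d,)

    Learned versions of this recipe (e.g. GIN or PNA with a sum/mean readout) are the usual starting points for the fully observed setting described above.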
    [P] Skipgram: neural network instead of lookup table
    I'm looking for papers which use the skip-gram model but, instead of a lookup table, use a neural network. The use case: instead of sentences of words, I want to use sequences of human behavior where additional information is available, e.g. think sequences of visited Amazon products. Cold-start also happens to be very common, and I'm thinking that using a neural network instead of a lookup embedding table would be better. Updated with more context: The typical usage of skip-gram is for learning word embeddings in text, where each word has an embedding learned through skip-gram. However, there is nothing limiting the usage of skip-gram to text. A popular way to use skip-gram in i2i recommendation systems is to treat a session of products browsed by the user as a sequence and to have an embedding per product (e.g. see the KDD 2018 winning paper from Airbnb). However, the question I have here is: instead of having one embedding per product, can we use a neural network whose output layer is the embedding layer? This way we can backprop through the neural network. The reason is that we have more information for products than we do for words. submitted by /u/curiousML5 [link] [comments]  ( 86 min )
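    A minimal PyTorch sketch of the idea in the question: keep the skip-gram negative-sampling loss, but produce both center and context vectors with a shared MLP over item features instead of an nn.Embedding lookup, so cold-start items get embeddings from their attributes. Feature dimensions and layer sizes are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContentSkipGram(nn.Module):
        def __init__(self, n_features, dim=64):
            super().__init__()
            # Shared encoder replaces the per-item embedding table.
            self.encoder = nn.Sequential(
                nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, dim))

        def forward(self, center_feats, context_feats, neg_feats):
            c = self.encoder(center_feats)     # (B, dim)
            pos = self.encoder(context_feats)  # (B, dim)
            neg = self.encoder(neg_feats)      # (B, K, dim) sampled negatives
            pos_score = (c * pos).sum(-1)                            # (B,)
            neg_score = torch.bmm(neg, c.unsqueeze(-1)).squeeze(-1)  # (B, K)
            # Standard skip-gram negative-sampling objective.
            return -(F.logsigmoid(pos_score).mean()
                     + F.logsigmoid(-neg_score).mean())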
    [D] For Perceiver (IO) with single-channel audio, are position encodings even necessary?
    I've been looking into using the Perceiver for a project that involves single-channel (mono) audio. Among the existing implementations and tutorials, I can't find one that only does audio. It seems like in the papers they rearrange the audio into patches and add position encodings, but this is a hack to bring the audio modality into the same size tensor as other modalities. If only using 1D audio, is there any need for position encodings at all? submitted by /u/WigglyHypersurface [link] [comments]  ( 84 min )
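    For anyone wanting to run the ablation: Fourier position features of the kind the Perceiver papers use are cheap to generate for a 1D signal, so it is easy to test with and without them. A rough sketch (the band count and max frequency here are arbitrary choices, not the papers' exact settings):

    import numpy as np

    def fourier_position_encoding(n_positions, n_bands=16, max_freq=1000.0):
        pos = np.linspace(-1.0, 1.0, n_positions)[:, None]          # (N, 1)
        freqs = np.linspace(1.0, max_freq / 2.0, n_bands)[None, :]  # (1, B)
        angles = np.pi * pos * freqs                                # (N, B)
        # Concatenate sin/cos features plus the raw coordinate.
        return np.concatenate([np.sin(angles), np.cos(angles), pos], axis=1)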
  • Open

    Using autograd in TensorFlow to Solve a Regression Problem
    We usually use TensorFlow to build a neural network. However, TensorFlow is not limited to this. Behind the scene, TensorFlow is a tensor library with automatic differentiation capability. Hence we can easily use it to solve a numerical optimization problem with gradient descent. In this post, we are going to show how TensorFlow’s automatic differentiation […] The post Using autograd in TensorFlow to Solve a Regression Problem appeared first on Machine Learning Mastery.  ( 16 min )
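    The core trick the post describes fits in a few lines: record the loss computation on a tf.GradientTape, then step the variables with the resulting gradients. A minimal sketch (not the article's exact code) fitting y = 3x + 2:

    import tensorflow as tf

    # Synthetic data: y = 3x + 2 with a little noise.
    x = tf.random.normal((256, 1))
    y = 3.0 * x + 2.0 + 0.1 * tf.random.normal((256, 1))

    w = tf.Variable(0.0)
    b = tf.Variable(0.0)
    opt = tf.keras.optimizers.SGD(learning_rate=0.1)

    for step in range(200):
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(y - (w * x + b)))  # MSE
        grads = tape.gradient(loss, [w, b])
        opt.apply_gradients(zip(grads, [w, b]))

    print(w.numpy(), b.numpy())  # should approach 3.0 and 2.0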
  • Open

    What is Naive Bayes?
    An introduction to machine learning algorithms  ( 8 min )
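    The algorithm itself reduces to Bayes' rule with a feature-independence assumption, P(y | x) ∝ P(y) · Π_i P(x_i | y), which makes a working example very short; a sketch with scikit-learn's Gaussian variant:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Class priors P(y) plus per-feature normal likelihoods P(x_i | y),
    # combined under the (naive) independence assumption.
    clf = GaussianNB().fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))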
  • Open

    Generating Images from Text Prompts with VQGAN-Clip, Python, and TensorFlow [TUT]
    View the tutorial here: HERE This tutorial teaches you how to convert any text prompt to an image using VQGAN-Clip. For example, you could use the prompt "A spray painting of a waiting computer and a bedroom in the style of Edgar Degas and Art Nouveau". This would generate the following image: https://imgur.com/J3qGlc4 Let me know if you have any questions or comments. submitted by /u/mshriver2 [link] [comments]  ( 83 min )
    Object Localization from scratch TF2
    Object localization trained from scratch for an emoji dataset in TensorFlow 2.8. Getting an IoU = 0.5969 and classification output accuracy = 100%. The code can be found here. Though in fairness, I am using only 9 classes out of the emoji dataset. Thoughts? submitted by /u/grid_world [link] [comments]  ( 82 min )
    Machine Learning AI Goes Through Race Track
    submitted by /u/Plazmeer [link] [comments]  ( 82 min )
  • Open

    [ReReading Reinforcement Learning by Sutton and Barto] Chapter 1 - Introduction
    As some people liked the idea, let's read this together! :) As mentioned in the previous post, the plan is to read one chapter per week (new chapters on Mondays), so it will be a 17-week endeavour. For those who don't know: the latest version of the book can be found here for free: http://incompleteideas.net/book/the-book-2nd.html. Code, errata and other materials can be found there as well. The first week starts off mildly with the introduction chapter, at only 13 pages. It may be worthwhile to use that week to think about how you want to read the book (noteworthy book summary on reading well: https://en.wikipedia.org/wiki/How_to_Read_a_Book) and what you want to do with what you read. Personally, I'm planning to focus on getting the facts written down as Anki flashcards (I can make them available online if people are interested) and following the math and algorithms by hand, so I'll get a notebook with lots of space for errors... Also, some people asked for a Discord server to connect with others, but I personally have no idea how to moderate a Discord server. Dexdev08 was so kind as to recommend the RL Group Discord server (https://discord.gg/RGsYwkJY), and I asked there if we could get a channel for our cause (no answer on that yet, though). I hope that satisfies the need for a Discord server. Happy reading, I hope for some lively discussions. :) submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 84 min )
    In the MADDPG paper, there is a line: "The algorithm does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents…". Can someone explain what differentiable means here?
    Also, I am confused: how does the equality hold here? https://preview.redd.it/jd3m2plza6891.png?width=820&format=png&auto=webp&s=55a3900aa9bd1e50dff31673d0150f6c14acd030 submitted by /u/aabra__ka__daabra [link] [comments]  ( 85 min )
    (Re)Reading Reinforcement Learning by Sutton and Barto
    I'm going to reread the book from start to finish again; maybe some people want to join? I will go for one chapter per week. If people want to join and discuss (and perhaps share notes?), I'd create a new post dedicated to that end every Monday. What do you think? Edit: Here we go! submitted by /u/Accomplished-Ninja31 [link] [comments]  ( 84 min )
    Mujoco Mesh: how can I rotate the orientation of the middle segment in reference to its geometry?
    submitted by /u/disdisinform [link] [comments]  ( 85 min )
  • Open

    Inspect your data labels with a visual, no code tool to create high-quality training datasets with Amazon SageMaker Ground Truth Plus
    Launched at AWS re:Invent 2021, Amazon SageMaker Ground Truth Plus helps you create high-quality training datasets by removing the undifferentiated heavy lifting associated with building data labeling applications and managing the labeling workforce. All you do is share data along with labeling requirements, and Ground Truth Plus sets up and manages your data labeling workflow […]  ( 6 min )
  • Open

    NASA and conformal maps
    A couple years ago I wrote about how NASA was interested in regions bounded by curves of the form [equation not reproduced in this excerpt]. For example, here's a plot for A = 2, B = 1, α = 2.5 and β = 6. That post mentioned a technical report from NASA that explains why these shapes are important in application, […] NASA and conformal maps first appeared on John D. Cook.  ( 6 min )
  • Open

    Megapixel Image Generation with Step-Unrolled Denoising Autoencoders. (arXiv:2206.12351v1 [cs.CV])
    An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further via the combination of techniques - each component representing the current pinnacle of efficiency in their respective areas. These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism, applicable in any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths - four times longer than prior work. Our proposed framework scales to high-resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is flexible: supporting an arbitrary number of sampling steps, sample-wise self-stopping, self-correction capabilities, conditional generation, and a NAR formulation that allows for arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
    Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning. (arXiv:2206.12030v1 [cs.LG])
    It has been a recent trend to leverage the power of supervised learning (SL) towards more effective reinforcement learning (RL) methods. We propose a novel phasic approach by alternating online RL and offline SL for tackling sparse-reward goal-conditioned problems. In the online phase, we perform RL training and collect rollout data while in the offline phase, we perform SL on those successful trajectories from the dataset. To further improve sample efficiency, we adopt additional techniques in the online phase including task reduction to generate more feasible trajectories and a value-difference-based intrinsic reward to alleviate the sparse-reward issue. We call this overall algorithm, PhAsic self-Imitative Reduction (PAIR). PAIR substantially outperforms both non-phasic RL and phasic SL baselines on sparse-reward goal-conditioned robotic control problems, including a challenging stacking task. PAIR is the first RL method that learns to stack 6 cubes with only 0/1 success rewards from scratch.
    STREAMLINE: A Simple, Transparent, End-To-End Automated Machine Learning Pipeline Facilitating Data Analysis and Algorithm Comparison. (arXiv:2206.12002v1 [cs.LG])
    Machine learning (ML) offers powerful methods for detecting and modeling associations often in data with large feature spaces and complex associations. Many useful tools/packages (e.g. scikit-learn) have been developed to make the various elements of data handling, processing, modeling, and interpretation accessible. However, it is not trivial for most investigators to assemble these elements into a rigorous, replicatable, unbiased, and effective data analysis pipeline. Automated machine learning (AutoML) seeks to address these issues by simplifying the process of ML analysis for all. Here, we introduce STREAMLINE, a simple, transparent, end-to-end AutoML pipeline designed as a framework to easily conduct rigorous ML modeling and analysis (limited initially to binary classification). STREAMLINE is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools. It is unique among other autoML tools by offering a fully transparent and consistent baseline of comparison using a carefully designed series of pipeline elements including: (1) exploratory analysis, (2) basic data cleaning, (3) cross validation partitioning, (4) data scaling and imputation, (5) filter-based feature importance estimation, (6) collective feature selection, (7) ML modeling with `Optuna' hyperparameter optimization across 15 established algorithms (including less well-known Genetic Programming and rule-based ML), (8) evaluation across 16 classification metrics, (9) model feature importance estimation, (10) statistical significance comparisons, and (11) automatically exporting all results, plots, a PDF summary report, and models that can be easily applied to replication data.
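    A few of the pipeline elements listed above — cross-validation partitioning, scaling, and imputation — can be illustrated in plain scikit-learn. This is only a sketch of those elements, not STREAMLINE itself:

    from sklearn.datasets import load_breast_cancer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    pipe = Pipeline([
        ('impute', SimpleImputer(strategy='median')),  # element (4): imputation
        ('scale', StandardScaler()),                   # element (4): scaling
        ('model', LogisticRegression(max_iter=1000)),  # one candidate algorithm
    ])
    # Element (3): cross-validation partitioning, scored with one of the metrics.
    print(cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean())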
    Debiasing Learning for Membership Inference Attacks Against Recommender Systems. (arXiv:2206.12401v1 [cs.IR])
    Learned recommender systems may inadvertently leak information about their training data, leading to privacy violations. We investigate privacy threats faced by recommender systems through the lens of membership inference. In such attacks, an adversary aims to infer whether a user's data is used to train the target recommender. To achieve this, previous work has used a shadow recommender to derive training data for the attack model, and then predicts the membership by calculating difference vectors between users' historical interactions and recommended items. State-of-the-art methods face two challenging problems: (1) training data for the attack model is biased due to the gap between shadow and target recommenders, and (2) hidden states in recommenders are not observational, resulting in inaccurate estimations of difference vectors. To address the above limitations, we propose a Debiasing Learning for Membership Inference Attacks against recommender systems (DL-MIA) framework that has four main components: (1) a difference vector generator, (2) a disentangled encoder, (3) a weight estimator, and (4) an attack model. To mitigate the gap between recommenders, a variational auto-encoder (VAE) based disentangled encoder is devised to identify recommender invariant and specific features. To reduce the estimation bias, we design a weight estimator, assigning a truth-level score for each difference vector to indicate estimation accuracy. We evaluate DL-MIA against both general recommenders and sequential recommenders on three real-world datasets. Experimental results show that DL-MIA effectively alleviates training and estimation biases simultaneously, and achieves state-of-the-art attack performance.
    RankSim: Ranking Similarity Regularization for Deep Imbalanced Regression. (arXiv:2205.15236v2 [cs.LG] UPDATED)
    Data imbalance, in which a plurality of the data samples come from a small proportion of labels, poses a challenge in training deep neural networks. Unlike classification, in regression the labels are continuous, potentially boundless, and form a natural ordering. These distinct features of regression call for new techniques that leverage the additional information encoded in label-space relationships. This paper presents the RankSim (ranking similarity) regularizer for deep imbalanced regression, which encodes an inductive bias that samples that are closer in label space should also be closer in feature space. In contrast to recent distribution smoothing based approaches, RankSim captures both nearby and distant relationships: for a given data sample, RankSim encourages the sorted list of its neighbors in label space to match the sorted list of its neighbors in feature space. RankSim is complementary to conventional imbalanced learning techniques, including re-weighting, two-stage training, and distribution smoothing, and lifts the state-of-the-art performance on three imbalanced regression benchmarks: IMDB-WIKI-DIR, AgeDB-DIR, and STS-B-DIR.
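    The target the regularizer encodes can be illustrated directly: for each anchor, compare the ordering of the other samples by label distance with their ordering by feature distance. The blunt rank-correlation check below is illustrative only; RankSim itself optimizes a differentiable ranking surrogate during training:

    import numpy as np
    from scipy.stats import spearmanr

    def rank_agreement(features, labels):
        # features: (n, d) array, labels: (n,) continuous targets.
        f_dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
        y_dist = np.abs(labels[:, None] - labels[None, :])
        # Per-anchor agreement between feature-space and label-space orderings.
        corrs = [spearmanr(f_dist[i], y_dist[i]).correlation
                 for i in range(len(labels))]
        return float(np.mean(corrs))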
    SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting. (arXiv:2206.12190v1 [cs.LG])
    Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable high-quality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves comparable F1 score to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.
    Dynamic network congestion pricing based on deep reinforcement learning. (arXiv:2206.12188v1 [eess.SY])
    Traffic congestion is a serious problem in urban areas. Dynamic congestion pricing is one of the useful schemes to eliminate traffic congestion at a strategic scale. However, in reality, an optimal dynamic congestion pricing scheme is very difficult or impossible to determine theoretically, because road networks are usually large and complicated, and the behavior of road users is uncertain. To account for this challenge, this work proposes a dynamic congestion pricing method using deep reinforcement learning (DRL). It is designed to eliminate traffic congestion based on observable data in general large-scale road networks, by leveraging the data-driven nature of deep reinforcement learning. One of the novel elements of the proposed method is the distributed and cooperative learning scheme. Specifically, the DRL is implemented in a spatially-temporally distributed manner, and cooperation among DRL agents is established by novel techniques we call spatially shared reward and temporally switching learning. This enables fast and computationally efficient learning in large-scale networks. The numerical experiments using the Sioux Falls network showed that the proposed method works well thanks to the novel learning scheme.
    The Digital Twin Landscape at the Crossroads of Predictive Maintenance, Machine Learning and Physics Based Modeling. (arXiv:2206.10462v2 [cs.LG] UPDATED)
    The concept of a digital twin has exploded in popularity over the past decade, yet confusion around its plurality of definitions, its novelty as a new technology, and its practical applicability still exists, all despite numerous reviews, surveys, and press releases. The history of the term digital twin is explored, as well as its initial context in the fields of product life cycle management, asset maintenance, and equipment fleet management, operations, and planning. A definition for a minimally viable framework to utilize a digital twin is also provided based on seven essential elements. A brief tour through DT applications and industries where DT methods are employed is also outlined. The application of a digital twin framework is highlighted in the field of predictive maintenance, and its extensions utilizing machine learning and physics based modeling. Employing the combination of machine learning and physics based modeling to form hybrid digital twin frameworks, may synergistically alleviate the shortcomings of each method when used in isolation. Key challenges of implementing digital twin models in practice are additionally discussed. As digital twin technology experiences rapid growth and as it matures, its great promise to substantially enhance tools and solutions for intelligent upkeep of complex equipment, are expected to materialize.
    Data Leakage in Federated Averaging. (arXiv:2206.12395v1 [cs.LG])
    Recent attacks have shown that user data can be reconstructed from FedSGD updates, thus breaking privacy. However, these attacks are of limited practical relevance as federated learning typically uses the FedAvg algorithm. It is generally accepted that reconstructing data from FedAvg updates is much harder than FedSGD as: (i) there are unobserved intermediate weight updates, (ii) the order of inputs matters, and (iii) the order of labels changes every epoch. In this work, we propose a new optimization-based attack which successfully attacks FedAvg by addressing the above challenges. First, we solve the optimization problem using automatic differentiation that forces a simulation of the client's update for the reconstructed labels and inputs so as to match the received client update. Second, we address the unknown input order by treating images at different epochs as independent during optimization, while relating them with a permutation invariant prior. Third, we reconstruct the labels by estimating the parameters of existing FedSGD attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate that on average we successfully reconstruct >45% of the client's images from realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5 images, compared to only <10% using the baseline. These findings indicate that many real-world federated learning implementations based on FedAvg are vulnerable.
    Predicting the Stability of Hierarchical Triple Systems with Convolutional Neural Networks. (arXiv:2206.12402v1 [astro-ph.EP])
    Understanding the long-term evolution of hierarchical triple systems is challenging due to its inherent chaotic nature, and it requires computationally expensive simulations. Here we propose a convolutional neural network model to predict the stability of hierarchical triples by looking at their evolution during the first $5 \times 10^5$ inner binary orbits. We employ the regularized few-body code TSUNAMI to simulate $5\times 10^6$ hierarchical triples, from which we generate a large training and test dataset. We develop twelve different network configurations that use different combinations of the triples' orbital elements and compare their performances. Our best model uses 6 time-series, namely, the semimajor axes ratio, the inner and outer eccentricities, the mutual inclination and the arguments of pericenter. This model achieves an area under the curve of over $95\%$ and informs of the relevant parameters to study triple systems stability. All trained models are made publicly available, allowing to predict the stability of hierarchical triple systems $200$ times faster than pure $N$-body methods.
    Multi-Exit Semantic Segmentation Networks. (arXiv:2106.03527v2 [cs.CV] UPDATED)
    Semantic segmentation arises as the backbone of many vision systems, spanning from self-driving cars and robot navigation to augmented reality and teleconferencing. Frequently operating under stringent latency constraints within a limited resource envelope, optimising for efficient execution becomes important. At the same time, the heterogeneous capabilities of the target platforms and diverse constraints of different applications require the design and training of multiple target-specific segmentation models, leading to excessive maintenance costs. To this end, we propose a framework for converting state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS) networks: specially trained models that employ parametrised early exits along their depth to i) dynamically save computation during inference on easier samples and ii) save training and maintenance cost by offering a post-training customisable speed-accuracy trade-off. Designing and training such networks naively can hurt performance. Thus, we propose a novel two-stage training scheme for multi-exit networks. Furthermore, the parametrisation of MESS enables co-optimising the number, placement and architecture of the attached segmentation heads along with the exit policy, upon deployment via exhaustive search in <1GPUh. This allows MESS to rapidly adapt to the device capabilities and application requirements for each target use-case, offering a train-once-deploy-everywhere solution. MESS variants achieve latency gains of up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same computational budget, compared to the original backbone network. Lastly, MESS delivers orders of magnitude faster architecture selection, compared to state-of-the-art techniques.
    Analyzing the impact of SARS-CoV-2 variants on respiratory sound signals. (arXiv:2206.12309v1 [eess.AS])
    The COVID-19 outbreak resulted in multiple waves of infections that have been associated with different SARS-CoV-2 variants. Studies have reported differential impact of the variants on respiratory health of patients. We explore whether acoustic signals, collected from COVID-19 subjects, show computationally distinguishable acoustic patterns suggesting a possibility to predict the underlying virus variant. We analyze the Coswara dataset which is collected from three subject pools, namely, i) healthy, ii) COVID-19 subjects recorded during the delta variant dominant period, and iii) data from COVID-19 subjects recorded during the omicron surge. Our findings suggest that multiple sound categories, such as cough, breathing, and speech, indicate significant acoustic feature differences when comparing COVID-19 subjects with omicron and delta variants. The classification areas-under-the-curve are significantly above chance for differentiating subjects infected by omicron from those infected by delta. Using a score fusion from multiple sound categories, we obtained an area-under-the-curve of 89% and 52.4% sensitivity at 95% specificity. Additionally, a hierarchical three-class approach was used to classify the acoustic data into healthy and COVID-19 positive, and further COVID-19 subjects into delta and omicron variants, providing a high level of 3-class classification accuracy. These results suggest new ways for designing sound based COVID-19 diagnosis approaches.
    MSR-NV: Neural Vocoder Using Multiple Sampling Rates. (arXiv:2109.13714v3 [eess.AS] UPDATED)
    The development of neural vocoders (NVs) has resulted in the high-quality and fast generation of waveforms. However, conventional NVs target a single sampling rate and require re-training when applied to different sampling rates. A suitable sampling rate varies from application to application due to the trade-off between speech quality and generation speed. In this study, we propose a method to handle multiple sampling rates in a single NV, called the MSR-NV. By generating waveforms step-by-step starting from a low sampling rate, MSR-NV can efficiently learn the characteristics of each frequency band and synthesize high-quality speech at multiple sampling rates. It can be regarded as an extension of the previously proposed NVs, and in this study, we extend the structure of Parallel WaveGAN (PWG). Experimental evaluation results demonstrate that the proposed method achieves remarkably higher subjective quality than the original PWG trained separately at 16, 24, and 48 kHz, without increasing the inference time. We also show that MSR-NV can leverage speech with lower sampling rates to further improve the quality of the synthetic speech.
    Provably Confidential Language Modelling. (arXiv:2205.01863v2 [cs.CL] UPDATED)
    Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality.
    Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. (arXiv:2106.04156v7 [cs.LG] UPDATED)
    Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while keeping negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings.
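    The objective mentioned here is compact enough to write out: L = -2·E[f(x)ᵀf(x⁺)] + E[(f(x)ᵀf(x′))²], with positives taken from augmentations of the same image. A batch-level PyTorch sketch (the paper's official implementation may handle batching and negatives differently):

    import torch

    def spectral_contrastive_loss(z1, z2):
        # z1, z2: (B, d) representations of two augmentations of the same batch.
        pos = (z1 * z2).sum(dim=1).mean()            # E[f(x)^T f(x+)]
        logits = z1 @ z2.T                           # all cross-pair inner products
        mask = ~torch.eye(z1.shape[0], dtype=torch.bool)  # drop positive diagonal
        neg = (logits[mask] ** 2).mean()             # E[(f(x)^T f(x'))^2]
        return -2.0 * pos + neg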
    How to train accurate BNNs for embedded systems?. (arXiv:2206.12322v1 [cs.LG])
    A key enabler of deploying convolutional neural networks on resource-constrained embedded systems is the binary neural network (BNN). BNNs save on memory and simplify computation by binarizing both features and weights. Unfortunately, binarization is inevitably accompanied by a severe decrease in accuracy. To reduce the accuracy gap between binary and full-precision networks, many repair methods have been proposed in the recent past, which we have classified and put into a single overview in this chapter. The repair methods are divided into two main branches, training techniques and network topology changes, which can further be split into smaller categories. The latter category introduces additional cost (energy consumption or additional area) for an embedded system, while the former does not. From our overview, we observe that progress has been made in reducing the accuracy gap, but BNN papers are not aligned on what repair methods should be used to get highly accurate BNNs. Therefore, this chapter contains an empirical review that evaluates the benefits of many repair methods in isolation over the ResNet-20 & CIFAR10 and ResNet-18 & CIFAR100 benchmarks. We found three repair categories most beneficial: feature binarizer, feature normalization, and double residual. Based on this review we discuss future directions and research opportunities. We sketch the benefit and costs associated with BNNs on embedded systems because it remains to be seen whether BNNs will be able to close the accuracy gap while staying highly energy-efficient on resource-constrained embedded systems.
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v2 [cs.LG] UPDATED)
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear controlled and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Moreover, we prove that GraphCON mitigates the exploding and vanishing gradients problem to facilitate training of deep multi-layer GNNs. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.
    SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. (arXiv:2206.12132v1 [eess.AS])
    In this paper, we present SANE-TTS, a stable and natural end-to-end multilingual TTS model. Given the difficulty of obtaining a multilingual corpus for a given speaker, training a multilingual TTS model with monolingual corpora is unavoidable. We introduce a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, as well as domain adversarial training, which is applied in other multilingual TTS models. Furthermore, in addition to the speaker regularization loss, replacing the speaker embedding with a zero vector in the duration predictor stabilizes cross-lingual inference. With this replacement, our model generates speeches with moderate rhythm regardless of source speaker in cross-lingual synthesis. In MOS evaluation, SANE-TTS achieves naturalness score above 3.80 both in cross-lingual and intralingual synthesis, where the ground truth score is 3.99. Also, SANE-TTS maintains speaker similarity close to that of ground truth even in cross-lingual inference. Audio samples are available on our web page.
    Achievement and Fragility of Long-term Equitability. (arXiv:2206.12333v1 [math.OC])
    Equipping current decision-making tools with notions of fairness, equitability, or other ethically motivated outcomes, is one of the top priorities in recent research efforts in machine learning, AI, and optimization. In this paper, we investigate how to allocate limited resources to locally interacting communities in a way to maximize a pertinent notion of equitability. In particular, we look at the dynamic setting where the allocation is repeated across multiple periods (e.g., yearly), the local communities evolve in the meantime (driven by the provided allocation), and the allocations are modulated by feedback coming from the communities themselves. We employ recent mathematical tools stemming from data-driven feedback online optimization, by which communities can learn their (possibly unknown) evolution and satisfaction, and share information with the deciding bodies. We design dynamic policies that converge to an allocation that maximizes equitability in the long term. We further demonstrate our model and methodology with realistic examples of healthcare and education subsidy design in Sub-Saharan countries. One of the key empirical takeaways from our setting is that long-term equitability is fragile, in the sense that it can be easily lost when deciding bodies weigh in other factors (e.g., equality in allocation) in the allocation strategy. Moreover, a naive compromise, while not providing significant advantage to the communities, can promote inequality in social outcomes.
    Animal Behavior Classification via Deep Learning on Embedded Systems. (arXiv:2111.12295v2 [cs.LG] UPDATED)
    We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.
    Using Autoencoders on Differentially Private Federated Learning GANs. (arXiv:2206.12270v1 [cs.LG])
    Machine learning has been applied to almost all fields of computer science over the past decades. The introduction of GANs allowed for new possibilities in fields of medical research and text prediction. However, these new fields work with ever more privacy-sensitive data. In order to maintain user privacy, a combination of federated learning, differential privacy and GANs can be used to work with private data without giving away users' privacy. Recently, two implementations of such combinations have been published: DP-Fed-Avg GAN and GS-WGAN. This paper compares their performance and introduces an alternative version of DP-Fed-Avg GAN that makes use of denoising techniques to combat the loss in accuracy that generally occurs when applying differential privacy and federated learning to GANs. We also compare the novel adaptation of denoised DP-Fed-Avg GAN to the state-of-the-art implementations in this field.
    NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. (arXiv:2104.02321v2 [eess.AS] CROSS LISTED)
    In this work, we introduce NU-Wave, the first neural audio upsampling model to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs, while prior works could generate only up to 16kHz. NU-Wave is the first diffusion probabilistic model for audio super-resolution which is engineered based on neural vocoders. NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the baseline models despite having a substantially smaller model capacity (3.0M parameters, only 5.4-21% of the baselines'). The audio samples of our model are available at https://mindslab-ai.github.io/nuwave, and the code will be made available soon.
    Score-based Generative Models for Calorimeter Shower Simulation. (arXiv:2206.11898v1 [hep-ph])
    Score-based generative models are a new class of generative algorithms that have been shown to produce realistic images even in high dimensional spaces, currently surpassing other state-of-the-art models for different benchmark categories and applications. In this work we introduce CaloScore, a score-based generative model for collider physics applied to calorimeter shower generation. Three different diffusion models are investigated using the Fast Calorimeter Simulation Challenge 2022 dataset. CaloScore is the first application of a score-based generative model in collider physics and is able to produce high-fidelity calorimeter images for all datasets, providing an alternative paradigm for calorimeter shower simulation.
    Federated learning: Applications, challenges and future directions. (arXiv:2205.09513v2 [cs.LG] UPDATED)
    Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection, communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.
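    The central-aggregator step that most of the surveyed systems build on (FedAvg) is just a sample-weighted average of client weights; a minimal NumPy sketch of that aggregation step:

    import numpy as np

    def fed_avg(client_weights, client_sizes):
        # client_weights: one list of numpy arrays per client (layer by layer);
        # client_sizes: number of local training samples per client.
        total = float(sum(client_sizes))
        avg = [np.zeros_like(w) for w in client_weights[0]]
        for weights, n in zip(client_weights, client_sizes):
            for i, w in enumerate(weights):
                avg[i] += (n / total) * w  # weight clients by data volume
        return avg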
    On Certifying and Improving Generalization to Unseen Domains. (arXiv:2206.12364v1 [cs.LG])
    Domain Generalization (DG) aims to learn models whose performance remains high on unseen domains encountered at test-time by using data from multiple related source domains. Many existing DG algorithms reduce the divergence between source distributions in a representation space to potentially align the unseen domain close to the sources. This is motivated by the analysis that explains generalization to unseen domains using distributional distance (such as the Wasserstein distance) to the sources. However, due to the openness of the DG objective, it is challenging to evaluate DG algorithms comprehensively using a few benchmark datasets. In particular, we demonstrate that the accuracy of the models trained with DG methods varies significantly across unseen domains, generated from popular benchmark datasets. This highlights that the performance of DG methods on a few benchmark datasets may not be representative of their performance on unseen domains in the wild. To overcome this roadblock, we propose a universal certification framework based on distributionally robust optimization (DRO) that can efficiently certify the worst-case performance of any DG method. This enables a data-independent evaluation of a DG method complementary to the empirical evaluations on benchmark datasets. Furthermore, we propose a training algorithm that can be used with any DG method to provably improve their certified performance. Our empirical evaluation demonstrates the effectiveness of our method at significantly improving the worst-case loss (i.e., reducing the risk of failure of these models in the wild) without incurring a significant performance drop on benchmark datasets.
    AdAUC: End-to-end Adversarial AUC Optimization Against Long-tail Problems. (arXiv:2206.12169v1 [cs.LG])
    It is well known that deep learning models are vulnerable to adversarial examples. Existing studies of adversarial training have made great progress against this challenge. As a typical trait, they often assume that the class distribution is overall balanced. However, long-tailed datasets are ubiquitous in a wide spectrum of applications, where the number of head-class instances is much larger than that of the tail classes. In such a scenario, AUC is a much more reasonable metric than accuracy since it is insensitive to the class distribution. Motivated by this, we present an early attempt to explore adversarial training methods that optimize AUC. The main challenge lies in the fact that the positive and negative examples are tightly coupled in the objective function. As a direct result, one cannot generate adversarial examples without a full scan of the dataset. To address this issue, based on a concavity regularization scheme, we reformulate the AUC optimization problem as a saddle-point problem, where the objective becomes an instance-wise function. This leads to an end-to-end training protocol. Furthermore, we provide a convergence guarantee for the proposed algorithm. Our analysis differs from existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem. Finally, extensive experimental results show the performance and robustness of our algorithm on three long-tailed datasets.
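    The coupling the authors describe is visible in the standard pairwise AUC surrogate, where every loss term ties a positive to a negative example. The sketch below shows that coupled objective (a squared hinge surrogate chosen for illustration), not the paper's instance-wise saddle-point reformulation.

        import torch

        def pairwise_auc_loss(scores, labels):
            """Pairwise squared-hinge AUC surrogate: penalizes negatives scored
            close to (or above) positives. Every term couples a positive with a
            negative, which is why per-instance adversarial examples cannot be
            generated without the kind of decoupling the paper proposes.
            Assumes the batch contains at least one positive and one negative."""
            pos = scores[labels == 1]
            neg = scores[labels == 0]
            diff = pos.unsqueeze(1) - neg.unsqueeze(0)  # all pos-neg pairs
            return torch.clamp(1.0 - diff, min=0).pow(2).mean()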
    Affinity-Aware Graph Networks. (arXiv:2206.11941v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Because they perform a relatively limited number of message-passing steps -- and hence have a smaller receptive field -- there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance and hitting and commute times. We propose message-passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has lower computational complexity, while our features are invariant to permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message-passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.
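    As a concrete instance of such an affinity measure (our sketch, not the authors' code), the effective resistance between all node pairs can be read off the Moore-Penrose pseudoinverse of the graph Laplacian, and commute times follow by scaling with twice the edge count.

        import numpy as np

        def effective_resistance(adj):
            """All-pairs effective resistance of an undirected graph, from the
            pseudoinverse of the Laplacian: R[i,j] = L+[i,i] + L+[j,j] - 2 L+[i,j]."""
            deg = np.diag(adj.sum(axis=1))
            L_pinv = np.linalg.pinv(deg - adj)
            d = np.diag(L_pinv)
            return d[:, None] + d[None, :] - 2 * L_pinv

        # Toy graph with edges (0,1), (0,2), (1,2), (2,3).
        adj = np.array([[0, 1, 1, 0],
                        [1, 0, 1, 0],
                        [1, 1, 0, 1],
                        [0, 0, 1, 0]], dtype=float)
        R = effective_resistance(adj)
        commute = adj.sum() * R  # commute time = 2 * |E| * R; adj.sum() = 2|E|

    Values like R[i, j] or commute[i, j] can then be attached to edges as permutation-invariant input features.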
    Multi-Agent Deep Reinforcement Learning for Cost- and Delay-Sensitive Virtual Network Function Placement and Routing. (arXiv:2206.12146v1 [cs.AI])
    This paper proposes an effective and novel multi-agent deep reinforcement learning (MADRL)-based method for solving the joint virtual network function (VNF) placement and routing (P&R) problem, where multiple service requests with differentiated demands are delivered at the same time. The differentiated demands of the service requests are reflected by their delay- and cost-sensitive factors. We first construct a VNF P&R problem to jointly minimize a weighted sum of service delay and resource consumption cost, which is NP-complete. Then, the joint VNF P&R problem is decoupled into two iterative subtasks: a placement subtask and a routing subtask. Each subtask consists of multiple concurrent parallel sequential decision processes. By invoking the deep deterministic policy gradient method and a multi-agent technique, an MADRL-P&R framework is designed to perform the two subtasks. A new joint-reward and internal-rewards mechanism is proposed to match the goals and constraints of the placement and routing subtasks. We also propose a parameter-migration-based model-retraining method to deal with changing network topologies. Corroborated by experiments, the proposed MADRL-P&R framework is superior to its alternatives in terms of service cost and delay, and offers higher flexibility for personalized service demands. The parameter-migration-based model-retraining method can efficiently accelerate convergence under moderate network topology changes.
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v1 [stat.ML])
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark image datasets. For (i), we compute the scaling of the generalization error with the number of training points and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for the deteriorating performance, which is known to be correlated with smoothness along diffeomorphisms.
    Mitigating Neural Network Overconfidence with Logit Normalization. (arXiv:2205.09310v2 [cs.LG] UPDATED)
    Detecting out-of-distribution inputs is critical for safe deployment of machine learning models in the real world. However, neural networks are known to suffer from the overconfidence issue, where they produce abnormally high confidence for both in- and out-of-distribution inputs. In this work, we show that this issue can be mitigated through Logit Normalization (LogitNorm) -- a simple fix to the cross-entropy loss -- by enforcing a constant vector norm on the logits in training. Our method is motivated by the analysis that the norm of the logit keeps increasing during training, leading to overconfident output. Our key idea behind LogitNorm is thus to decouple the influence of output's norm during network optimization. Trained with LogitNorm, neural networks produce highly distinguishable confidence scores between in- and out-of-distribution data. Extensive experiments demonstrate the superiority of LogitNorm, reducing the average FPR95 by up to 42.30% on common benchmarks.
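    The fix is small enough to show in full; the sketch below follows the paper's description (cross-entropy on logits divided by their L2 norm, scaled by a temperature), though the variable names and the default temperature value are ours.

        import torch
        import torch.nn.functional as F

        def logit_norm_loss(logits, targets, tau=0.04, eps=1e-7):
            """Cross-entropy on norm-constrained logits. Dividing by the logit
            vector's L2 norm keeps its magnitude constant during training,
            counteracting the ever-growing logit norms that drive
            overconfidence. tau is a tunable temperature (illustrative value)."""
            norms = logits.norm(p=2, dim=-1, keepdim=True) + eps
            return F.cross_entropy(logits / (norms * tau), targets)

    The loss is a drop-in replacement: at test time the network is used unchanged, but its confidence scores separate in- and out-of-distribution inputs far better.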
    Improved-Mask R-CNN: Towards an Accurate Generic MSK MRI instance segmentation platform (Data from the Osteoarthritis Initiative). (arXiv:2107.12889v2 [eess.IV] UPDATED)
    Objective assessment of Magnetic Resonance Imaging (MRI) scans of osteoarthritis (OA) can address the limitations of current OA assessment. Segmentation of bone, cartilage, and joint fluid is necessary for objective OA assessment. Most of the proposed segmentation methods do not perform instance segmentation and suffer from class imbalance problems. This study deployed Mask R-CNN instance segmentation and improved it (improved-Mask R-CNN (iMaskRCNN)) to obtain a more accurate generalized segmentation of OA-associated tissues. Training and validation of the method were performed using 500 MRI knees from the Osteoarthritis Initiative (OAI) dataset and 97 MRI scans of patients with symptomatic hip OA. Three modifications to Mask R-CNN yielded the iMaskRCNN: adding a second ROIAligned block, adding an extra decoder layer to the mask header, and connecting them with a skip connection. The results were assessed using Hausdorff distance, Dice score, and coefficients of variation (CoV). The iMaskRCNN led to improved bone and cartilage segmentation compared to Mask R-CNN, as indicated by the increase in Dice score from 95% to 98% for the femur, 95% to 97% for the tibia, 71% to 80% for femoral cartilage, and 81% to 82% for tibial cartilage. For effusion detection, the Dice score improved from 71% with Mask R-CNN to 72% with iMaskRCNN. The CoV values for effusion detection between Reader1 and Mask R-CNN (0.33), Reader1 and iMaskRCNN (0.34), Reader2 and Mask R-CNN (0.22), and Reader2 and iMaskRCNN (0.29) are close to the CoV between the two readers (0.21), indicating high agreement between the human readers and both Mask R-CNN and iMaskRCNN. Mask R-CNN and iMaskRCNN can reliably and simultaneously extract articular tissues at different scales that are involved in OA, forming the foundation for automated OA assessment. The iMaskRCNN results show that the modifications improved the network performance around the edges.
    Leverage Score Sampling for Tensor Product Matrices in Input Sparsity Time. (arXiv:2202.04515v2 [cs.LG] UPDATED)
    We propose an input sparsity time sampling algorithm that can spectrally approximate the Gram matrix corresponding to the $q$-fold column-wise tensor product of $q$ matrices using a nearly optimal number of samples, improving upon all previously known methods by poly$(q)$ factors. Furthermore, for the important special case of the $q$-fold self-tensoring of a dataset, which is the feature matrix of the degree-$q$ polynomial kernel, the leading term of our method's runtime is proportional to the size of the input dataset and has no dependence on $q$. Previous techniques either incur poly$(q)$ slowdowns in their runtime or remove the dependence on $q$ at the expense of having sub-optimal target dimension, and depend quadratically on the number of data-points in their runtime. Our sampling technique relies on a collection of $q$ partially correlated random projections which can be simultaneously applied to a dataset $X$ in total time that only depends on the size of $X$, and at the same time their $q$-fold Kronecker product acts as a near-isometry for any fixed vector in the column span of $X^{\otimes q}$. We also show that our sampling methods generalize to other classes of kernels beyond polynomial, such as Gaussian and Neural Tangent kernels.
    Geometric Policy Iteration for Markov Decision Processes. (arXiv:2206.05809v2 [cs.LG] UPDATED)
    Recently discovered polyhedral structures of the value function for finite state-action discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound $\mathcal{O}\left(\frac{|\mathcal{A}|}{1 - \gamma}\log \frac{1}{1-\gamma}\right)$ of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
    Towards Representative Subset Selection for Self-Supervised Speech Recognition. (arXiv:2203.09829v2 [cs.LG] UPDATED)
    Self-supervised speech recognition models require considerable labeled training data for learning high-fidelity representations for Automatic Speech Recognition (ASR), which is computationally demanding and time-consuming, thereby hindering the usage of these models in resource-constrained environments. We consider the task of identifying an optimal subset of data to train self-supervised speech models for ASR. We make a surprising observation that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection on the task of fine-tuning self-supervised ASR. We then present the COWERAGE algorithm for better subset selection in self-supervised ASR, which is based on our finding that ensuring the coverage of examples based on training Word Error Rate (WER) in the early training epochs leads to better generalization performance. Extensive experiments on the wav2vec 2.0 model and the TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE, with up to 17% absolute WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that the coverage of training instances in terms of WER ensures the inclusion of phonemically diverse examples, which leads to better test accuracy in self-supervised speech recognition models.
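    One minimal rendering of the coverage idea -- ours, so the details may differ from COWERAGE -- is to bucket training examples by their early-epoch WER and sample every bucket evenly, so the selected subset spans the full difficulty range.

        import numpy as np

        def coverage_subset(wers, budget, n_bins=10, seed=0):
            """Pick ~`budget` examples whose early-training WERs cover the full
            difficulty range: bin by WER quantiles, then sample each bin
            uniformly. Assumes budget >= n_bins; may return slightly fewer
            indices if some bins are sparse."""
            rng = np.random.default_rng(seed)
            edges = np.quantile(wers, np.linspace(0, 1, n_bins + 1)[1:-1])
            bins = np.digitize(wers, edges)
            per_bin = budget // n_bins
            chosen = []
            for b in range(n_bins):
                idx = np.flatnonzero(bins == b)
                take = min(per_bin, len(idx))
                chosen.extend(rng.choice(idx, size=take, replace=False))
            return np.array(chosen)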
    Property Unlearning: A Defense Strategy Against Property Inference Attacks. (arXiv:2205.08821v2 [cs.CR] UPDATED)
    During the training of machine learning models, they may store or "learn" more information about the training data than what is actually needed for the prediction or classification task. This is exploited by property inference attacks, which aim at extracting statistical properties from the training data of a given model without having access to the training data itself. These properties may include the quality of pictures to identify the camera model, the age distribution to reveal the target audience of a product, or the included host types to refine a malware attack in computer networks. This attack is especially accurate when the attacker has access to all model parameters, i.e., in a white-box scenario. By defending against such attacks, model owners are able to ensure that their training data, associated properties, and thus their intellectual property stay private, even if they deliberately share their models, e.g., to train collaboratively, or if models are leaked. In this paper, we introduce property unlearning, an effective defense mechanism against white-box property inference attacks, independent of the training data type, model task, or number of properties. Property unlearning mitigates property inference attacks by systematically changing the trained weights and biases of a target model such that an adversary cannot extract chosen properties. We empirically evaluate property unlearning on three different datasets, including tabular and image data, and two types of artificial neural networks. Our results show that property unlearning is both efficient and reliable in protecting machine learning models against property inference attacks, with a good privacy-utility trade-off. Furthermore, our approach indicates that this mechanism is also effective for unlearning multiple properties.
    Deep learning algorithms for solving high dimensional nonlinear backward stochastic differential equations. (arXiv:2010.01319v3 [math.NA] UPDATED)
    In this work, we propose a new deep learning-based scheme for solving high-dimensional nonlinear backward stochastic differential equations (BSDEs). The idea is to reformulate the problem as a global optimization in which local loss functions are included. Essentially, we approximate the unknown solution of a BSDE using a deep neural network and its gradient with automatic differentiation. The approximations are performed by globally minimizing the quadratic local loss function defined at each time step, which always includes the terminal condition. This kind of loss function is obtained by iterating the Euler discretization of the time integrals with the terminal condition. Our formulation can prompt the stochastic gradient descent algorithm not only to take the accuracy at each time layer into account, but also to converge to a good local minimum. To demonstrate the performance of our algorithm, several high-dimensional nonlinear BSDEs, including pricing problems in finance, are provided.
    Correlation Clustering via Strong Triadic Closure Labeling: Fast Approximation Algorithms and Practical Lower Bounds. (arXiv:2111.10699v2 [cs.DS] UPDATED)
    Correlation clustering is a widely studied framework for clustering based on pairwise similarity and dissimilarity scores, but its best approximation algorithms rely on impractical linear programming relaxations. We present faster approximation algorithms that avoid these relaxations, for two well-studied special cases: cluster editing and cluster deletion. We accomplish this by drawing new connections to edge labeling problems related to the principle of strong triadic closure. This leads to faster and more practical linear programming algorithms, as well as extremely scalable combinatorial techniques, including the first combinatorial approximation algorithm for cluster deletion. In practice, our algorithms produce approximate solutions that nearly match the best algorithms in quality, while scaling to problems that are orders of magnitude larger.
    Deep Reinforcement Learning Guided Graph Neural Networks for Brain Network Analysis. (arXiv:2203.10093v3 [cs.LG] UPDATED)
    Modern neuroimaging techniques, such as diffusion tensor imaging (DTI) and functional magnetic resonance imaging (fMRI), enable us to model the human brain as a brain network or connectome. Capturing brain networks' structural information and hierarchical patterns is essential for understanding brain functions and disease states. Recently, the promising network representation learning capability of graph neural networks (GNNs) has prompted many GNN-based methods for brain network analysis to be proposed. Specifically, these methods apply feature aggregation and global pooling to convert brain network instances into meaningful low-dimensional representations used for downstream brain network analysis tasks. However, existing GNN-based methods often neglect that brain networks of different subjects may require various aggregation iterations and use GNN with a fixed number of layers to learn all brain networks. Therefore, how to fully release the potential of GNNs to promote brain network analysis is still non-trivial. To solve this problem, we propose a novel brain network representation framework, namely BN-GNN, which searches for the optimal GNN architecture for each brain network. Concretely, BN-GNN employs deep reinforcement learning (DRL) to train a meta-policy to automatically determine the optimal number of feature aggregations (reflected in the number of GNN layers) required for a given brain network. Extensive experiments on eight real-world brain network datasets demonstrate that our proposed BN-GNN improves the performance of traditional GNNs on different brain network analysis tasks.
    Zero-shot Transfer Learning on Heterogeneous Graphs via Knowledge Transfer Networks. (arXiv:2203.02018v3 [cs.LG] UPDATED)
    Data continuously emitted from industrial ecosystems such as social or commerce platforms are commonly represented as heterogeneous graphs (HG) composed of multiple node/edge types. State-of-the-art graph learning methods for HGs known as heterogeneous graph neural networks (HGNNs) are applied to learn deep context-informed node representations. However, many HG datasets from industrial applications suffer from label imbalance between node types. As there is no direct way to learn using labels rooted at different node types, HGNNs have been applied to only a few node types with abundant labels. We propose a zero-shot transfer learning module for HGNNs called a Knowledge Transfer Network (KTN) that transfers knowledge from label-abundant node types to zero-labeled node types through rich relational information given in the HG. KTN is derived from the theoretical relationship, which we introduce in this work, between distinct feature extractors for each node type given in an HGNN model. KTN improves the performance of 6 different types of HGNN models by up to 960% for inference on zero-labeled node types and outperforms state-of-the-art transfer learning baselines by up to 73% across 18 different transfer learning tasks on HGs.
    A Mixed-Integer Programming Approach to Training Dense Neural Networks. (arXiv:2201.00723v2 [cs.LG] UPDATED)
    Artificial Neural Networks (ANNs) are prevalent machine learning models that are applied across various real-world classification tasks. However, training ANNs is time-consuming and the resulting models take a lot of memory to deploy. In order to train more parsimonious ANNs, we propose a novel mixed-integer programming (MIP) formulation for training fully-connected ANNs. Our formulations can account for both binary and rectified linear unit (ReLU) activations, and for the use of a log-likelihood loss. We present numerical experiments comparing our MIP-based methods against existing approaches and show that we are able to achieve competitive out-of-sample performance with more parsimonious models.
    Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning. (arXiv:2206.02465v2 [cs.LG] UPDATED)
    In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift. We propose a different approach named virtual homogeneity learning (VHL) to directly "rectify" the data heterogeneity. In particular, VHL conducts FL with a virtual homogeneous dataset crafted to satisfy two conditions: containing no private information and being separable. The virtual dataset can be generated from pure noise shared across clients, aiming to calibrate the features from the heterogeneous clients. Theoretically, we prove that VHL can achieve provable generalization performance on the natural distribution. Empirically, we demonstrate that VHL endows FL with drastically improved convergence speed and generalization performance. VHL is the first attempt to use a virtual dataset to address data heterogeneity, offering a new and effective means for FL.
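    Because the virtual dataset is generated from pure noise with shared randomness, every client can materialize an identical calibration set locally without exchanging data. A minimal sketch of that generation step follows (our reading; the construction, names, and the class-anchor trick are illustrative assumptions, not the paper's recipe).

        import torch

        def virtual_dataset(num_classes, per_class, dim, seed=1234):
            """Every client calls this with the same seed, so all clients hold an
            identical, separable noise dataset with no private information.
            One fixed Gaussian 'anchor' per class keeps the classes separable."""
            g = torch.Generator().manual_seed(seed)
            anchors = torch.randn(num_classes, dim, generator=g) * 5.0
            xs = anchors.repeat_interleave(per_class, dim=0) + \
                 torch.randn(num_classes * per_class, dim, generator=g)
            ys = torch.arange(num_classes).repeat_interleave(per_class)
            return xs, ys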
    Regret Bounds for Noise-Free Kernel-Based Bandits. (arXiv:2002.05096v2 [stat.ML] UPDATED)
    Kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seems order optimal, and provide a conjecture on the order-optimal regret bound.
    Efficient End-to-End AutoML via Scalable Search Space Decomposition. (arXiv:2206.09423v2 [cs.LG] UPDATED)
    End-to-end AutoML has attracted intense interest from both academia and industry; it automatically searches for ML pipelines in a space induced by feature engineering, algorithm/model selection, and hyper-parameter tuning. Existing AutoML systems, however, suffer from scalability issues when applied to domains with large, high-dimensional search spaces. We present VolcanoML, a scalable and extensible framework that facilitates systematic exploration of large AutoML search spaces. VolcanoML introduces and implements basic building blocks that decompose a large search space into smaller ones, and allows users to utilize these building blocks to compose an execution plan for the AutoML problem at hand. VolcanoML further supports a Volcano-style execution model -- akin to the one supported by modern database systems -- to execute the plan constructed. Our evaluation demonstrates that, not only does VolcanoML raise the level of expressiveness for search space decomposition in AutoML, it also leads to actual findings of decomposition strategies that are significantly more efficient than the ones employed by state-of-the-art AutoML systems such as auto-sklearn. This paper is the extended version of the initial VolcanoML paper that appeared in VLDB 2021.
    Inductive Biases and Variable Creation in Self-Attention Mechanisms. (arXiv:2110.10090v2 [cs.LG] UPDATED)
    Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.
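    For reference, the object under analysis -- a single self-attention head -- is only a few matrix products; the sketch below is a generic head in numpy, not the paper's bounded-norm construction.

        import numpy as np

        def attention_head(X, Wq, Wk, Wv):
            """One self-attention head: each position mixes the values of all
            positions, weighted by the softmax of scaled query-key similarity.
            X: (n, d) input sequence; Wq, Wk, Wv: projection matrices."""
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / np.sqrt(K.shape[1])
            weights = np.exp(scores - scores.max(axis=1, keepdims=True))
            weights /= weights.sum(axis=1, keepdims=True)
            return weights @ V

    The paper's result says that, with bounded weight norms, such a head behaves like a sparse function of the sequence: only a few positions receive non-negligible attention weight.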
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v3 [cs.LG] UPDATED)
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated with different dynamics, and learns to condition the dynamics model on contextual parameters specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. We theoretically motivate our approach and show state-of-the-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision. Code is available at https://github.com/yuan-yin/CoDA .
    Deep Reinforcement Learning for Optimal Power Flow with Renewables Using Graph Information. (arXiv:2112.11461v2 [cs.LG] UPDATED)
    Renewable energy resources (RERs) have been increasingly integrated into large-scale distributed power systems. Considering the uncertainties and voltage fluctuation issues introduced by RERs, in this paper we propose a deep reinforcement learning (DRL)-based strategy leveraging spatial-temporal (ST) graphical information of power systems to dynamically search for the optimal operation, i.e., optimal power flow (OPF), of power systems with a high uptake of RERs. Specifically, we formulate the OPF problem as a multi-objective optimization problem considering generation cost, voltage fluctuation, and transmission loss, and employ deep deterministic policy gradient (DDPG) to learn an optimal allocation strategy for OPF. Moreover, given that the nodes in power systems are self-correlated and interrelated in temporal and spatial views, we develop a multi-grained attention-based spatial-temporal graph convolution network (MG-ASTGCN) for extracting ST graphical correlations and features, aiming to provide prior knowledge of power systems for the sequential DDPG algorithm to solve OPF more effectively. We validate our algorithm on modified IEEE 33-, 69-, and 118-bus radial distribution systems and demonstrate that it outperforms other benchmark algorithms. Our experimental results also reveal that MG-ASTGCN can significantly accelerate DDPG's training process and improve its performance in solving OPF.
    Turning Your Strength against You: Detecting and Mitigating Robust and Universal Adversarial Patch Attacks. (arXiv:2108.05075v3 [cs.CR] UPDATED)
    Adversarial patch attacks, which inject arbitrary distortions within a bounded region of an image, can trigger misclassification in deep neural networks (DNNs). These attacks are robust (i.e., physically realizable) and universally malicious, and hence represent a severe security threat to real-world DNN-based systems. This work proposes Jujutsu, a two-stage technique to detect and mitigate robust and universal adversarial patch attacks. We first observe that patch attacks often exert a large influence on the prediction output in order to dominate the prediction on any input, and Jujutsu is built to expose this behavior for effective attack detection. For mitigation, we observe that patch attacks corrupt only a localized region while the remaining contents are unperturbed, based on which Jujutsu leverages GAN-based image inpainting to synthesize the semantic contents in the pixels corrupted by the attacks and reconstruct the ``clean'' image for correct prediction. We evaluate Jujutsu on four diverse datasets and show that it achieves superior performance and significantly outperforms four leading defenses. Jujutsu can further defend against physical-world attacks, attacks that target diverse classes, and adaptive attacks. Our code is available at https://github.com/DependableSystemsLab/Jujutsu.
    Bugs in Machine Learning-based Systems: A Faultload Benchmark. (arXiv:2206.12311v1 [cs.SE])
    The rapid escalation of applying Machine Learning (ML) in various domains has led to more attention being paid to the quality of ML components. There is thus a growing body of techniques and tools aimed at improving the quality of ML components and integrating them safely into ML-based systems. Although most of these tools rely on the lifecycle of bugs, there is no standard benchmark of bugs with which to assess their performance, compare them, and discuss their advantages and weaknesses. In this study, we first investigate the reproducibility and verifiability of bugs in ML-based systems and show the most important factors for each. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark, named defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 113 bugs reported by ML developers on GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, such as: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers for assessing their testing tools and techniques.
    GNNSampler: Bridging the Gap between Sampling Algorithms of GNN and Hardware. (arXiv:2108.11571v2 [cs.LG] UPDATED)
    Sampling is a critical operation in Graph Neural Network (GNN) training that helps reduce cost. Previous literature has explored improving sampling algorithms via mathematical and statistical methods. However, there is a gap between sampling algorithms and hardware. Without considering hardware, algorithm designers merely optimize sampling at the algorithm level, missing the great potential of improving the efficiency of existing sampling algorithms by leveraging hardware features. In this paper, we are the first to propose a unified programming model for mainstream sampling algorithms, termed GNNSampler, covering the critical processes of sampling algorithms in various categories. Second, to leverage hardware features, we choose data locality as a case study and explore the locality among nodes and their neighbors in a graph to alleviate irregular memory access in sampling. Third, we implement locality-aware optimizations in GNNSampler for various sampling algorithms to optimize the general sampling process. Finally, we conduct extensive experiments on large graph datasets to analyze the relevance among training time, accuracy, and hardware-level metrics. The experiments show that our method is universal to mainstream sampling algorithms and helps significantly reduce training time, especially on large-scale graphs.
    Cluster Attack: Query-based Adversarial Attacks on Graphs with Graph-Dependent Priors. (arXiv:2109.13069v2 [cs.LG] UPDATED)
    While deep neural networks have achieved great success in graph analysis, recent work has shown that they are vulnerable to adversarial attacks. Compared with adversarial attacks on image classification, performing adversarial attacks on graphs is more challenging because of the discrete and non-differentiable nature of the adjacency matrix. In this work, we propose Cluster Attack -- a Graph Injection Attack (GIA) on node classification, which injects fake nodes into the original graph to degrade the performance of graph neural networks (GNNs) on certain victim nodes while affecting the other nodes as little as possible. We demonstrate that a GIA problem can be equivalently formulated as a graph clustering problem; thus, the discrete optimization problem over the adjacency matrix can be solved in the context of graph clustering. In particular, we propose to measure the similarity between victim nodes by a metric of Adversarial Vulnerability, which is related to how the victim nodes will be affected by the injected fake node, and to cluster the victim nodes accordingly. Our attack is performed in a practical and unnoticeable query-based black-box manner, with access to only a few nodes on the graph. Theoretical analysis and extensive experiments demonstrate the effectiveness of our method in fooling node classifiers with only a small number of queries.
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v3 [stat.ML] UPDATED)
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on an FGW barycenter whose weights depend on inputs. First, we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and an excess risk bound are proved. Next, we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem, where it can reach very good performance with very little engineering.
    Out of distribution robustness with pre-trained Bayesian neural networks. (arXiv:2206.12361v1 [cs.LG])
    We develop ShiftMatch, a new training-data-dependent likelihood for out-of-distribution (OOD) robustness in Bayesian neural networks (BNNs). ShiftMatch is inspired by the training-data-dependent "EmpCov" priors from Izmailov et al. (2021a) and efficiently matches test-time spatial correlations to those at training time. Critically, ShiftMatch is designed to leave neural network training unchanged, allowing it to use publicly available samples from pretrained BNNs. Using pre-trained HMC samples, ShiftMatch gives strong performance improvements on CIFAR-10-C, outperforms EmpCov priors, and is perhaps the first Bayesian method capable of convincingly outperforming plain deep ensembles. ShiftMatch can be integrated with non-Bayesian methods like deep ensembles, where it offers smaller, but still considerable, performance improvements. Overall, Bayesian ShiftMatch gave slightly better accuracy than ensembles with ShiftMatch, though both had very similar log-likelihoods.
    Adversarially Robust Models may not Transfer Better: Sufficient Conditions for Domain Transferability from the View of Regularization. (arXiv:2202.01832v2 [cs.LG] UPDATED)
    Machine learning (ML) robustness and domain generalization are fundamentally correlated: they essentially concern data distribution shifts under adversarial and natural settings, respectively. On one hand, recent studies show that more robust (adversarially trained) models are more generalizable. On the other hand, there is a lack of theoretical understanding of their fundamental connections. In this paper, we explore the relationship between regularization and domain transferability considering different factors such as norm regularization and data augmentations (DA). We propose a general theoretical framework proving that factors involving the model function class regularization are sufficient conditions for relative domain transferability. Our analysis implies that ``robustness'' is neither necessary nor sufficient for transferability; rather, regularization is a more fundamental perspective for understanding domain transferability. We then discuss popular DA protocols (including adversarial training) and show when they can be viewed as the function class regularization under certain conditions and therefore improve generalization. We conduct extensive experiments to verify our theoretical findings and show several counterexamples where robustness and generalization are negatively correlated on different datasets.
    Channel Estimation for RIS-Empowered Multi-User MISO Wireless Communications. (arXiv:2008.01459v2 [cs.IT] UPDATED)
    Reconfigurable Intelligent Surfaces (RISs) have recently been considered as an energy-efficient solution for future wireless networks due to their fast and low-power configuration, which has increased potential in enabling massive connectivity and low-latency communications. Accurate and low-overhead channel estimation in RIS-based systems is one of the most critical challenges due to the usually large number of RIS unit elements and their distinctive hardware constraints. In this paper, we focus on the uplink of RIS-empowered multi-user Multiple Input Single Output (MISO) communication systems and propose a channel estimation framework based on the parallel factor decomposition to unfold the resulting cascaded channel model. We present two iterative estimation algorithms for the channels between the base station and RIS, as well as the channels between RIS and users. One is based on alternating least squares (ALS), while the other uses vector approximate message passing to iteratively reconstruct two unknown channels from the estimated vectors. To theoretically assess the performance of the ALS-based algorithm, we derive its estimation Cram\'er-Rao Bound (CRB). We also discuss the downlink achievable sum-rate computation with estimated channels and different precoding schemes for the base station. Our extensive simulation results show that our algorithms outperform benchmark schemes and that the ALS technique achieves the CRB. It is also demonstrated that the sum rate using the estimated channels always reaches that of perfect channels under various settings, thus verifying the effectiveness and robustness of the proposed estimation algorithms.
    Source Localization of Graph Diffusion via Variational Autoencoders for Graph Inverse Problems. (arXiv:2206.12327v1 [cs.LG])
    Graph diffusion problems such as the propagation of rumors, computer viruses, or smart grid failures are ubiquitous and societal. Hence it is usually crucial to identify diffusion sources according to the current graph diffusion observations. Despite its tremendous necessity and significance in practice, source localization, as the inverse problem of graph diffusion, is extremely challenging as it is ill-posed: different sources may lead to the same graph diffusion patterns. Different from most traditional source localization methods, this paper adopts a probabilistic manner to account for the uncertainty of different candidate sources. Such endeavors require overcoming several challenges: 1) the uncertainty in graph diffusion source localization is hard to quantify; 2) the complex patterns of graph diffusion sources are difficult to characterize probabilistically; 3) generalization under arbitrary underlying diffusion patterns is hard to impose. To solve these challenges, this paper presents a generic framework, the Source Localization Variational AutoEncoder (SL-VAE), for locating diffusion sources under arbitrary diffusion patterns. In particular, we propose a probabilistic model that leverages the forward diffusion estimation model along with deep generative models to approximate the diffusion source distribution for quantifying the uncertainty. SL-VAE further utilizes prior knowledge of the source-observation pairs to characterize the complex patterns of diffusion sources via a learned generative prior. Lastly, a unified objective that integrates the forward diffusion estimation model is derived to enforce the model to generalize under arbitrary diffusion patterns. Extensive experiments are conducted on 7 real-world datasets to demonstrate the superiority of SL-VAE in reconstructing diffusion sources, outperforming other methods by 20% on average in AUC score.
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v4 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationships among them have not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems, including all the problems mentioned above along with some new ones, can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.
    HANF: Hyperparameter And Neural Architecture Search in Federated Learning. (arXiv:2206.12342v1 [cs.LG])
    Automated machine learning (AutoML) is an important step toward making machine learning models widely applicable to real-world problems. Despite numerous research advances, machine learning methods are not fully utilized by industry, mainly due to data privacy and security regulations, the high cost of storing and computing on increasing amounts of data at a central location, and, most importantly, a lack of expertise. Hence, we introduce a novel framework, HANF -- $\textbf{H}$yperparameter $\textbf{A}$nd $\textbf{N}$eural architecture search in $\textbf{F}$ederated learning -- as a step towards building an AutoML framework for data distributed across several data-owner servers without any need to bring the data to a central location. HANF jointly optimizes a neural architecture and the non-architectural hyperparameters of a learning algorithm, using gradient-based neural architecture search and an $n$-armed bandit approach, respectively, in the distributed-data setting. We show that HANF efficiently finds an optimized neural architecture and also tunes the hyperparameters on the data-owner servers. Additionally, HANF can be applied in both federated and non-federated settings. Empirically, we show that HANF converges towards well-suited architectures and non-architectural hyperparameter sets using image-classification tasks.
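    The bandit half of such a search can be pictured as a plain UCB1 loop over a discrete grid of hyperparameter settings (our sketch; HANF's actual bandit strategy may differ, and train_and_score is a hypothetical evaluation callback returning a reward in [0, 1]).

        import math

        def ucb_tune(arms, evaluate, rounds=50):
            """UCB1 over discrete hyperparameter settings: arms with high upper
            confidence bounds (mean reward + exploration bonus) get tried more."""
            counts = [0] * len(arms)
            sums = [0.0] * len(arms)
            for t in range(1, rounds + 1):
                if t <= len(arms):          # play every arm once first
                    i = t - 1
                else:
                    i = max(range(len(arms)), key=lambda a:
                            sums[a] / counts[a]
                            + math.sqrt(2 * math.log(t) / counts[a]))
                r = evaluate(arms[i])       # e.g., validation accuracy
                counts[i] += 1
                sums[i] += r
            return max(range(len(arms)), key=lambda a: sums[a] / max(counts[a], 1))

        # e.g.: best = ucb_tune([{"lr": 1e-3}, {"lr": 1e-2}], train_and_score)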
    A Framework of Inertial Alternating Direction Method of Multipliers for Non-Convex Non-Smooth Optimization. (arXiv:2102.05433v2 [math.OC] UPDATED)
    In this paper, we propose an algorithmic framework, dubbed inertial alternating direction methods of multipliers (iADMM), for solving a class of nonconvex nonsmooth multiblock composite optimization problems with linear constraints. Our framework employs the general majorization-minimization (MM) principle to update each block of variables so as to not only unify the convergence analysis of previous ADMM schemes that use specific surrogate functions in the MM step, but also lead to new efficient ADMM schemes. To the best of our knowledge, in the nonconvex nonsmooth setting, ADMM used in combination with the MM principle to update each block of variables, and ADMM combined with \emph{inertial terms for the primal variables}, have not been studied in the literature. Under standard assumptions, we prove the subsequential convergence and global convergence of the generated sequence of iterates. We illustrate the effectiveness of iADMM on a class of nonconvex low-rank representation problems.
    ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings. (arXiv:2206.12403v1 [cs.CV])
    We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% - 20.0% over existing zero-shot methods. For reference, these gains are similar or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we discover that our agents can generalize to compound instructions with a room explicitly mentioned (e.g., "Find a kitchen sink") and when the target room can be inferred (e.g., "Find a sink and a stove").
    Hard hat wearing detection based on head keypoint localization. (arXiv:2106.10944v2 [cs.CV] UPDATED)
    In recent years, much attention has been paid to deep learning methods in the context of vision-based construction site safety systems, especially regarding personal protective equipment. However, despite all this attention, there is still no reliable way to establish the relationship between workers and their hard hats. To address this problem, a combination of deep learning -- object detection and head keypoint localization -- with simple rule-based reasoning is proposed in this article. In tests, this solution surpassed previous methods based on the relative bounding-box positions of different instances, as well as direct detection of hard-hat wearers and non-wearers. The results show that the conjunction of novel deep learning methods with humanly interpretable rule-based systems can result in a solution that is both reliable and can successfully mimic manual, on-site supervision. This work is the next step in the development of fully autonomous construction site safety systems and shows that there is still room for improvement in this area.
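    The rule-based step admits a compact rendering; in the illustration below (ours, with an assumed margin parameter), a worker counts as wearing a hard hat when the detected head keypoint falls inside some detected hard-hat bounding box.

        def wearing_hard_hat(head_kp, hat_boxes, margin=0.1):
            """head_kp: (x, y) head keypoint from the keypoint localizer;
            hat_boxes: list of (x1, y1, x2, y2) hard-hat detections.
            True if the keypoint lies inside any slightly enlarged hat box."""
            x, y = head_kp
            for x1, y1, x2, y2 in hat_boxes:
                mx, my = margin * (x2 - x1), margin * (y2 - y1)
                if x1 - mx <= x <= x2 + mx and y1 - my <= y <= y2 + my:
                    return True
            return False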
    Empirical and Instance-Dependent Estimation of Markov Chain and Mixing Time. (arXiv:1912.06845v3 [math.PR] UPDATED)
    We tackle the problem of estimating the mixing time of a Markov chain from a single trajectory of observations. In contrast with previous works, which considered Hilbert space methods to estimate spectral gaps, we opt for an approach based on contraction with respect to total variation. Specifically, we define and estimate a generalized contraction coefficient based on Dobrushin's classical one. We show that this quantity -- unlike the spectral gap -- controls the mixing time up to strong universal constants and remains valid for non-reversible chains. We design fully data-dependent confidence intervals around the coefficient, which are both easier to compute and thinner than their spectral counterparts. Furthermore, we initiate a beyond-worst-case analysis by showing how to leverage additional information about the transition matrix in order to obtain instance-dependent rates for its estimation with respect to the induced uniform norm, as well as some of its mixing properties.
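    For a chain with a known transition matrix, the classical Dobrushin coefficient is a one-liner; the paper's contribution is estimating a generalized version from a single trajectory, which this sketch does not attempt.

        import numpy as np

        def dobrushin_coefficient(P):
            """Classical Dobrushin contraction coefficient of a row-stochastic
            transition matrix: the maximum total-variation distance between any
            two rows. A value below 1 certifies geometric mixing, including for
            non-reversible chains."""
            n = P.shape[0]
            return max((0.5 * np.abs(P[i] - P[j]).sum()
                        for i in range(n) for j in range(i + 1, n)),
                       default=0.0)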
    Socially-Compatible Behavior Design of Autonomous Vehicles with Verification on Real Human Data. (arXiv:2010.14712v8 [cs.RO] UPDATED)
    As more and more autonomous vehicles (AVs) are being deployed on public roads, designing socially compatible behaviors for them is becoming increasingly important. In order to generate safe and efficient actions, AVs need to not only predict the future behaviors of other traffic participants, but also be aware of the uncertainties associated with such behavior prediction. In this paper, we propose an uncertainty-aware integrated prediction and planning (UAPP) framework. It allows the AVs to infer the characteristics of other road users online and generate behaviors optimizing not only their own rewards, but also their courtesy to others and their confidence regarding the prediction uncertainties. We first propose definitions for courtesy and confidence. Based on these, their influences on the behaviors of AVs in interactive driving scenarios are explored. Moreover, we evaluate the proposed algorithm on naturalistic human driving data by comparing the generated behavior against ground truth. Results show that the online inference can significantly improve the human-likeness of the generated behaviors. Furthermore, we find that human drivers show great courtesy to others, even to those without right-of-way. We also find that such driving preferences vary significantly across cultures.
    How many labelers do you have? A closer look at gold-standard labels. (arXiv:2206.12041v1 [math.ST])
    The construction of most supervised learning datasets revolves around collecting multiple labels for each instance and then aggregating the labels to form a type of ``gold standard''. We question the wisdom of this pipeline by developing a (stylized) theoretical model of this process and analyzing its statistical consequences, showing how access to non-aggregated label information can make training well-calibrated models easier or -- in some cases -- even feasible, whereas it is impossible with only gold-standard labels. The entire story, however, is subtle, and the contrasts between aggregated and fuller label information depend on the particulars of the problem: estimators that use aggregated information exhibit robust but slower rates of convergence, while estimators that can effectively leverage all labels converge more quickly if they have fidelity to (or can learn) the true labeling process. The theory we develop in the stylized model makes several predictions for real-world datasets, including when non-aggregate labels should improve learning performance, and we test these predictions to corroborate their validity.
    Quantifying Inherent Randomness in Machine Learning Algorithms. (arXiv:2206.12353v1 [stat.ML])
    Most machine learning (ML) algorithms have several stochastic elements, and their performances are affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs compared to tree-based methods. This is to be expected as FFNNs have more stochastic elements that are part of their model initialization and training. We also found that random splitting of datasets leads to higher variation compared to the inherent randomness from model training. The variation from data splitting can be a major issue if the original dataset has considerable heterogeneity. Keywords: Model Training, Reproducibility, Variation
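    The two sources of randomness can be isolated in a few lines of scikit-learn (a sketch in the spirit of the study, not its code): fix the split and vary the training seed, then fix the training seed and vary the split.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, random_state=0)

        # Source 1: training randomness -- fixed split, varying model seed.
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
        train_scores = [RandomForestClassifier(random_state=s)
                        .fit(Xtr, ytr).score(Xte, yte) for s in range(10)]

        # Source 2: splitting randomness -- fixed model seed, varying split.
        split_scores = []
        for s in range(10):
            Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=s)
            split_scores.append(RandomForestClassifier(random_state=0)
                                .fit(Xtr, ytr).score(Xte, yte))

        print("train-seed std:", np.std(train_scores))
        print("split std:     ", np.std(split_scores))

    On heterogeneous datasets the second standard deviation tends to dominate, which is the paper's main observation.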
    Data-driven discovery of novel 2D materials by deep generative models. (arXiv:2206.12159v1 [cond-mat.mtrl-sci])
    Efficient algorithms to generate candidate crystal structures with good stability properties can play a key role in data-driven materials discovery. Here we show that a crystal diffusion variational autoencoder (CDVAE) is capable of generating two-dimensional (2D) materials of high chemical and structural diversity and formation energies mirroring the training structures. Specifically, we train the CDVAE on 2615 2D materials with energy above the convex hull $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom, and generate 5003 materials that we relax using density functional theory (DFT). We also generate 14192 new crystals by systematic element substitution of the training structures. We find that the generative model and lattice decoration approach are complementary and yield materials with similar stability properties but very different crystal structures and chemical compositions. In total we find 11630 predicted new 2D materials, where 8599 of these have $\Delta H_{\mathrm{hull}}< 0.3$ eV/atom as the seed structures, while 2004 are within 50 meV of the convex hull and could potentially be synthesized. The relaxed atomic structures of all the materials are available in the open Computational 2D Materials Database (C2DB). Our work establishes the CDVAE as an efficient and reliable crystal generation machine, and significantly expands the space of 2D materials.
    Set Norm and Equivariant Skip Connections: Putting the Deep in Deep Sets. (arXiv:2206.11925v1 [cs.LG])
    Permutation invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. Additionally, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information useful for prediction. To address these issues, we introduce the clean path principle for equivariant residual connections and develop set norm, a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that reach high depths with comparable or better performance than their original counterparts on a diverse suite of tasks. We additionally introduce Flow-RBC, a new single-cell dataset and real-world application of permutation invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permutation_invariant.
    Physically Consistent Learning of Conservative Lagrangian Systems with Gaussian Processes. (arXiv:2206.12272v1 [cs.LG])
    This paper proposes a physically consistent Gaussian Process (GP) enabling the identification of uncertain Lagrangian systems. The function space is tailored according to the energy components of the Lagrangian and the differential equation structure, analytically guaranteeing physical and mathematical properties such as energy conservation and quadratic form. The novel formulation of Cholesky-decomposed matrix kernels allows the probabilistic preservation of positive definiteness. Only differential input-to-output measurements of the function map are required, while Gaussian noise is permitted in torques, velocities, and accelerations. We demonstrate the effectiveness of the approach in numerical simulation.
    Three Applications of Conformal Prediction for Rating Breast Density in Mammography. (arXiv:2206.12008v1 [eess.IV])
Breast cancer is among the most common cancers, and early detection through mammography screening is crucial for improving patient outcomes. Assessing mammographic breast density is clinically important, as denser breasts carry higher risk and are more likely to occlude tumors. Manual assessment by experts is both time-consuming and subject to inter-rater variability. As such, there has been increased interest in the development of deep learning methods for mammographic breast density assessment. Despite deep learning having demonstrated impressive performance in several prediction tasks for applications in mammography, clinical deployment of deep learning systems is still relatively rare; historically, mammography Computer-Aided Diagnoses (CAD) have over-promised and failed to deliver. This is in part due to the inability to intuitively quantify the uncertainty of the algorithm for the clinician, which would greatly enhance usability. Conformal prediction is well suited to increasing reliability and trust in deep learning tools, but such methods lack realistic evaluations on medical datasets. In this paper, we present a detailed analysis of three possible applications of conformal prediction applied to medical imaging tasks: distribution shift characterization, prediction quality improvement, and subgroup fairness analysis. Our results show the potential of distribution-free uncertainty quantification techniques to enhance trust in AI algorithms and expedite their translation to usage.
On the Limitations of Elo: Real-World Games are Transitive, not Additive. (arXiv:2206.12301v1 [cs.GT])
    Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these games are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.
    Deep Stable neural networks: large-width asymptotics and convergence rates. (arXiv:2108.02316v2 [cs.LG] UPDATED)
    In modern deep learning, there is a recent and growing literature on the interplay between large-width asymptotic properties of deep Gaussian neural networks (NNs), i.e. deep NNs with Gaussian-distributed weights, and Gaussian stochastic processes (SPs). Such an interplay has proved to be critical in Bayesian inference under Gaussian SP priors, kernel regression for infinitely wide deep NNs trained via gradient descent, and information propagation within infinitely wide NNs. Motivated by empirical analyses that show the potential of replacing Gaussian distributions with Stable distributions for the NN's weights, in this paper we present a rigorous analysis of the large-width asymptotic behaviour of (fully connected) feed-forward deep Stable NNs, i.e. deep NNs with Stable-distributed weights. We show that as the width goes to infinity jointly over the NN's layers, i.e. the ``joint growth" setting, a rescaled deep Stable NN converges weakly to a Stable SP whose distribution is characterized recursively through the NN's layers. Because of the non-triangular structure of the NN, this is a non-standard asymptotic problem, to which we propose an inductive approach of independent interest. Then, we establish sup-norm convergence rates of the rescaled deep Stable NN to the Stable SP, under the ``joint growth" and a ``sequential growth" of the width over the NN's layers. Such a result provides the difference between the ``joint growth" and the ``sequential growth" settings, showing that the former leads to a slower rate than the latter, depending on the depth of the layer and the number of inputs of the NN. Our work extends some recent results on infinitely wide limits for deep Gaussian NNs to the more general deep Stable NNs, providing the first result on convergence rates in the ``joint growth" setting.
    Indecision Trees: Learning Argument-Based Reasoning under Quantified Uncertainty. (arXiv:2206.12252v1 [cs.LG])
    Using Machine Learning systems in the real world can often be problematic, with inexplicable black-box models, the assumed certainty of imperfect measurements, or providing a single classification instead of a probability distribution. This paper introduces Indecision Trees, a modification to Decision Trees which learn under uncertainty, can perform inference under uncertainty, provide a robust distribution over the possible labels, and can be disassembled into a set of logical arguments for use in other reasoning systems.
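To make the shift from a single classification to a distribution over labels concrete, here is a toy sketch of inference under input uncertainty in the spirit of the abstract; the hand-built tree, Gaussian noise model, and soft-routing rule are illustrative assumptions, not the authors' algorithm.

```python
# Sketch: at each split, an uncertain (Gaussian) feature value routes
# probability mass to BOTH children via the normal CDF, so the output
# is a distribution over labels rather than a single class.
from dataclasses import dataclass
from math import erf, sqrt
import numpy as np

@dataclass
class Node:
    feature: int = -1              # -1 marks a leaf
    threshold: float = 0.0
    left: "Node" = None
    right: "Node" = None
    label_dist: np.ndarray = None  # leaf: distribution over labels

def prob_below(mu, sigma, t):
    # P(x < t) for x ~ N(mu, sigma^2)
    return 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))

def soft_predict(node, mu, sigma):
    if node.feature < 0:
        return node.label_dist
    p = prob_below(mu[node.feature], sigma[node.feature], node.threshold)
    return p * soft_predict(node.left, mu, sigma) + \
           (1 - p) * soft_predict(node.right, mu, sigma)

leaf_a = Node(label_dist=np.array([0.9, 0.1]))
leaf_b = Node(label_dist=np.array([0.2, 0.8]))
root = Node(feature=0, threshold=0.0, left=leaf_a, right=leaf_b)

# A measurement near the threshold yields a genuinely mixed distribution.
print(soft_predict(root, mu=np.array([0.05]), sigma=np.array([0.5])))
```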
    Content Popularity Prediction Based on Quantized Federated Bayesian Learning in Fog Radio Access Networks. (arXiv:2206.12258v1 [cs.LG])
In this paper, we investigate the content popularity prediction problem in cache-enabled fog radio access networks (F-RANs). In order to predict the content popularity with high accuracy and low complexity, we propose a Gaussian process based regressor to model the content request pattern. Firstly, the relationship between content features and popularity is captured by our proposed model. Then, we utilize Bayesian learning to train the model parameters, which is robust to overfitting. However, Bayesian methods are usually unable to find a closed-form expression of the posterior distribution. To tackle this issue, we apply a stochastic variance reduced gradient Hamiltonian Monte Carlo (SVRG-HMC) method to approximate the posterior distribution. To utilize the computing resources of other fog access points (F-APs) and to reduce the communications overhead, we propose a quantized federated learning (FL) framework combined with Bayesian learning. The quantized federated Bayesian learning framework allows each F-AP to send gradients to the cloud server after quantizing and encoding. It effectively achieves a tradeoff between prediction accuracy and communications overhead. Simulation results show that our proposed policy outperforms the existing policies.
    End-to-End Text-to-Speech Based on Latent Representation of Speaking Styles Using Spontaneous Dialogue. (arXiv:2206.12040v1 [eess.AS])
Recent text-to-speech (TTS) systems have achieved quality comparable to that of humans; however, their application in spoken dialogue has not been widely studied. This study aims to realize a TTS that closely resembles human dialogue. First, we record and transcribe actual spontaneous dialogues. Then, the proposed dialogue TTS is trained in two stages: in the first stage, a variational autoencoder (VAE)-VITS or Gaussian mixture variational autoencoder (GMVAE)-VITS is trained, which introduces an utterance-level latent variable into variational inference with adversarial learning for end-to-end text-to-speech (VITS), a recently proposed end-to-end TTS model. A style encoder that extracts a latent speaking style representation from speech is trained jointly with the TTS. In the second stage, a style predictor is trained to predict the speaking style to be synthesized from dialogue history. During inference, by passing the speaking style representation predicted by the style predictor to VAE/GMVAE-VITS, speech can be synthesized in a style appropriate to the context of the dialogue. Subjective evaluation results demonstrate that the proposed method outperforms the original VITS in terms of dialogue-level naturalness.
    Symbolic-Regression Boosting. (arXiv:2206.12082v1 [cs.NE])
    Modifying standard gradient boosting by replacing the embedded weak learner in favor of a strong(er) one, we present SyRBo: Symbolic-Regression Boosting. Experiments over 98 regression datasets show that by adding a small number of boosting stages -- between 2--5 -- to a symbolic regressor, statistically significant improvements can often be attained. We note that coding SyRBo on top of any symbolic regressor is straightforward, and the added cost is simply a few more evolutionary rounds. SyRBo is essentially a simple add-on that can be readily added to an extant symbolic regressor, often with beneficial results.
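A minimal sketch of the boosting scheme described here, with each stage fitting the residuals of the running prediction; gplearn's SymbolicRegressor stands in for "a symbolic regressor" (an assumption for concreteness; any symbolic regressor would do, and the stage count follows the paper's 2-5 range):

```python
# Sketch of boosting with a strong(er) embedded learner: each stage
# fits the residuals left by the sum of previous stages. Illustrative;
# not the authors' exact configuration.
import numpy as np
from gplearn.genetic import SymbolicRegressor

def syrbo_fit(X, y, n_stages=3, lr=1.0):
    stages, residual = [], y.astype(float)
    for _ in range(n_stages):
        reg = SymbolicRegressor(population_size=500, generations=10,
                                random_state=0)
        reg.fit(X, residual)
        stages.append(reg)
        residual = residual - lr * reg.predict(X)
    return stages

def syrbo_predict(stages, X, lr=1.0):
    return lr * sum(reg.predict(X) for reg in stages)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])
stages = syrbo_fit(X, y)
print("train MSE:", np.mean((syrbo_predict(stages, X) - y) ** 2))
```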
    Iterative Sound Source Localization for Unknown Number of Sources. (arXiv:2206.12273v1 [eess.AS])
Sound source localization aims to seek the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of an unknown number of sources, existing localization algorithms attempt to predict a likelihood-based coding (i.e., spatial spectrum) and employ a pre-determined threshold to detect the source number and corresponding DOA values. However, these threshold-based algorithms are not stable, since they depend on a careful choice of threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which iteratively extracts each source's DOA without thresholds until the termination criterion is met. Unlike threshold-based algorithms, ISSL designs an active source detector network based on a binary classifier that accepts the residual spatial spectrum and decides whether to stop the iteration. By doing so, our ISSL can deal with an arbitrary number of sources, even more than the number of sources seen during the training stage. The experimental results show that our ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with the existing threshold-based algorithms.
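The iterative loop can be pictured with a toy spatial spectrum; the Gaussian peaks and the thresholded stand-in for the detector network below are illustrative assumptions, not the paper's model:

```python
# Sketch of the ISSL-style loop: repeatedly take the strongest DOA from
# the residual spatial spectrum, peel off its contribution, and let a
# detector decide whether another source remains.
import numpy as np

angles = np.arange(0, 360)

def gaussian_peak(center, width=8.0):
    d = np.minimum(np.abs(angles - center), 360 - np.abs(angles - center))
    return np.exp(-0.5 * (d / width) ** 2)

rng = np.random.default_rng(0)
spectrum = gaussian_peak(40) + 0.7 * gaussian_peak(210) + 0.05 * rng.random(360)

def source_remains(residual):
    # stand-in for the binary "active source detector" network
    return residual.max() > 0.3

doas, residual = [], spectrum.copy()
while source_remains(residual):
    doa = int(residual.argmax())
    doas.append(doa)
    residual = residual - residual[doa] * gaussian_peak(doa)  # peel off peak
    residual = np.clip(residual, 0, None)
print("estimated DOAs:", doas)  # e.g. [40, 210]
```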
    From Tensor Network Quantum States to Tensorial Recurrent Neural Networks. (arXiv:2206.12363v1 [quant-ph])
    We show that any matrix product state (MPS) can be exactly represented by a recurrent neural network (RNN) with a linear memory update. We generalize this RNN architecture to 2D lattices using a multilinear memory update. It supports perfect sampling and wave function evaluation in polynomial time, and can represent an area law of entanglement entropy. Numerical evidence shows that it can encode the wave function using a bond dimension lower by orders of magnitude when compared to MPS, with an accuracy that can be systematically improved by increasing the bond dimension.
    RARTS: An Efficient First-Order Relaxed Architecture Search Method. (arXiv:2008.03901v2 [cs.LG] UPDATED)
Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of second-order DARTS. In this paper, we formulate a single-level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving the mixed second derivatives of the corresponding loss functions that DARTS requires. In our formulation of network splitting, two networks with different but related weights cooperate in search of a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown by extensive experimental results. For the task of searching topological architecture, i.e., the edges and the operations, RARTS obtains higher accuracy and a 60% reduction in computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS, even though our innovation is purely in the training algorithm, without modifying the search space. For the task of searching width, i.e., the number of channels in convolutional layers, RARTS also outperforms traditional network pruning benchmarks. Further experiments on public architecture search benchmarks such as NATS-Bench also support the preeminence of RARTS.
    MPClan: Protocol Suite for Privacy-Conscious Computations. (arXiv:2206.12224v1 [cs.CR])
The growing volumes of data being collected and its analysis to provide better services are creating worries about digital privacy. To address privacy concerns and give practical solutions, the literature has relied on secure multiparty computation. However, recent research has mostly focused on the small-party honest-majority setting of up to four parties, noting efficiency concerns. In this work, we extend the strategies to support a larger number of participants in an honest-majority setting with efficiency at the center stage. Cast in the preprocessing paradigm, our semi-honest protocol improves the online complexity of the decade-old state-of-the-art protocol of Damgård and Nielsen (CRYPTO'07). In addition to having an improved online communication cost, we can shut down almost half of the parties in the online phase, thereby saving up to 50% in the system's operational costs. Our maliciously secure protocol also enjoys similar benefits and requires only half of the parties, except for a one-time verification towards the end. To showcase the practicality of the designed protocols, we benchmark popular applications such as deep neural networks, graph neural networks, genome sequence matching, and biometric matching using prototype implementations. Our improved protocols aid in bringing up to 60-80% savings in monetary cost over prior work.
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v1 [cs.SI])
This paper studies the joint community detection and phase synchronization problem on the stochastic block model with relative phase, where each node is associated with a phase. This problem, with a variety of real-world applications, aims to recover community memberships and associated phases simultaneously. By studying the maximum likelihood estimation formulation, we show that this problem exhibits a "multi-frequency" structure. To this end, two simple yet efficient algorithms that leverage information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization, and the latter is an iterative multi-frequency generalized power method. Numerical experiments indicate that our proposed algorithms outperform state-of-the-art algorithms in recovering community memberships and associated phases.
    Adversarial Robustness of Deep Neural Networks: A Survey from a Formal Verification Perspective. (arXiv:2206.12227v1 [cs.CR])
    Neural networks have been widely applied in security applications such as spam and phishing detection, intrusion prevention, and malware detection. This black-box method, however, often has uncertainty and poor explainability in applications. Furthermore, neural networks themselves are often vulnerable to adversarial attacks. For those reasons, there is a high demand for trustworthy and rigorous methods to verify the robustness of neural network models. Adversarial robustness, which concerns the reliability of a neural network when dealing with maliciously manipulated inputs, is one of the hottest topics in security and machine learning. In this work, we survey existing literature in adversarial robustness verification for neural networks and collect 39 diversified research works across machine learning, security, and software engineering domains. We systematically analyze their approaches, including how robustness is formulated, what verification techniques are used, and the strengths and limitations of each technique. We provide a taxonomy from a formal verification perspective for a comprehensive understanding of this topic. We classify the existing techniques based on property specification, problem reduction, and reasoning strategies. We also demonstrate representative techniques that have been applied in existing studies with a sample model. Finally, we discuss open questions for future research.
    ModLaNets: Learning Generalisable Dynamics via Modularity and Physical Inductive Bias. (arXiv:2206.12325v1 [cs.LG])
Deep learning models are able to approximate one specific dynamical system but struggle at learning generalisable dynamics, where dynamical systems obey the same laws of physics but contain different numbers of elements (e.g., double- and triple-pendulum systems). To address this issue, we propose the Modular Lagrangian Network (ModLaNet), a structural neural network framework with modularity and physical inductive bias. This framework models the energy of each element using modularity and then constructs the target dynamical system via Lagrangian mechanics. Modularity is beneficial for reusing trained networks and reducing the scale of networks and datasets. As a result, our framework can learn from the dynamics of simpler systems and extend to more complex ones, which is not feasible using other relevant physics-informed neural networks. We examine our framework for modelling double-pendulum and three-body systems with small training datasets, where our models achieve the best data efficiency and accuracy compared with counterparts. We also reorganise our models as extensions to model multi-pendulum and multi-body systems, demonstrating the intriguing reusability of our framework.
    A Manifold-based Airfoil Geometric-feature Extraction and Discrepant Data Fusion Learning Method. (arXiv:2206.12254v1 [cs.LG])
The geometrical shape of an airfoil, together with the corresponding flight conditions, is a crucial factor in predicting aerodynamic performance. The airfoil geometric features obtained by most existing approaches (e.g., geometrical parameter extraction, polynomial description, and deep learning) lie in Euclidean space. State-of-the-art studies have shown that the curves or surfaces of an airfoil form a manifold in Riemannian space. Therefore, the features extracted by existing methods are not sufficient to reflect the geometric features of airfoils. Meanwhile, flight conditions and geometric features are highly heterogeneous, so the influence of these two factors on the final aerodynamic performance prediction must be evaluated and learned to improve prediction accuracy. Motivated by the advantages of manifold theory and multi-task learning, we propose a manifold-based airfoil geometric-feature extraction and discrepant data fusion learning method (MDF) that extracts geometric features of airfoils in Riemannian space (we call them manifold features) and further fuses the manifold features with flight conditions to predict aerodynamic performance. Experimental results show that our method extracts the geometric features of airfoils more accurately than existing methods: the average MSE of rebuilt airfoils is reduced by 56.33%, and, while keeping the same prediction accuracy for CL, the MSE of CD predicted by MDF is further reduced by 35.37%.
    zPROBE: Zero Peek Robustness Checks for Federated Learning. (arXiv:2206.12100v1 [cs.LG])
    Privacy-preserving federated learning allows multiple users to jointly train a model with coordination of a central server. The server only learns the final aggregation result, thereby preventing leakage of the users' (private) training data from the individual model updates. However, keeping the individual updates private allows malicious users to perform Byzantine attacks and degrade the model accuracy without being detected. Best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., the median, to find malicious updates. However, implementing privacy-preserving rank-based statistics is nontrivial and unscalable in the secure domain, as it requires sorting of all individual updates. We establish the first private robustness check that uses high break point rank-based statistics on aggregated model updates. By exploiting randomized clustering, we significantly improve the scalability of our defense without compromising privacy. We leverage the derived statistical bounds in zero-knowledge proofs to detect and remove malicious updates without revealing the private user updates. Our novel framework, zPROBE, enables Byzantine resilient and secure federated learning. Empirical evaluations demonstrate that zPROBE provides a low overhead solution to defend against state-of-the-art Byzantine attacks while preserving privacy.
    InfoAT: Improving Adversarial Training Using the Information Bottleneck Principle. (arXiv:2206.12292v1 [cs.LG])
    Adversarial training (AT) has shown excellent high performance in defending against adversarial examples. Recent studies demonstrate that examples are not equally important to the final robustness of models during AT, that is, the so-called hard examples that can be attacked easily exhibit more influence than robust examples on the final robustness. Therefore, guaranteeing the robustness of hard examples is crucial for improving the final robustness of the model. However, defining effective heuristics to search for hard examples is still difficult. In this article, inspired by the information bottleneck (IB) principle, we uncover that an example with high mutual information of the input and its associated latent representation is more likely to be attacked. Based on this observation, we propose a novel and effective adversarial training method (InfoAT). InfoAT is encouraged to find examples with high mutual information and exploit them efficiently to improve the final robustness of models. Experimental results show that InfoAT achieves the best robustness among different datasets and models in comparison with several state-of-the-art methods.
    Synthesizing Rolling Bearing Fault Samples in New Conditions: A framework based on a modified CGAN. (arXiv:2206.12076v1 [cs.LG])
Bearings are one of the vital components of rotating machines that are prone to unexpected faults. Therefore, bearing fault diagnosis and condition monitoring are essential for reducing operational costs and downtime in numerous industries. In various production conditions, bearings can be operated under a range of loads and speeds, which causes different vibration patterns associated with each fault type. Normal data is ample, as systems usually work in the desired conditions. On the other hand, fault data is rare, and in many conditions there is no data recorded for the fault classes. Accessing fault data is crucial for developing data-driven fault diagnosis tools that can improve both the performance and safety of operations. To this end, a novel algorithm based on Conditional Generative Adversarial Networks (CGANs) is introduced. Trained on normal and fault data from actual fault conditions, this algorithm generates fault data from the normal data of target conditions. The proposed method is validated on a real-world bearing dataset, and fault data are generated for different conditions. Several state-of-the-art classifiers and visualization models are implemented to evaluate the quality of the synthesized data. The results demonstrate the efficacy of the proposed algorithm.
AnyMorph: Learning Transferable Policies By Inferring Agent Morphology. (arXiv:2206.12279v1 [cs.LG])
    The prototypical approach to reinforcement learning involves training policies tailored to a particular agent from scratch for every new morphology. Recent work aims to eliminate the re-training of policies by investigating whether a morphology-agnostic policy, trained on a diverse set of agents with similar task objectives, can be transferred to new agents with unseen morphologies without re-training. This is a challenging problem that required previous approaches to use hand-designed descriptions of the new agent's morphology. Instead of hand-designing this description, we propose a data-driven method that learns a representation of morphology directly from the reinforcement learning objective. Ours is the first reinforcement learning algorithm that can train a policy to generalize to new agent morphologies without requiring a description of the agent's morphology in advance. We evaluate our approach on the standard benchmark for agent-agnostic control, and improve over the current state of the art in zero-shot generalization to new agents. Importantly, our method attains good performance without an explicit description of morphology.
    Multi-modal Sensor Data Fusion for In-situ Classification of Animal Behavior Using Accelerometry and GNSS Data. (arXiv:2206.12078v1 [cs.LG])
    We examine using data from multiple sensing modes, i.e., accelerometry and global navigation satellite system (GNSS), for classifying animal behavior. We extract three new features from the GNSS data, namely, the distance from the water point, median speed, and median estimated horizontal position error. We consider two approaches for combining the information available from the accelerometry and GNSS data. The first approach is based on concatenating the features extracted from both sensor data and feeding the concatenated feature vector into a multi-layer perceptron (MLP) classifier. The second approach is based on fusing the posterior probabilities predicted by two MLP classifiers each taking the features extracted from the data of one sensor as input. We evaluate the performance of the developed multi-modal animal behavior classification algorithms using two real-world datasets collected via smart cattle collar and ear tags. The leave-one-animal-out cross-validation results show that both approaches improve the classification performance appreciably compared with using the data from only one sensing mode, in particular, for the infrequent but important behaviors of walking and drinking. The algorithms developed based on both approaches require rather small computational and memory resources hence are suitable for implementation on embedded systems of our collar and ear tags. However, the multi-modal animal behavior classification algorithm based on posterior probability fusion is preferable to the one based on feature concatenation as it delivers better classification accuracy, has less computational and memory complexity, is more robust to sensor data failure, and enjoys better modularity.
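The two fusion approaches are easy to picture in code; the sketch below contrasts feature concatenation with posterior-probability fusion on synthetic stand-in features, using simple averaging as one possible fusion rule (an assumption; the paper's exact fusion may differ):

```python
# Sketch: feature concatenation vs. posterior-probability fusion of
# per-sensor MLPs. Synthetic data stands in for accelerometry/GNSS
# features; illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X_acc = rng.normal(size=(n, 6))    # accelerometry features (stand-in)
X_gnss = rng.normal(size=(n, 3))   # GNSS features (stand-in)
y = (X_acc[:, 0] + X_gnss[:, 0] > 0).astype(int)

idx_tr, idx_te = train_test_split(np.arange(n), random_state=0)

# Approach 1: concatenate features, one classifier.
clf_cat = MLPClassifier(max_iter=500, random_state=0)
clf_cat.fit(np.hstack([X_acc, X_gnss])[idx_tr], y[idx_tr])

# Approach 2: one classifier per sensor, fuse posteriors (here: average).
clf_a = MLPClassifier(max_iter=500, random_state=0).fit(X_acc[idx_tr], y[idx_tr])
clf_g = MLPClassifier(max_iter=500, random_state=0).fit(X_gnss[idx_tr], y[idx_tr])
p_fused = (clf_a.predict_proba(X_acc[idx_te]) +
           clf_g.predict_proba(X_gnss[idx_te])) / 2

acc_cat = (clf_cat.predict(np.hstack([X_acc, X_gnss])[idx_te]) == y[idx_te]).mean()
acc_fus = (p_fused.argmax(axis=1) == y[idx_te]).mean()
print("concat acc:", acc_cat, " fusion acc:", acc_fus)
```

A practical point the abstract makes: the fusion variant degrades gracefully when one sensor fails, since the remaining classifier's posterior can still be used alone.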
    Computational Complexity Evaluation of Neural Network Applications in Signal Processing. (arXiv:2206.12191v1 [eess.SP])
In this paper, we provide a systematic approach for assessing and comparing the computational complexity of neural network layers in digital signal processing. We provide and link four software-to-hardware complexity measures, defining how the different complexity metrics relate to the layers' hyper-parameters. This paper explains how to compute these four metrics for feed-forward and recurrent layers, and defines in which cases a particular metric ought to be used, depending on whether the application is characterized as more software- or hardware-oriented. One of the four metrics, called `the number of additions and bit shifts (NABS)', is newly introduced for heterogeneous quantization. NABS characterizes the impact not only of the bitwidth used in the operation but also of the type of quantization used in the arithmetical operations. We intend this work to serve as a baseline for the different levels (purposes) of complexity estimation related to the application of neural networks in real-time digital signal processing, aiming at unifying computational complexity estimation.
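To show how such metrics follow directly from layer hyper-parameters, here is a sketch of complexity counting for a single fully connected layer; the formulas, and especially the crude shift-and-add approximation of a quantized multiply, are illustrative assumptions, not the paper's exact NABS definition.

```python
# Sketch: hyper-parameter-based complexity counting for a dense layer.
def dense_real_mults(n_in: int, n_out: int) -> int:
    # real multiplications per inference for a fully connected layer
    return n_in * n_out

def dense_real_adds(n_in: int, n_out: int, bias: bool = True) -> int:
    # (n_in - 1) accumulations per output neuron, plus optional bias adds
    return (n_in - 1) * n_out + (n_out if bias else 0)

def mult_as_adds_and_shifts(bitwidth: int) -> int:
    # crude stand-in: a b-bit fixed-point multiply realized as up to b
    # shift-and-add steps (the actual NABS accounts for the operands'
    # quantization scheme, which this sketch ignores).
    return bitwidth

n_in, n_out, bits = 64, 32, 8
print("mults:", dense_real_mults(n_in, n_out))
print("adds :", dense_real_adds(n_in, n_out))
print("adds+shifts (approx):",
      dense_real_mults(n_in, n_out) * mult_as_adds_and_shifts(bits)
      + dense_real_adds(n_in, n_out))
```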
    Adversarial Zoom Lens: A Novel Physical-World Attack to DNNs. (arXiv:2206.12251v1 [cs.CR])
Although deep neural networks (DNNs) are known to be fragile, no one has studied the effects of zooming in and out of images in the physical world on DNN performance. In this paper, we demonstrate a novel physical adversarial attack technique called Adversarial Zoom Lens (AdvZL), which uses a zoom lens to zoom in and out of pictures of the physical world, fooling DNNs without changing the characteristics of the target object. The proposed method is, so far, the only adversarial attack technique that attacks DNNs without adding a physical adversarial perturbation. In a digital environment, we construct a dataset based on AdvZL to verify the adversarial effect of equal-scale enlarged images on DNNs. In the physical environment, we manipulate the zoom lens to zoom in and out of the target object and generate adversarial samples. The experimental results demonstrate the effectiveness of AdvZL in both digital and physical environments. We further analyze the adversarial effect of the proposed dataset on improved DNNs. On the other hand, we provide a guideline for defense against AdvZL by means of adversarial training. Finally, we look into the threat possibilities of the proposed approach to future autonomous driving, as well as variant attack ideas similar to the proposed attack.
    Towards FPGA Implementation of Neural Network-Based Nonlinearity Mitigation Equalizers in Coherent Optical Transmission Systems. (arXiv:2206.12180v1 [eess.SP])
For the first time, recurrent and feedforward neural network-based equalizers for nonlinearity compensation are implemented in an FPGA, with a level of complexity comparable to that of a dispersion equalizer. We demonstrate that the NN-based equalizers can outperform 1 step-per-span digital back-propagation (DBP).
    CoSP: Co-supervised pretraining of pocket and ligand. (arXiv:2206.12241v1 [cs.LG])
    Can we inject the pocket-ligand interaction knowledge into the pre-trained model and jointly learn their chemical space? Pretraining molecules and proteins has attracted considerable attention in recent years, while most of these approaches focus on learning one of the chemical spaces and lack the injection of biological knowledge. We propose a co-supervised pretraining (CoSP) framework to simultaneously learn 3D pocket and ligand representations. We use a gated geometric message passing layer to model both 3D pockets and ligands, where each node's chemical features, geometric position and orientation are considered. To learn biological meaningful embeddings, we inject the pocket-ligand interaction knowledge into the pretraining model via contrastive loss. Considering the specificity of molecules, we further propose a chemical similarity-enhanced negative sampling strategy to improve the contrastive learning performance. Through extensive experiments, we conclude that CoSP can achieve competitive results in pocket matching, molecule property predictions, and virtual screening.
    Reinforcement learning based adaptive metaheuristics. (arXiv:2206.12233v1 [cs.NE])
Parameter adaptation, that is, the capability to automatically adjust an algorithm's hyperparameters depending on the problem being faced, is one of the main trends in evolutionary computation applied to numerical optimization. While several handcrafted adaptation policies have been proposed over the years to address this problem, only a few attempts have been made so far to apply machine learning to learn such policies. Here, we introduce a general-purpose framework for performing parameter adaptation in continuous-domain metaheuristics based on state-of-the-art reinforcement learning algorithms. We demonstrate the applicability of this framework on two algorithms, namely Covariance Matrix Adaptation Evolution Strategies (CMA-ES) and Differential Evolution (DE), for which we learn, respectively, adaptation policies for the step-size (for CMA-ES), and the scale factor and crossover rate (for DE). We train these policies on a set of 46 benchmark functions at different dimensionalities, with various inputs to the policies, in two settings: one policy per function, and one global policy for all functions. Compared, respectively, to the Cumulative Step-size Adaptation (CSA) policy and to two well-known adaptive DE variants (iDE and jDE), our policies are able to produce competitive results in the majority of cases, especially in the case of DE.
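The control loop is the essence of this setup: a policy observes population statistics and emits the next generation's parameters. Below is a toy sketch for DE with a random placeholder where a trained RL agent (e.g., a PPO policy) would sit; the observation, action, and reward choices are illustrative assumptions, not the paper's.

```python
# Sketch: RL-style parameter adaptation for DE on the sphere function.
# The policy picks (F, CR) each generation; the reward is the
# improvement of the best fitness.
import numpy as np

def sphere(x):
    return np.sum(x ** 2, axis=-1)

rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, size=(30, 10))
fit = sphere(pop)

def policy(state):
    # placeholder: a trained RL agent would map state -> (F, CR)
    return rng.uniform(0.3, 0.9), rng.uniform(0.1, 0.9)

for gen in range(50):
    state = np.array([fit.mean(), fit.std(), fit.min()])  # observation
    F, CR = policy(state)                                 # action
    best_before = fit.min()
    for i in range(len(pop)):
        a, b, c = pop[rng.choice(len(pop), 3, replace=False)]
        mutant = a + F * (b - c)                          # DE/rand/1 mutation
        cross = rng.random(pop.shape[1]) < CR             # binomial crossover
        trial = np.where(cross, mutant, pop[i])
        if sphere(trial) < fit[i]:
            pop[i], fit[i] = trial, sphere(trial)
    reward = best_before - fit.min()  # would drive the policy update in training
print("best fitness:", fit.min())
```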
    World Value Functions: Knowledge Representation for Learning and Planning. (arXiv:2206.11940v1 [cs.AI])
    We propose world value functions (WVFs), a type of goal-oriented general value function that represents how to solve not just a given task, but any other goal-reaching task in an agent's environment. This is achieved by equipping an agent with an internal goal space defined as all the world states where it experiences a terminal transition. The agent can then modify the standard task rewards to define its own reward function, which provably drives it to learn how to achieve all reachable internal goals, and the value of doing so in the current task. We demonstrate two key benefits of WVFs in the context of learning and planning. In particular, given a learned WVF, an agent can compute the optimal policy in a new task by simply estimating the task's reward function. Furthermore, we show that WVFs also implicitly encode the transition dynamics of the environment, and so can be used to perform planning. Experimental results show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods to further improve sample efficiency.
    Cyclic Graph Attentive Match Encoder (CGAME): A Novel Neural Network For OD Estimation. (arXiv:2111.14625v3 [cs.LG] UPDATED)
Origin-Destination (OD) estimation plays an important role in traffic management and traffic simulation in the era of Intelligent Transportation Systems (ITS). Nevertheless, previous model-based methods face an under-determination challenge and therefore depend on additional assumptions and extra data. Deep learning provides an ideal data-based method for connecting inputs and outputs via probabilistic distribution transformation, yet research on applying deep learning to OD estimation remains limited, owing to the challenge of transforming data across representation spaces, in this case from a dynamic spatio-temporal space to a heterogeneous graph. To address this, we propose the Cyclic Graph Attentive Matching Encoder (C-GAME), based on a novel Graph Matcher with a double-layer attention mechanism. It realizes effective information exchange and establishes coupling relationships across the underlying feature spaces. The proposed model achieves state-of-the-art results in experiments and offers a novel framework for cross-space inference tasks in prospective applications.
    Classifying Unstructured Clinical Notes via Automatic Weak Supervision. (arXiv:2206.12088v1 [cs.CL])
    Healthcare providers usually record detailed notes of the clinical care delivered to each patient for clinical, research, and billing purposes. Due to the unstructured nature of these narratives, providers employ dedicated staff to assign diagnostic codes to patients' diagnoses using the International Classification of Diseases (ICD) coding system. This manual process is not only time-consuming but also costly and error-prone. Prior work demonstrated potential utility of Machine Learning (ML) methodology in automating this process, but it has relied on large quantities of manually labeled data to train the models. Additionally, diagnostic coding systems evolve with time, which makes traditional supervised learning strategies unable to generalize beyond local applications. In this work, we introduce a general weakly-supervised text classification framework that learns from class-label descriptions only, without the need to use any human-labeled documents. It leverages the linguistic domain knowledge stored within pre-trained language models and the data programming framework to assign code labels to individual texts. We demonstrate the efficacy and flexibility of our method by comparing it to state-of-the-art weak text classifiers across four real-world text classification datasets, in addition to assigning ICD codes to medical notes in the publicly available MIMIC-III database.
    Exploring System Performance of Continual Learning for Mobile and Embedded Sensing Applications. (arXiv:2110.13290v2 [cs.LG] UPDATED)
Continual learning approaches help deep neural network models adapt and learn incrementally by trying to solve catastrophic forgetting. However, whether these existing approaches, applied traditionally to image-based tasks, work with the same efficacy on the sequential time series data generated by mobile or embedded sensing systems remains an unanswered question. To address this void, we conduct the first comprehensive empirical study that quantifies the performance of three predominant continual learning schemes (i.e., regularization, replay, and replay with exemplars) on six datasets from three mobile and embedded sensing applications in a range of scenarios having different learning complexities. More specifically, we implement an end-to-end continual learning framework on edge devices. Then we investigate the generalizability, trade-offs between performance, storage, computational costs, and memory footprint of different continual learning methods. Our findings suggest that replay with exemplars-based schemes such as iCaRL has the best performance trade-offs, even in complex scenarios, at the expense of some storage space (a few MBs) for training examples (1% to 5%). We also demonstrate for the first time that it is feasible and practical to run continual learning on-device with a limited memory budget. In particular, the latency on two types of mobile and embedded devices suggests that both incremental learning time (a few seconds to 4 minutes) and training time (1 to 75 minutes) across datasets are acceptable, as training could happen on the device while the embedded device is charging, thereby ensuring complete data privacy. Finally, we present some guidelines for practitioners who want to apply a continual learning paradigm for mobile sensing tasks.
    On making optimal transport robust to all outliers. (arXiv:2206.11988v1 [stat.ML])
Optimal transport (OT) is known to be sensitive to outliers because of its marginal constraints. Outlier-robust OT variants have been proposed based on the definition that outliers are samples which are expensive to move. In this paper, we show that this definition is too restrictive, by considering the case where outliers are closer to the target measure than clean samples. We show that outlier-robust OT fully transports these outliers, leading to poor performance in practice. To tackle these outliers, we propose to detect them by relying on a classifier, trained with adversarial training, to classify source and target samples. A sample is then considered an outlier if the prediction from the classifier differs from its assigned label. To decrease the influence of these outliers on the transport problem, we propose either to remove them from the problem or to increase the cost of moving them by using the classifier prediction. We show that we successfully detect these outliers and that they do not influence the transport problem, in several experiments such as gradient flows, generative models and label propagation.
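A sketch of the detection rule on toy 2-D data; a plain logistic regression stands in for the adversarially trained classifier (an assumption made for brevity):

```python
# Sketch: train a classifier to separate source from target samples;
# a sample whose prediction disagrees with its assigned domain label
# is flagged as an outlier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, size=(200, 2))
target = rng.normal(loc=3.0, size=(200, 2))
# inject outliers: source points sitting on top of the target measure
source[:10] = rng.normal(loc=3.0, size=(10, 2))

X = np.vstack([source, target])
d = np.array([0] * len(source) + [1] * len(target))  # domain labels

clf = LogisticRegression().fit(X, d)
outliers = np.where(clf.predict(X) != d)[0]
print("flagged indices:", outliers[:15])
# These samples can then be removed from the OT problem, or their
# transport cost increased using the classifier's prediction.
```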
    How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. (arXiv:2206.12037v1 [cs.LG])
    Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.
    Self Supervised Learning for Few Shot Hyperspectral Image Classification. (arXiv:2206.12117v1 [cs.CV])
    Deep learning has proven to be a very effective approach for Hyperspectral Image (HSI) classification. However, deep neural networks require large annotated datasets to generalize well. This limits the applicability of deep learning for HSI classification, where manually labelling thousands of pixels for every scene is impractical. In this paper, we propose to leverage Self Supervised Learning (SSL) for HSI classification. We show that by pre-training an encoder on unlabeled pixels using Barlow-Twins, a state-of-the-art SSL algorithm, we can obtain accurate models with a handful of labels. Experimental results demonstrate that this approach significantly outperforms vanilla supervised learning.
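For reference, the Barlow Twins objective driving the pre-training step can be sketched in a few lines; the embedding sizes and the lambda weight below are illustrative, not the paper's settings.

```python
# Sketch of the Barlow Twins loss: drive the cross-correlation matrix
# of two augmented views' embeddings toward the identity (invariance
# on the diagonal, redundancy reduction off it).
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lam: float = 5e-3):
    n, d = z1.shape
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-9)  # normalize per dimension
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-9)
    c = (z1.T @ z2) / n                          # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

z1, z2 = torch.randn(128, 64), torch.randn(128, 64)
print(barlow_twins_loss(z1, z2))
```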
    Implicit Channel Learning for Machine Learning Applications in 6G Wireless Networks. (arXiv:2206.12127v1 [eess.SP])
With the deployment of fifth generation (5G) wireless systems gathering momentum across the world, possible technologies for 6G are under active research discussion. In particular, the role of machine learning (ML) in 6G is expected to enhance and aid emerging applications such as virtual and augmented reality, vehicular autonomy, and computer vision. This will result in large segments of wireless data traffic comprising image, video and speech. The ML algorithms process these for classification/recognition/estimation through learning models located on cloud servers. This requires wireless transmission of data from edge devices to the cloud server. Channel estimation, handled separately from the recognition step, is critical for accurate learning performance. Toward combining the learning for both the channel and the ML data, we introduce implicit channel learning to perform the ML tasks without estimating the wireless channel. Here, the ML models are trained with channel-corrupted datasets in place of nominal data. Without channel estimation, the proposed approach exhibits approximately 60% improvement in image and speech classification tasks for diverse scenarios such as millimeter-wave and IEEE 802.11p vehicular channels.
    TreeDRNet:A Robust Deep Model for Long Term Time Series Forecasting. (arXiv:2206.12106v1 [cs.LG])
Various deep learning models, especially some of the latest Transformer-based approaches, have greatly improved the state-of-the-art performance for long-term time series forecasting. However, those Transformer-based models suffer severe performance deterioration with prolonged input length, which prohibits them from using extended historical information. Moreover, these methods tend to handle complex examples in long-term forecasting with increased model complexity, which often leads to a significant increase in computation and less robustness in performance (e.g., overfitting). We propose a novel neural network architecture, called TreeDRNet, for more effective long-term forecasting. Inspired by robust regression, we introduce a doubly residual link structure to make prediction more robust. Built upon the Kolmogorov-Arnold representation theorem, we explicitly introduce feature selection, model ensembling, and a tree structure to further utilize the extended input sequence, which improves the robustness and representation power of TreeDRNet. Unlike previous deep models for sequential forecasting, TreeDRNet is built entirely on multilayer perceptrons and thus enjoys high computational efficiency. Our extensive empirical studies show that TreeDRNet is significantly more effective than state-of-the-art methods, reducing prediction errors by 20% to 40% for multivariate time series. In particular, TreeDRNet is over 10 times more efficient than Transformer-based methods. The code will be released soon.
    On Structural Explanation of Bias in Graph Neural Networks. (arXiv:2206.12104v1 [cs.LG])
    Graph Neural Networks (GNNs) have shown satisfying performance in various graph analytical problems. Hence, they have become the \emph{de facto} solution in a variety of decision-making scenarios. However, GNNs could yield biased results against certain demographic subgroups. Some recent works have empirically shown that the biased structure of the input network is a significant source of bias for GNNs. Nevertheless, no studies have systematically scrutinized which part of the input network structure leads to biased predictions for any given node. The low transparency on how the structure of the input network influences the bias in GNN outcome largely limits the safe adoption of GNNs in various decision-critical scenarios. In this paper, we study a novel research problem of structural explanation of bias in GNNs. Specifically, we propose a novel post-hoc explanation framework to identify two edge sets that can maximally account for the exhibited bias and maximally contribute to the fairness level of the GNN prediction for any given node, respectively. Such explanations not only provide a comprehensive understanding of bias/fairness of GNN predictions but also have practical significance in building an effective yet fair GNN model. Extensive experiments on real-world datasets validate the effectiveness of the proposed framework towards delivering effective structural explanations for the bias of GNNs. Open-source code can be found at https://github.com/yushundong/REFEREE.
    MULTI-FLGANs: Multi-Distributed Adversarial Networks for Non-IID distribution. (arXiv:2206.12178v1 [cs.LG])
Federated learning is an emerging concept in the domain of distributed machine learning. This concept has enabled GANs to benefit from rich distributed training data while preserving privacy. However, in a non-IID setting, current federated GAN architectures are unstable, struggle to learn distinct features, and are vulnerable to mode collapse. In this paper, we propose a novel architecture, MULTI-FLGAN, to solve the problems of low-quality images, mode collapse and instability for non-IID datasets. Our results show that MULTI-FLGAN is, on average over 20 clients, four times as stable and performant (i.e., achieves a higher inception score) compared to the baseline FLGAN.
    Knowledge Distillation via Weighted Ensemble of Teaching Assistants. (arXiv:2206.12005v1 [cs.LG])
Knowledge distillation in machine learning is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. Knowledge distillation is one of the techniques to compress the large network (teacher) into a smaller network (student) that can be deployed on small devices such as mobile phones. When the network size gap between the teacher and student increases, the performance of the student network decreases. To solve this problem, an intermediate model is employed between the teacher model and the student model, known as the teaching assistant model, which in turn bridges the gap between the teacher and the student. In this research, we have shown that using multiple teaching assistant models, the student model (the smaller model) can be further improved. We combined these multiple teaching assistant models using weighted ensemble learning, where we used a differential evolution optimization algorithm to generate the weight values.
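The final ensembling step lends itself to a short sketch: differential evolution searches for the teaching-assistant weights that minimize an objective on held-out data. The synthetic logits and the negative log-likelihood objective below are illustrative stand-ins for real TA outputs and the paper's distillation target.

```python
# Sketch: weight a set of teaching-assistant logits with weights found
# by SciPy's differential evolution, minimizing negative log-likelihood
# on (synthetic) held-out labels.
import numpy as np
from scipy.optimize import differential_evolution

rng = np.random.default_rng(0)
n, k, n_tas = 500, 10, 3
y = rng.integers(0, k, size=n)
# fake TA logits, each somewhat predictive of y
ta_logits = np.stack([rng.normal(size=(n, k)) + 2.0 * np.eye(k)[y]
                      for _ in range(n_tas)])

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neg_log_lik(w):
    w = np.abs(w) / (np.abs(w).sum() + 1e-12)        # normalize weights
    probs = softmax(np.tensordot(w, ta_logits, axes=1))
    return -np.log(probs[np.arange(n), y] + 1e-12).mean()

res = differential_evolution(neg_log_lik, bounds=[(0, 1)] * n_tas, seed=0)
print("ensemble weights:", np.abs(res.x) / np.abs(res.x).sum())
```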
    Aggregated Multi-output Gaussian Processes with Knowledge Transfer Across Domains. (arXiv:2206.12141v1 [stat.ML])
    Aggregate data often appear in various fields such as socio-economics and public security. The aggregate data are associated not with points but with supports (e.g., spatial regions in a city). Since the supports may have various granularities depending on attributes (e.g., poverty rate and crime rate), modeling such data is not straightforward. This article offers a multi-output Gaussian process (MoGP) model that infers functions for attributes using multiple aggregate datasets of respective granularities. In the proposed model, the function for each attribute is assumed to be a dependent GP modeled as a linear mixing of independent latent GPs. We design an observation model with an aggregation process for each attribute; the process is an integral of the GP over the corresponding support. We also introduce a prior distribution of the mixing weights, which allows a knowledge transfer across domains (e.g., cities) by sharing the prior. This is advantageous in such a situation where the spatially aggregated dataset in a city is too coarse to interpolate; the proposed model can still make accurate predictions of attributes by utilizing aggregate datasets in other cities. The inference of the proposed model is based on variational Bayes, which enables one to learn the model parameters using the aggregate datasets from multiple domains. The experiments demonstrate that the proposed model outperforms in the task of refining coarse-grained aggregate data on real-world datasets: Time series of air pollutants in Beijing and various kinds of spatial datasets from New York City and Chicago.
    Discrete-Continuous Smoothing and Mapping. (arXiv:2204.11936v2 [cs.RO] UPDATED)
    We describe a general approach to smoothing and mapping with a class of discrete-continuous factor graphs commonly encountered in robotics applications. While there are openly available tools providing flexible and easy-to-use interfaces for specifying and solving optimization problems formulated in terms of either discrete or continuous graphical models, at present, no similarly general tools exist enabling the same functionality for hybrid discrete-continuous problems. We aim to address this problem. In particular, we provide a library, DC-SAM, extending existing tools for optimization problems defined in terms of factor graphs to the setting of discrete-continuous models. A key contribution of our work is a novel solver for efficiently recovering approximate solutions to discrete-continuous optimization problems. The key insight to our approach is that while joint inference over continuous and discrete state spaces is often hard, many commonly encountered discrete-continuous problems can naturally be split into a "discrete part" and a "continuous part" that can individually be solved easily. Leveraging this structure, we optimize discrete and continuous variables in an alternating fashion. In consequence, our proposed work enables straightforward representation of and approximate inference in discrete-continuous graphical models. We also provide a method to recover the uncertainty in estimates of both discrete and continuous variables. We demonstrate the versatility of our approach through its application to three distinct robot perception applications: point-cloud registration, robust pose graph optimization, and object-based mapping and localization.
    Supervised learning of random quantum circuits via scalable neural networks. (arXiv:2206.10348v2 [quant-ph] UPDATED)
Predicting the output of quantum circuits is a hard computational task that plays a pivotal role in the development of universal quantum computers. Here we investigate the supervised learning of output expectation values of random quantum circuits. Deep convolutional neural networks (CNNs) are trained to predict single-qubit and two-qubit expectation values using databases of classically simulated circuits. These circuits are represented via an appropriately designed one-hot encoding of the constituent gates. The prediction accuracy for previously unseen circuits is analyzed, also making comparisons with small-scale quantum computers available from the free IBM Quantum program. The CNNs often outperform the quantum devices, depending on the circuit depth, on the network depth, and on the training set size. Notably, our CNNs are designed to be scalable. This allows us to exploit transfer learning and perform extrapolations to circuits larger than those included in the training set. These CNNs also demonstrate remarkable resilience against noise, namely, they remain accurate even when trained on (simulated) expectation values averaged over very few measurements.
    F3: Fair and Federated Face Attribute Classification with Heterogeneous Data. (arXiv:2109.02351v3 [cs.LG] UPDATED)
    Fairness across different demographic groups is an essential criterion for face-related tasks, Face Attribute Classification (FAC) being a prominent example. Apart from this trend, Federated Learning (FL) is increasingly gaining traction as a scalable paradigm for distributed training. Existing FL approaches require data homogeneity to ensure fairness. However, this assumption is too restrictive in real-world settings. We propose F3, a novel FL framework for fair FAC under data heterogeneity. F3 adopts multiple heuristics to improve fairness across different demographic groups without requiring data homogeneity assumption. We demonstrate the efficacy of F3 by reporting empirically observed fairness measures and accuracy guarantees on popular face datasets. Our results suggest that F3 strikes a practical balance between accuracy and fairness for FAC.
    Parallel Deep Neural Networks Have Zero Duality Gap. (arXiv:2110.06482v2 [cs.LG] UPDATED)
    Training deep neural networks is a well-known highly non-convex problem. In recent works, it is shown that there is no duality gap for regularized two-layer neural networks with ReLU activation, which enables global optimization via convex programs. For multi-layer linear networks with vector outputs, we formulate convex dual problems and demonstrate that the duality gap is non-zero for depth three and deeper networks. However, by modifying the deep networks to more powerful parallel architectures, we show that the duality gap is exactly zero. Therefore, strong convex duality holds, and hence there exist equivalent convex programs that enable training deep networks to global optimality. We also demonstrate that the weight decay regularization in the parameters explicitly encourages low-rank solutions via closed-form expressions. For three-layer non-parallel ReLU networks, we show that strong duality holds for rank-1 data matrices, however, the duality gap is non-zero for whitened data matrices. Similarly, by transforming the neural network architecture into a corresponding parallel version, the duality gap vanishes.
    Segmentation-free PVC for Cardiac SPECT using a Densely-connected Multi-dimensional Dynamic Network. (arXiv:2206.12344v1 [eess.IV])
In nuclear imaging, limited resolution causes partial volume effects (PVEs) that affect image sharpness and quantitative accuracy. Partial volume correction (PVC) methods incorporating high-resolution anatomical information from CT or MRI have been demonstrated to be effective. However, such anatomically guided methods typically require tedious image registration and segmentation steps. Accurately segmented organ templates are also hard to obtain, particularly in cardiac SPECT imaging, due to the lack of hybrid SPECT/CT scanners with high-end CT and associated motion artifacts. Slight mis-registration/mis-segmentation would result in severe degradation in image quality after PVC. In this work, we develop a deep-learning-based method for fast cardiac SPECT PVC without anatomical information and associated organ segmentation. The proposed network involves a densely-connected multi-dimensional dynamic mechanism, allowing the convolutional kernels to be adapted based on the input images, even after the network is fully trained. Intramyocardial blood volume (IMBV) is introduced as an additional clinically relevant loss function for network optimization. The proposed network demonstrated promising performance on 28 canine studies acquired on a GE Discovery NM/CT 570c dedicated cardiac SPECT scanner with a 64-slice CT using Technetium-99m-labeled red blood cells. This work showed that the proposed network with the densely-connected dynamic mechanism produced superior results compared with the same network without such a mechanism. Results also showed that the proposed network without anatomical information could produce images with statistically comparable IMBV measurements to the images generated by anatomically guided PVC methods, which could be helpful in clinical translation.
    A Spatio-temporal Track Association Algorithm Based on Marine Vessel Automatic Identification System Data. (arXiv:2010.15921v2 [cs.LG] UPDATED)
    Tracking multiple moving objects in real time in a dynamic threat environment is an important element of national security and surveillance systems. It helps pinpoint and distinguish potential candidates posing threats from other normal objects and monitor the anomalous trajectories until intervention. To locate anomalous patterns of movement, one needs an accurate data association algorithm that can associate the sequential observations of locations and motion with the underlying moving objects, and thereby build the trajectories of the objects as they move. In this work, we develop a spatio-temporal approach for tracking maritime vessels as the vessels' location and motion observations are collected by an Automatic Identification System. The proposed approach is developed as an effort to address a data association challenge in which the number of vessels as well as the vessel identities are purposely withheld, and time gaps are created in the datasets to mimic real-life operational complexities under a threat environment. Three training datasets and five test sets are provided in the challenge, and a set of quantitative performance metrics is devised by the data challenge organizer for evaluating and comparing the resulting methods developed by participants. When our proposed track association algorithm is applied to the five test sets, it achieves very competitive performance.
    A Disability Lens towards Biases in GPT-3 Generated Open-Ended Languages. (arXiv:2206.11993v1 [cs.CL])
    Language models (LMs) are becoming prevalent in many language-based application spaces globally. Although these LMs are improving our day-to-day interactions with digital products, concerns remain about whether the open-ended language or text they generate reveals biases toward specific groups of people, thereby risking the usability of a product. Identifying whether these models possess biases is necessary for improving their fairness. This gap motivates our ongoing work, in which we measure two aspects of bias in GPT-3-generated text through a disability lens.
    Similarity-aware Positive Instance Sampling for Graph Contrastive Pre-training. (arXiv:2206.11959v1 [cs.LG])
    Graph instance contrastive learning has proved to be an effective task for Graph Neural Network (GNN) pre-training. However, one key issue may seriously impede the representational power of existing works: positive instances created by current methods often miss crucial information of graphs or even yield illegal instances (such as non-chemically-aware graphs in molecular generation). To remedy this issue, we propose to select positive graph instances directly from existing graphs in the training set, which ultimately maintains legality and similarity to the target graphs. Our selection is based on certain domain-specific pair-wise similarity measurements as well as sampling from a hierarchical graph encoding similarity relations among graphs. Besides, we develop an adaptive node-level pre-training method to dynamically mask nodes so that they are distributed evenly in the graph. We conduct extensive experiments on $13$ graph classification and node classification benchmark datasets from various domains. The results demonstrate that GNN models pre-trained with our strategies can outperform models trained from scratch as well as variants obtained by existing methods.
    Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings. (arXiv:2206.12081v1 [cs.LG])
    We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDP where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under the function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature, and the optimal $Q$-function has a gap in actions, we provide a \emph{computationally and statistically efficient} algorithm for finding the \emph{exact optimal} policy. We show our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show both the deterministic latent transitions and gap assumptions are necessary to avoid statistical complexity exponential in horizon or dimension. Since our guarantee does not have an explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.
    Neural Networks with A La Carte Selection of Activation Functions. (arXiv:2206.12166v1 [cs.NE])
    Activation functions (AFs), which are pivotal to the success (or failure) of a neural network, have received increased attention in recent years, with researchers seeking to design novel AFs that improve some aspect of network performance. In this paper we take another direction, wherein we combine a slew of known AFs into successful architectures, proposing three methods to do so beneficially: 1) generate AF architectures at random, 2) use Optuna, an automatic hyper-parameter optimization software framework, with a Tree-structured Parzen Estimator (TPE) sampler, and 3) use Optuna with a Covariance Matrix Adaptation Evolution Strategy (CMA-ES) sampler. We show that all methods often produce significantly better results for 25 classification problems when compared with a standard network composed of ReLU hidden units and a softmax output unit. Optuna with the TPE sampler emerged as the best AF architecture-producing method.  ( 2 min )
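    Since Optuna is a public library, method (2) can be sketched directly; the toy task, network, and training loop below are stand-ins, not the paper's experimental setup.

    ```python
    import optuna
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(256, 10)
    y = (X[:, 0] * X[:, 1] > 0).long()           # toy XOR-like labels
    AFS = {"relu": nn.ReLU, "tanh": nn.Tanh, "gelu": nn.GELU, "elu": nn.ELU}

    def objective(trial):
        # One categorical choice per hidden layer; TPE proposes promising mixes.
        layers, width = [], 10
        for i in range(2):
            af = trial.suggest_categorical(f"af_{i}", list(AFS))
            layers += [nn.Linear(width, 32), AFS[af]()]
            width = 32
        model = nn.Sequential(*layers, nn.Linear(32, 2))
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(100):
            opt.zero_grad()
            nn.functional.cross_entropy(model(X), y).backward()
            opt.step()
        return (model(X).argmax(1) == y).float().mean().item()

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler(seed=0))
    study.optimize(objective, n_trials=20)
    print(study.best_params)
    ```

    Swapping in optuna.samplers.CmaEsSampler() would correspond to method (3).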
    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems. (arXiv:2206.12020v1 [cs.LG])
    We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.  ( 2 min )
    Bilateral Network with Channel Splitting Network and Transformer for Thermal Image Super-Resolution. (arXiv:2206.12046v1 [cs.CV])
    In recent years, the Thermal Image Super-Resolution (TISR) problem has become an attractive research topic. TISR can be used in a wide range of fields, including military, medical, agricultural and animal ecology applications. Due to the success of the PBVS-2020 and PBVS-2021 workshop challenges, TISR results keep improving and more researchers have signed up for the PBVS-2022 challenge. In this paper, we introduce the technical details of our submission to the PBVS-2022 challenge, designing a Bilateral Network with Channel Splitting Network and Transformer (BN-CSNT) to tackle the TISR problem. Firstly, we designed a context branch based on a channel splitting network with a transformer to obtain sufficient context information. Secondly, we designed a spatial branch with a shallow transformer to extract low-level features which can preserve the spatial information. Finally, to fuse the features from the channel splitting network and the transformer in the context branch, we proposed an attention refinement module; features from the context branch and the spatial branch are then fused by the proposed feature fusion module. The proposed method achieves PSNR=33.64, SSIM=0.9263 for x4 and PSNR=21.08, SSIM=0.7803 for x2 on the PBVS-2022 challenge test dataset.  ( 2 min )
    Learning quantum symmetries with interactive quantum-classical variational algorithms. (arXiv:2206.11970v1 [quant-ph])
    A symmetry of a state $\lvert \psi \rangle$ is a unitary operator of which $\lvert \psi \rangle$ is an eigenvector. When $\lvert \psi \rangle$ is an unknown state supplied by a black-box oracle, the state's symmetries serve to characterize it, and often reveal much of the desired information about $\lvert \psi \rangle$. In this paper, we develop a variational hybrid quantum-classical learning scheme to systematically probe for symmetries of $\lvert \psi \rangle$ with no a priori assumptions about the state. This procedure can be used to learn various symmetries at the same time. In order to avoid re-learning already known symmetries, we introduce an interactive protocol with a classical deep neural net. The classical net thereby regularizes against repetitive findings and allows our algorithm to terminate empirically when all possible symmetries are found. Our scheme can be implemented efficiently on average with non-local SWAP gates; we also give a less efficient algorithm with only local operations, which may be more appropriate for current noisy quantum devices. We demonstrate our algorithm on representative families of states.  ( 2 min )
    The Real Deal: A Review of Challenges and Opportunities in Moving Reinforcement Learning-Based Traffic Signal Control Systems Towards Reality. (arXiv:2206.11996v1 [cs.AI])
    Traffic signal control (TSC) is a high-stakes domain that is growing in importance as traffic volume grows globally. An increasing number of works are applying reinforcement learning (RL) to TSC; RL can draw on an abundance of traffic data to improve signalling efficiency. However, RL-based signal controllers have never been deployed. In this work, we provide the first review of challenges that must be addressed before RL can be deployed for TSC. We focus on four challenges involving (1) uncertainty in detection, (2) reliability of communications, (3) compliance and interpretability, and (4) heterogeneous road users. We show that the literature on RL-based TSC has made some progress towards addressing each challenge. However, more work should take a systems thinking approach that considers the impacts of other pipeline components on RL.  ( 2 min )
    PSP: Million-level Protein Sequence Dataset for Protein Structure Prediction. (arXiv:2206.12240v1 [q-bio.BM])
    Proteins are essential components of human life and their structures are important for function and mechanism analysis. Recent work has shown the potential of AI-driven methods for protein structure prediction. However, the development of new models is restricted by the lack of datasets and benchmark training procedures. To the best of our knowledge, the existing open-source datasets fall far short of satisfying the needs of modern protein sequence-structure research. To solve this problem, we present the first million-level protein structure prediction dataset with high coverage and diversity, named PSP. This dataset consists of 570k true structure sequences (10TB) and 745k complementary distillation sequences (15TB). In addition, we provide a benchmark training procedure for a SOTA protein structure prediction model on this dataset. We validate the utility of this dataset for training by participating in the CAMEO contest, in which our model won first place. We hope our PSP dataset together with the training benchmark can enable a broader community of AI/biology researchers to pursue AI-driven protein-related research.
    BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping. (arXiv:2206.12038v1 [cs.SD])
    Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks.  ( 2 min )
    Efficient and Accurate Top-$K$ Recovery from Choice Data. (arXiv:2206.11995v1 [cs.LG])
    The intersection of learning to rank and choice modeling is an active area of research with applications in e-commerce, information retrieval and the social sciences. In some applications, such as recommendation systems, the statistician is primarily interested in recovering the set of top-ranked items from a large pool of items as efficiently as possible using passively collected discrete choice data, i.e., the user picks one item from a set of multiple items. Motivated by this practical consideration, we propose the choice-based Borda count algorithm as a fast and accurate ranking algorithm for top-$K$ recovery, i.e., correctly identifying all of the top $K$ items. We show that the choice-based Borda count algorithm has optimal sample complexity for top-$K$ recovery under a broad class of random utility models. We prove that, in the limit, the choice-based Borda count algorithm produces the same top-$K$ estimate as the commonly used Maximum Likelihood Estimate method, but the former's speed and simplicity bring considerable advantages in practice. Experiments on both synthetic and real datasets show that the counting algorithm is competitive with commonly used ranking algorithms in terms of accuracy while being several orders of magnitude faster.  ( 2 min )
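    As a rough sketch of the counting idea (the paper's exact scoring rule may differ), each chosen item can earn Borda-style credit equal to the number of alternatives it beat:

    ```python
    from collections import defaultdict

    def choice_borda_topk(choices, k):
        """choices: iterable of (offered_set, picked_item) discrete-choice records."""
        score = defaultdict(float)
        for offered, picked in choices:
            score[picked] += len(offered) - 1    # credit for the alternatives beaten
        return sorted(score, key=score.get, reverse=True)[:k]

    data = [({"a", "b", "c"}, "a"), ({"a", "b"}, "a"), ({"b", "c"}, "c"),
            ({"a", "c", "d"}, "a"), ({"b", "d"}, "b"), ({"b", "c", "d"}, "b")]
    print(choice_borda_topk(data, 2))            # ['a', 'b'] on this toy data
    ```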
    Sampling Enclosing Subgraphs for Link Prediction. (arXiv:2206.12004v1 [cs.LG])
    Link prediction is a fundamental problem for graph-structured data (e.g., social networks, drug side-effect networks, etc.). Graph neural networks have offered robust solutions for this problem, specifically by learning the representation of the subgraph enclosing the target link (i.e., pair of nodes). However, these solutions do not scale well to large graphs, as extraction of and operation on enclosing subgraphs are computationally expensive. This paper presents a scalable link prediction solution that we call ScaLed, which utilizes sparse enclosing subgraphs to make predictions. To extract sparse enclosing subgraphs, ScaLed takes multiple random walks from a target pair of nodes, then operates on the sampled enclosing subgraph induced by all visited nodes. By leveraging the smaller sampled enclosing subgraph, ScaLed can scale to larger graphs with much less overhead while maintaining high accuracy. ScaLed further provides the flexibility to control the trade-off between computational overhead and accuracy. Through comprehensive experiments, we have shown that ScaLed can produce accuracy comparable to that reported by existing subgraph representation learning frameworks while being less computationally demanding.  ( 2 min )
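    A minimal sketch of the sampling step, assuming plain uniform random walks from the two endpoints (ScaLed's actual sampler and hyper-parameters may differ):

    ```python
    import random
    import networkx as nx

    def sampled_enclosing_subgraph(G, u, v, walk_len=3, num_walks=5, seed=0):
        """Induced subgraph over nodes visited by walks from both link endpoints."""
        rng = random.Random(seed)
        visited = {u, v}
        for start in (u, v):
            for _ in range(num_walks):
                node = start
                for _ in range(walk_len):
                    nbrs = list(G.neighbors(node))
                    if not nbrs:
                        break
                    node = rng.choice(nbrs)
                    visited.add(node)
        return G.subgraph(visited).copy()

    G = nx.karate_club_graph()
    sub = sampled_enclosing_subgraph(G, 0, 33)
    print(sub.number_of_nodes(), "of", G.number_of_nodes(), "nodes sampled")
    ```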
    Approximating 1-Wasserstein Distance with Trees. (arXiv:2206.12116v1 [stat.ML])
    Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing (NLP) and computer vision (CV) applications. One of the challenges in estimating Wasserstein distance is that it is computationally expensive and does not scale well for many distribution comparison tasks. In this paper, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where TWD is a 1-Wasserstein distance with tree-based embedding and can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach to learning the weights of the edges in a tree. To this end, we first show that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and can be formulated as a Lasso-based regression problem. Owing to the convex formulation, we can obtain a globally optimal solution efficiently. Moreover, we propose a tree-sliced variant of these methods. Through experiments, we demonstrated that the weighted TWD can accurately approximate the original 1-Wasserstein distance.  ( 2 min )
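    Because the shortest-path distance on a tree is linear in the edge weights, the Lasso step can be sketched with scikit-learn; the path-edge incidence matrix and target distances below are toy values, not data from the paper.

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    # Rows = pairs of tree leaves, columns = edges; B[p, e] = 1 iff edge e lies on
    # the path between pair p, so tree distance = B @ edge_weights (a linear model).
    B = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 1],
                  [1, 0, 1, 0],
                  [0, 1, 1, 1],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    d_w1 = np.array([2.0, 3.1, 2.2, 2.9, 2.0, 1.1])   # target 1-Wasserstein values

    # L1-regularized, non-negative regression of edge weights onto the targets.
    lasso = Lasso(alpha=0.01, positive=True, fit_intercept=False)
    lasso.fit(B, d_w1)
    print("edge weights:", lasso.coef_)
    print("tree approximation:", B @ lasso.coef_)
    ```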
    Measuring Representational Robustness of Neural Networks Through Shared Invariances. (arXiv:2206.11939v1 [cs.LG])
    A major challenge in studying robustness in deep learning is defining the set of ``meaningless'' perturbations to which a given Neural Network (NN) should be invariant. Most work on robustness implicitly uses a human as the reference model to define such perturbations. Our work offers a new view on robustness by using another reference NN to define the set of perturbations a given NN should be invariant to, thus generalizing the reliance on a reference ``human NN'' to any NN. This makes measuring robustness equivalent to measuring the extent to which two NNs share invariances, for which we propose a measure called STIR. STIR re-purposes existing representation similarity measures to make them suitable for measuring shared invariances. Using our measure, we are able to gain insights into how shared invariances vary with changes in weight initialization, architecture, loss functions, and training dataset. Our implementation is available at: \url{https://github.com/nvedant07/STIR}.
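    One similarity measure commonly re-purposed this way is linear CKA; below is a minimal sketch of the measure itself (not STIR, which additionally evaluates similarity under perturbed inputs):

    ```python
    import numpy as np

    def linear_cka(X, Y):
        """Linear CKA between two representation matrices (n samples x d features)."""
        X = X - X.mean(0)
        Y = Y - Y.mean(0)
        num = np.linalg.norm(X.T @ Y, "fro") ** 2
        den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
        return num / den

    rng = np.random.default_rng(0)
    acts_a = rng.normal(size=(128, 64))          # activations of network A
    q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
    acts_b = acts_a @ q                          # an orthogonal transform of A
    print(round(linear_cka(acts_a, acts_b), 3))  # 1.0: CKA ignores rotations
    ```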
    "You Can't Fix What You Can't Measure": Privately Measuring Demographic Performance Disparities in Federated Learning. (arXiv:2206.12183v1 [cs.LG])
    Federated learning allows many devices to collaborate in the training of machine learning models. As in traditional machine learning, there is a growing concern that models trained with federated learning may exhibit disparate performance for different demographic groups. Existing solutions to measure and ensure equal model performance across groups require access to information about group membership, but this access is not always available or desirable, especially under the privacy aspirations of federated learning. We study the feasibility of measuring such performance disparities while protecting the privacy of the user's group membership and the federated model's performance on the user's data. Protecting both is essential for privacy, because they may be correlated, and thus learning one may reveal the other. On the other hand, from the utility perspective, the privacy-preserved data should maintain the correlation to ensure the ability to perform accurate measurements of the performance disparity. We achieve both of these goals by developing locally differentially private mechanisms that preserve the correlations between group membership and model performance. To analyze the effectiveness of the mechanisms, we bound their error in estimating the disparity when optimized for a given privacy budget, and validate these bounds on synthetic data. Our results show that the error rapidly decreases for realistic numbers of participating clients, demonstrating that, contrary to what prior work suggested, protecting the privacy of protected attributes is not necessarily in conflict with identifying disparities in the performance of federated models.
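    As a simpler point of reference (not the paper's mechanism, which is optimized to preserve the group-performance correlation under a privacy budget), a generic two-bit randomized-response scheme already lets one recover a disparity from privatized bits by inverting the privacy channel; all numbers below are invented:

    ```python
    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    eps = 2.0
    p = np.exp(eps) / (np.exp(eps) + 1)          # per-bit keep-probability

    n = 200_000
    group = rng.integers(0, 2, n)                # protected attribute
    acc = np.where(group == 0, 0.9, 0.7)         # true per-group model accuracy
    correct = (rng.random(n) < acc).astype(int)

    flip = lambda b: np.where(rng.random(n) < p, b, 1 - b)   # randomized response
    g_priv, c_priv = flip(group), flip(correct)

    # Invert the 2-bit randomized-response channel to debias cell frequencies.
    cells = list(product([0, 1], repeat=2))
    M = np.array([[p ** ((o[0] == t[0]) + (o[1] == t[1])) *
                   (1 - p) ** ((o[0] != t[0]) + (o[1] != t[1]))
                   for t in cells] for o in cells])
    obs = np.array([np.mean((g_priv == o[0]) & (c_priv == o[1])) for o in cells])
    pi = np.linalg.solve(M, obs)                 # estimated true cell probabilities

    acc0 = pi[1] / (pi[0] + pi[1])               # P(correct | group 0)
    acc1 = pi[3] / (pi[2] + pi[3])               # P(correct | group 1)
    print(f"estimated disparity: {acc0 - acc1:.3f} (true: 0.200)")
    ```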
    Task-Adaptive Few-shot Node Classification. (arXiv:2206.11972v1 [cs.LG])
    Node classification is of great importance among various graph mining tasks. In practice, real-world graphs generally follow the long-tail distribution, where a large number of classes only consist of limited labeled nodes. Although Graph Neural Networks (GNNs) have achieved significant improvements in node classification, their performance decreases substantially in such a few-shot scenario. The main reason can be attributed to the vast generalization gap between meta-training and meta-test due to the task variance caused by different node/class distributions in meta-tasks (i.e., node-level and class-level variance). Therefore, to effectively alleviate the impact of task variance, we propose a task-adaptive node classification framework under the few-shot learning setting. Specifically, we first accumulate meta-knowledge across classes with abundant labeled nodes. Then we transfer such knowledge to the classes with limited labeled nodes via our proposed task-adaptive modules. In particular, to accommodate the different node/class distributions among meta-tasks, we propose three essential modules to perform \emph{node-level}, \emph{class-level}, and \emph{task-level} adaptations in each meta-task, respectively. In this way, our framework can conduct adaptations to different meta-tasks and thus advance the model generalization performance on meta-test tasks. Extensive experiments on four prevalent node classification datasets demonstrate the superiority of our framework over the state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/TENT.
    Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. (arXiv:2206.11990v1 [cs.LG])
    3D-related inductive biases like translational invariance and rotational equivariance are indispensable to graph neural networks operating on 3D atomistic graphs such as molecules. Inspired by the success of Transformers in various domains, we study how to incorporate these inductive biases into Transformers. In this paper, we present Equiformer, a graph neural network leveraging the strength of Transformer architectures and incorporating $SE(3)/E(3)$-equivariant features based on irreducible representations (irreps). Irreps features encode equivariant information in channel dimensions without complicating graph structures. The simplicity enables us to directly incorporate them by replacing original operations with equivariant counterparts. Moreover, to better adapt Transformers to 3D graphs, we propose a novel equivariant graph attention, which considers both content and geometric information such as relative position contained in irreps features. To improve expressivity of the attention, we replace dot product attention with multi-layer perceptron attention and include non-linear message passing. We benchmark Equiformer on two quantum properties prediction datasets, QM9 and OC20. For QM9, among models trained with the same data partition, Equiformer achieves best results on 11 out of 12 regression tasks. For OC20, under the setting of training with IS2RE data and optionally IS2RS data, Equiformer improves upon state-of-the-art models. Code reproducing all main results will be available soon.
  • Open

    Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. (arXiv:2106.04156v7 [cs.LG] UPDATED)
    Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together while keeping negative pairs far apart. Despite the empirical successes, theoretical foundations are limited -- prior analyses assume conditional independence of the positive pairs given the same class label, but recent empirical applications use heavily correlated positive pairs (i.e., data augmentations of the same image). Our work analyzes contrastive learning without assuming conditional independence of positive pairs using a novel concept of the augmentation graph on data. Edges in this graph connect augmentations of the same data, and ground-truth classes naturally form connected sub-graphs. We propose a loss that performs spectral decomposition on the population augmentation graph and can be succinctly written as a contrastive learning objective on neural net representations. Minimizing this objective leads to features with provable accuracy guarantees under linear probe evaluation. By standard generalization bounds, these accuracy guarantees also hold when minimizing the training contrastive loss. Empirically, the features learned by our objective can match or outperform several strong baselines on benchmark vision datasets. In all, this work provides the first provable analysis for contrastive learning where guarantees for linear probe evaluation can apply to realistic empirical settings.
    RARTS: An Efficient First-Order Relaxed Architecture Search Method. (arXiv:2008.03901v2 [cs.LG] UPDATED)
    Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. Despite its success in many architecture search tasks, there are still some concerns about the accuracy of first-order DARTS and the efficiency of second-order DARTS. In this paper, we formulate a single-level alternative and a relaxed architecture search (RARTS) method that utilizes the whole dataset in architecture learning via both data and network splitting, without involving mixed second derivatives of the corresponding loss functions as DARTS does. In our formulation of network splitting, two networks with different but related weights cooperate in the search for a shared architecture. The advantage of RARTS over DARTS is justified by a convergence theorem and an analytically solvable model. Moreover, RARTS outperforms DARTS and its variants in accuracy and search efficiency, as shown by extensive experimental results. For the task of searching the topological architecture, i.e., the edges and the operations, RARTS obtains higher accuracy and a 60\% reduction in computational cost compared with second-order DARTS on CIFAR-10. RARTS continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS even though our innovation is purely in the training algorithm, without modifying the search space. For the task of searching the width, i.e., the number of channels in convolutional layers, RARTS also outperforms traditional network pruning benchmarks. Further experiments on public architecture search benchmarks like NATS-Bench also support the preeminence of RARTS.  ( 3 min )
    Multi-Frequency Joint Community Detection and Phase Synchronization. (arXiv:2206.12276v1 [cs.SI])
    This paper studies the joint community detection and phase synchronization problem on the \textit{stochastic block model with relative phase}, where each node is associated with a phase. This problem, with a variety of real-world applications, aims to recover community memberships and associated phases simultaneously. By studying the maximum likelihood estimation formulation, we show that this problem exhibits a \textit{``multi-frequency''} structure. To this end, two simple yet efficient algorithms that leverage information across multiple frequencies are proposed. The former is a spectral method based on the novel multi-frequency column-pivoted QR factorization, and the latter is an iterative multi-frequency generalized power method. Numerical experiments indicate our proposed algorithms outperform state-of-the-art algorithms, in recovering community memberships and associated phases.  ( 2 min )
    Simplified and Unified Analysis of Various Learning Problems by Reduction to Multiple-Instance Learning. (arXiv:1911.05999v4 [cs.LG] UPDATED)
    In statistical learning, many problem formulations have been proposed so far, such as multi-class learning, complementarily labeled learning, multi-label learning, and multi-task learning, which provide theoretical models for various real-world tasks. Although they have been extensively studied, the relationships among them have not been fully investigated. In this work, we focus on a particular problem formulation called Multiple-Instance Learning (MIL), and show that various learning problems, including all the problems mentioned above along with some new problems, can be reduced to MIL with theoretically guaranteed generalization bounds, where the reductions are established under a new reduction scheme we provide as a by-product. The results imply that the MIL-reduction gives a simplified and unified framework for designing and analyzing algorithms for various learning problems. Moreover, we show that the MIL-reduction framework can be kernelized.  ( 2 min )
    A Unified Statistical Learning Model for Rankings and Scores with Application to Grant Panel Review. (arXiv:2201.02539v2 [stat.ME] UPDATED)
    Rankings and scores are two common data types used by judges to express preferences and/or perceptions of quality in a collection of objects. Numerous models exist to study data of each type separately, but no unified statistical model captures both data types simultaneously without first performing data conversion. We propose the Mallows-Binomial model to close this gap, which combines a Mallows' $\phi$ ranking model with Binomial score models through shared parameters that quantify object quality, a consensus ranking, and the level of consensus between judges. We propose an efficient tree-search algorithm to calculate the exact MLE of model parameters, study statistical properties of the model both analytically and through simulation, and apply our model to real data from an instance of grant panel review that collected both scores and partial rankings. Furthermore, we demonstrate how model outputs can be used to rank objects with confidence. The proposed model is shown to sensibly combine information from both scores and rankings to quantify object quality and measure consensus with appropriate levels of statistical uncertainty.  ( 2 min )
    Unified field theoretical approach to deep and recurrent neuronal networks. (arXiv:2112.05589v3 [cond-mat.dis-nn] UPDATED)
    Understanding capabilities and limitations of different network architectures is of fundamental importance to machine learning. Bayesian inference on Gaussian processes has proven to be a viable approach for studying recurrent and deep networks in the limit of infinite layer width, $n\to\infty$. Here we present a unified and systematic derivation of the mean-field theory for both architectures that starts from first principles by employing established methods from statistical physics of disordered systems. The theory elucidates that while the mean-field equations are different with regard to their temporal structure, they yet yield identical Gaussian kernels when readouts are taken at a single time point or layer, respectively. Bayesian inference applied to classification then predicts identical performance and capabilities for the two architectures. Numerically, we find that convergence towards the mean-field theory is typically slower for recurrent networks than for deep networks and the convergence speed depends non-trivially on the parameters of the weight prior as well as the depth or number of time steps, respectively. Our method exposes that Gaussian processes are but the lowest order of a systematic expansion in $1/n$ and we compute next-to-leading-order corrections which turn out to be architecture-specific. The formalism thus paves the way to investigate the fundamental differences between recurrent and deep architectures at finite widths $n$.  ( 3 min )
    Inductive Biases and Variable Creation in Self-Attention Mechanisms. (arXiv:2110.10090v2 [cs.LG] UPDATED)
    Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules. Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer networks "create sparse variables": a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length. To support our analysis, we present synthetic experiments to probe the sample complexity of learning sparse Boolean functions with Transformers.  ( 2 min )
    Quantifying Inherent Randomness in Machine Learning Algorithms. (arXiv:2206.12353v1 [stat.ML])
    Most machine learning (ML) algorithms have several stochastic elements, and their performances are affected by these sources of randomness. This paper uses an empirical study to systematically examine the effects of two sources: randomness in model training and randomness in the partitioning of a dataset into training and test subsets. We quantify and compare the magnitude of the variation in predictive performance for the following ML algorithms: Random Forests (RFs), Gradient Boosting Machines (GBMs), and Feedforward Neural Networks (FFNNs). Among the different algorithms, randomness in model training causes larger variation for FFNNs compared to tree-based methods. This is to be expected as FFNNs have more stochastic elements that are part of their model initialization and training. We also found that random splitting of datasets leads to higher variation compared to the inherent randomness from model training. The variation from data splitting can be a major issue if the original dataset has considerable heterogeneity. Keywords: Model Training, Reproducibility, Variation  ( 2 min )
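    The two randomness sources are easy to isolate in a small experiment; the sketch below (on an arbitrary synthetic dataset, not the paper's) varies one seed at a time:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)

    def run(split_seed, model_seed):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                              random_state=split_seed)
        clf = RandomForestClassifier(random_state=model_seed).fit(Xtr, ytr)
        return clf.score(Xte, yte)

    # Vary one source of randomness at a time, holding the other fixed.
    train_std = np.std([run(0, s) for s in range(10)])
    split_std = np.std([run(s, 0) for s in range(10)])
    print(f"model-training std: {train_std:.4f}  data-splitting std: {split_std:.4f}")
    ```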
    Generalizing to New Physical Systems via Context-Informed Dynamics Model. (arXiv:2202.01889v3 [cs.LG] UPDATED)
    Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated to a different dynamic, and learns to condition the dynamics model on contextual parameters, specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. We theoretically motivate our approach and show state-of-the-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision. Code is available at https://github.com/yuan-yin/CoDA .  ( 2 min )
    Affinity-Aware Graph Networks. (arXiv:2206.11941v1 [cs.LG])
    Graph Neural Networks (GNNs) have emerged as a powerful technique for learning on relational data. Owing to the relatively limited number of message passing steps they perform -- and hence a smaller receptive field -- there has been significant interest in improving their expressivity by incorporating structural aspects of the underlying graph. In this paper, we explore the use of affinity measures as features in graph neural networks, in particular measures arising from random walks, including effective resistance, hitting and commute times. We propose message passing networks based on these features and evaluate their performance on a variety of node and graph property prediction tasks. Our architecture has lower computational complexity, while our features are invariant to the permutations of the underlying graph. The measures we compute allow the network to exploit the connectivity properties of the graph, thereby allowing us to outperform relevant benchmarks for a wide variety of tasks, often with significantly fewer message passing steps. On one of the largest publicly available graph regression datasets, OGB-LSC-PCQM4Mv1, we obtain the best known single-model validation MAE at the time of writing.  ( 2 min )
    Graph-Coupled Oscillator Networks. (arXiv:2202.02296v2 [cs.LG] UPDATED)
    We propose Graph-Coupled Oscillator Networks (GraphCON), a novel framework for deep learning on graphs. It is based on discretizations of a second-order system of ordinary differential equations (ODEs), which model a network of nonlinear controlled and damped oscillators, coupled via the adjacency structure of the underlying graph. The flexibility of our framework permits any basic GNN layer (e.g. convolutional or attentional) as the coupling function, from which a multi-layer deep neural network is built up via the dynamics of the proposed ODEs. We relate the oversmoothing problem, commonly encountered in GNNs, to the stability of steady states of the underlying ODE and show that zero-Dirichlet energy steady states are not stable for our proposed ODEs. This demonstrates that the proposed framework mitigates the oversmoothing problem. Moreover, we prove that GraphCON mitigates the exploding and vanishing gradients problem to facilitate training of deep multi-layer GNNs. Finally, we show that our approach offers competitive performance with respect to the state-of-the-art on a variety of graph-based learning tasks.  ( 2 min )
    On the Limitations of Elo: Real-World Games are Transitive, not Additive. (arXiv:2206.12301v1 [cs.GT])
    Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these games are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.  ( 2 min )
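    For reference, the standard Elo update and a toy cyclic (rock-paper-scissors) matchup illustrate the failure mode the paper targets; this is textbook Elo, not the authors' disc ranking system.

    ```python
    def elo_update(r_a, r_b, score_a, k=32.0):
        """One Elo step: score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        delta = k * (score_a - expected_a)
        return r_a + delta, r_b - delta

    # In a purely cyclic game every player beats one opponent and loses to
    # another, so Elo ratings churn around the mean without ever extracting
    # a meaningful (transitive) ordering.
    r = {"rock": 1500.0, "paper": 1500.0, "scissors": 1500.0}
    for _ in range(100):
        for w, l in [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]:
            r[w], r[l] = elo_update(r[w], r[l], 1.0)
    print({k: round(v) for k, v in r.items()})   # all hover near 1500
    ```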
    The MELODIC family for simultaneous binary logistic regression in a reduced space. (arXiv:2102.08232v2 [stat.ME] UPDATED)
    Logistic regression is a commonly used method for binary classification. Researchers often have more than a single binary response variable and simultaneous analysis is beneficial because it provides insight into the dependencies among response variables as well as between the predictor variables and the responses. Moreover, in such a simultaneous analysis the equations can lend each other strength, which might increase predictive accuracy. In this paper, we propose the MELODIC family for simultaneous binary logistic regression modeling. In this family, the regression models are defined in a Euclidean space of reduced dimension, based on a distance rule. The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot. We discuss a fast iterative majorization (or MM) algorithm for parameter estimation. Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders. We present a thorough comparison of our MELODIC family with alternative approaches for multivariate binary data.  ( 2 min )
    Accelerated Information Gradient flow. (arXiv:1909.02102v3 [math.OC] UPDATED)
    We present a framework for Nesterov's accelerated gradient flows in probability space to design efficient mean-field Markov chain Monte Carlo (MCMC) algorithms for Bayesian inverse problems. Here four examples of information metrics are considered, including Fisher-Rao metric, Wasserstein-2 metric, Kalman-Wasserstein metric and Stein metric. For both Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of accelerated gradient flows. In implementations, we propose a sampling-efficient discrete-time algorithm for Wasserstein-2, Kalman-Wasserstein and Stein accelerated gradient flows with a restart technique. We also formulate a kernel bandwidth selection method, which learns the gradient of logarithm of density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and Bayesian neural network, show the strength of the proposed methods compared with state-of-the-art algorithms.  ( 2 min )
    From Tensor Network Quantum States to Tensorial Recurrent Neural Networks. (arXiv:2206.12363v1 [quant-ph])
    We show that any matrix product state (MPS) can be exactly represented by a recurrent neural network (RNN) with a linear memory update. We generalize this RNN architecture to 2D lattices using a multilinear memory update. It supports perfect sampling and wave function evaluation in polynomial time, and can represent an area law of entanglement entropy. Numerical evidence shows that it can encode the wave function using a bond dimension lower by orders of magnitude when compared to MPS, with an accuracy that can be systematically improved by increasing the bond dimension.  ( 2 min )
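    The MPS-as-linear-RNN correspondence can be sketched in a few lines of NumPy; the bond dimension, boundary vectors and bit string are arbitrary illustrations.

    ```python
    import numpy as np

    # A matrix product state assigns an amplitude to a bit string via a product
    # of matrices -- exactly a linear-memory RNN unrolled over the sites.
    rng = np.random.default_rng(1)
    chi, sites = 4, 6                            # bond dimension, chain length
    A = rng.normal(size=(2, chi, chi)) / np.sqrt(chi)  # one matrix per local state
    left = rng.normal(size=chi)

    def amplitude(bits):
        h = left.copy()                          # the RNN's hidden state
        for b in bits:
            h = A[b] @ h                         # linear memory update
        return h.sum()                           # contract an all-ones right boundary

    print(amplitude(rng.integers(0, 2, sites)))
    ```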
    Deep learning algorithms for solving high dimensional nonlinear backward stochastic differential equations. (arXiv:2010.01319v3 [math.NA] UPDATED)
    In this work, we propose a new deep learning-based scheme for solving high-dimensional nonlinear backward stochastic differential equations (BSDEs). The idea is to reformulate the problem as a global optimization in which the local loss functions are included. Essentially, we approximate the unknown solution of a BSDE using a deep neural network and its gradient with automatic differentiation. The approximations are performed by globally minimizing the quadratic local loss function defined at each time step, which always includes the terminal condition. Such loss functions are obtained by iterating the Euler discretization of the time integrals with the terminal condition. Our formulation prompts the stochastic gradient descent algorithm not only to take the accuracy at each time layer into account, but also to converge to a good local minimum. To demonstrate the performance of our algorithm, we provide several high-dimensional nonlinear BSDE examples, including pricing problems in finance.  ( 2 min )
    Animal Behavior Classification via Deep Learning on Embedded Systems. (arXiv:2111.12295v2 [cs.LG] UPDATED)
    We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders, including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.  ( 2 min )
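    A much-simplified sketch of the filter-bank-plus-MLP idea, using SciPy IIR filters on synthetic accelerometry (not the paper's architecture or data):

    ```python
    import numpy as np
    from scipy.signal import butter, lfilter
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    acc = rng.normal(size=(200, 3, 256))         # fake 3-axis accelerometry windows
    labels = rng.integers(0, 2, size=200)        # fake behaviour labels

    # Filter-bank front end: low-pass and high-pass IIR filters per axis,
    # summarized by simple statistics, then a small MLP classifier.
    b_lo, a_lo = butter(2, 0.1)                  # 2nd-order low-pass
    b_hi, a_hi = butter(2, 0.5, btype="high")    # 2nd-order high-pass

    def features(x):
        lo, hi = lfilter(b_lo, a_lo, x), lfilter(b_hi, a_hi, x)
        return np.concatenate([lo.mean(-1), lo.std(-1), hi.mean(-1), hi.std(-1)])

    F = np.stack([features(w) for w in acc])
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(F, labels)
    print("train accuracy:", clf.score(F, labels))
    ```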
    Learning to Predict Graphs with Fused Gromov-Wasserstein Barycenters. (arXiv:2202.03813v3 [stat.ML] UPDATED)
    This paper introduces a novel and generic framework to solve the flagship task of supervised labeled graph prediction by leveraging Optimal Transport tools. We formulate the problem as regression with the Fused Gromov-Wasserstein (FGW) loss and propose a predictive model relying on a FGW barycenter whose weights depend on inputs. First we introduce a non-parametric estimator based on kernel ridge regression for which theoretical results such as consistency and excess risk bound are proved. Next we propose an interpretable parametric model where the barycenter weights are modeled with a neural network and the graphs on which the FGW barycenter is calculated are additionally learned. Numerical experiments show the strength of the method and its ability to interpolate in the labeled graph space on simulated data and on a difficult metabolic identification problem where it can reach very good performance with very little engineering.  ( 2 min )
    Empirical and Instance-Dependent Estimation of Markov Chain and Mixing Time. (arXiv:1912.06845v3 [math.PR] UPDATED)
    We tackle the problem of estimating the mixing time of a Markov chain from a single trajectory of observations. In contrast with previous works which considered Hilbert space methods to estimate spectral gaps, we opt for an approach based on contraction with respect to total variation. Specifically, we define and estimate a generalized version of Dobrushin's contraction coefficient. We show that this quantity -- unlike the spectral gap -- controls the mixing time up to strong universal constants and remains valid for non-reversible chains. We design fully data-dependent confidence intervals around the coefficient, which are both easier to compute and thinner than their spectral counterparts. Furthermore, we initiate a beyond-worst-case analysis by showing how to leverage additional information about the transition matrix in order to obtain instance-dependent rates for its estimation with respect to the induced uniform norm, as well as some of its mixing properties.  ( 2 min )
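    For the classical (non-generalized) Dobrushin coefficient, both the quantity and the mixing-time bound it implies are easy to compute from a known transition matrix; the paper's contribution -- estimating such a quantity from a single trajectory -- is not attempted here.

    ```python
    import numpy as np

    def dobrushin_coefficient(P):
        """Largest total-variation distance between any two rows of P."""
        n = P.shape[0]
        return max(0.5 * np.abs(P[i] - P[j]).sum()
                   for i in range(n) for j in range(i + 1, n))

    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    delta = dobrushin_coefficient(P)             # 0.7 for this chain
    # TV distance to stationarity contracts by delta per step, so about
    # log(1/eps) / log(1/delta) steps suffice for eps-mixing.
    eps = 0.01
    print(delta, int(np.ceil(np.log(1 / eps) / np.log(1 / delta))))
    ```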
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v1 [stat.ML])
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.  ( 2 min )
    Aggregated Multi-output Gaussian Processes with Knowledge Transfer Across Domains. (arXiv:2206.12141v1 [stat.ML])
    Aggregate data often appear in various fields such as socio-economics and public security. The aggregate data are associated not with points but with supports (e.g., spatial regions in a city). Since the supports may have various granularities depending on attributes (e.g., poverty rate and crime rate), modeling such data is not straightforward. This article offers a multi-output Gaussian process (MoGP) model that infers functions for attributes using multiple aggregate datasets of respective granularities. In the proposed model, the function for each attribute is assumed to be a dependent GP modeled as a linear mixing of independent latent GPs. We design an observation model with an aggregation process for each attribute; the process is an integral of the GP over the corresponding support. We also introduce a prior distribution of the mixing weights, which allows a knowledge transfer across domains (e.g., cities) by sharing the prior. This is advantageous in such a situation where the spatially aggregated dataset in a city is too coarse to interpolate; the proposed model can still make accurate predictions of attributes by utilizing aggregate datasets in other cities. The inference of the proposed model is based on variational Bayes, which enables one to learn the model parameters using the aggregate datasets from multiple domains. The experiments demonstrate that the proposed model outperforms in the task of refining coarse-grained aggregate data on real-world datasets: Time series of air pollutants in Beijing and various kinds of spatial datasets from New York City and Chicago.  ( 3 min )
    Regret Bounds for Noise-Free Kernel-Based Bandits. (arXiv:2002.05096v2 [stat.ML] UPDATED)
    Kernel-based bandit is an extensively studied black-box optimization problem, in which the objective function is assumed to live in a known reproducing kernel Hilbert space. While nearly optimal regret bounds (up to logarithmic factors) are established in the noisy setting, surprisingly, less is known about the noise-free setting (when the exact values of the underlying function are accessible without observation noise). We discuss several upper bounds on regret, none of which seem order optimal, and provide a conjecture on the order-optimal regret bound.  ( 2 min )
    Deep Stable neural networks: large-width asymptotics and convergence rates. (arXiv:2108.02316v2 [cs.LG] UPDATED)
    In modern deep learning, there is a recent and growing literature on the interplay between large-width asymptotic properties of deep Gaussian neural networks (NNs), i.e. deep NNs with Gaussian-distributed weights, and Gaussian stochastic processes (SPs). Such an interplay has proved to be critical in Bayesian inference under Gaussian SP priors, kernel regression for infinitely wide deep NNs trained via gradient descent, and information propagation within infinitely wide NNs. Motivated by empirical analyses that show the potential of replacing Gaussian distributions with Stable distributions for the NN's weights, in this paper we present a rigorous analysis of the large-width asymptotic behaviour of (fully connected) feed-forward deep Stable NNs, i.e. deep NNs with Stable-distributed weights. We show that as the width goes to infinity jointly over the NN's layers, i.e. the ``joint growth" setting, a rescaled deep Stable NN converges weakly to a Stable SP whose distribution is characterized recursively through the NN's layers. Because of the non-triangular structure of the NN, this is a non-standard asymptotic problem, to which we propose an inductive approach of independent interest. Then, we establish sup-norm convergence rates of the rescaled deep Stable NN to the Stable SP, under the ``joint growth" and a ``sequential growth" of the width over the NN's layers. Such a result provides the difference between the ``joint growth" and the ``sequential growth" settings, showing that the former leads to a slower rate than the latter, depending on the depth of the layer and the number of inputs of the NN. Our work extends some recent results on infinitely wide limits for deep Gaussian NNs to the more general deep Stable NNs, providing the first result on convergence rates in the ``joint growth" setting.  ( 3 min )
    On making optimal transport robust to all outliers. (arXiv:2206.11988v1 [stat.ML])
    Optimal transport (OT) is known to be sensitive to outliers because of its marginal constraints. Outlier-robust OT variants have been proposed based on the definition that outliers are samples which are expensive to move. In this paper, we show that this definition is too restrictive by considering the case where outliers are closer to the target measure than clean samples. We show that outlier-robust OT fully transports these outliers, leading to poor performance in practice. To tackle these outliers, we propose to detect them by relying on a classifier trained with adversarial training to classify source and target samples. A sample is then considered an outlier if the prediction from the classifier differs from its assigned label. To decrease the influence of these outliers in the transport problem, we propose to either remove them from the problem or increase the cost of moving them by using the classifier prediction. We show that we successfully detect these outliers and that they do not influence the transport problem in several experiments, such as gradient flows, generative models and label propagation.  ( 2 min )
    Approximating 1-Wasserstein Distance with Trees. (arXiv:2206.12116v1 [stat.ML])
    Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing (NLP) and computer vision (CV) applications. One of the challenges in estimating Wasserstein distance is that it is computationally expensive and does not scale well for many distribution comparison tasks. In this paper, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where TWD is a 1-Wasserstein distance with tree-based embedding and can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach to learning the weights of the edges in a tree. To this end, we first show that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and can be formulated as a Lasso-based regression problem. Owing to the convex formulation, we can obtain a globally optimal solution efficiently. Moreover, we propose a tree-sliced variant of these methods. Through experiments, we demonstrated that the weighted TWD can accurately approximate the original 1-Wasserstein distance.  ( 2 min )
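    For intuition, here is a toy sketch of the Lasso formulation described above, under the assumption that a candidate tree is already fixed (encoded as root-to-leaf edge-indicator vectors) and the 1-Wasserstein targets are precomputed; all names and data below are made up for illustration:

        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(0)
        n_edges, n_leaves = 50, 20
        # path[i]: 0/1 indicator of the edges on the root-to-leaf path of leaf i
        path = rng.integers(0, 2, size=(n_leaves, n_edges))

        # On a tree, the leaf-to-leaf path is the symmetric difference of the two
        # root-to-leaf paths, so the tree distance is linear in the edge weights w:
        # d(i, j) = w . |path[i] - path[j]|
        pairs = [(i, j) for i in range(n_leaves) for j in range(i + 1, n_leaves)]
        A = np.array([np.abs(path[i] - path[j]) for i, j in pairs])
        w1 = rng.random(len(pairs))  # stand-in for precomputed 1-Wasserstein distances

        w = Lasso(alpha=1e-3, positive=True).fit(A, w1).coef_  # sparse edge weights

    The L1 penalty is what prunes uninformative edges, which is presumably how the paper obtains a compact tree metric from a convex problem.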

  • Open

    [D] Paper Explained - Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (Video Analysis)
    https://youtu.be/oz5yZc9ULAc Minecraft is one of the harder challenges any RL agent could face. Episodes are long, and the world is procedurally generated, complex, and huge. Further, the action space is a keyboard and a mouse, which has to be operated only given the game's video input. OpenAI tackles this challenge using Video PreTraining, leveraging a small set of contractor data in order to pseudo-label a giant corpus of scraped footage of gameplay. The pre-trained model is highly capable in basic game mechanics and can be fine-tuned much better than a blank slate model. This is the first Minecraft agent that achieves the elusive goal of crafting a diamond pickaxe all by itself.

    OUTLINE:
    0:00 - Intro
    3:50 - How to spend money most effectively?
    8:20 - Getting a large dataset with labels
    14:40 - Model architecture
    19:20 - Experimental results and fine-tuning
    25:40 - Reinforcement Learning to the Diamond Pickaxe
    30:00 - Final comments and hardware

    Blog: https://openai.com/blog/vpt/ Paper: https://arxiv.org/abs/2206.11795 Code & Model weights: https://github.com/openai/Video-Pre-Training submitted by /u/ykilcher [link] [comments]  ( 85 min )
    [D] How to avoid code copyright violations with GitHub Copilot?
    At our workplace, many of our ML researchers are starting to use GitHub Copilot to save time. The issue is that there is no provenance on the code generated by Copilot. If I understand correctly, Copilot is trained on public GitHub repositories, many of which might have specific copyright and license clauses. Our research, when published, would also put the code on GitHub publicly. What would you suggest to prevent potential code copyright violations in this case? I have sent a request for GitHub to provide a provenance-tracking feature, but I assume that's going to take a while to implement (that is, if they decide to implement it). Are you using GitHub Copilot and worrying about similar issues? submitted by /u/leboulevardier [link] [comments]  ( 85 min )
    [D] Will this mode work for practicing paper reviews? Can we get in-depth feedback on our draft?
    Some opinions were collected about mock ML paper reviews. Link to the thread: https://www.reddit.com/r/MachineLearning/comments/u967sy/d_opinions_needed_anyone_interested_in_mock_peer/?utm_source=share&utm_medium=web2x&context=3 To summarize, many people are interested. The common opinions are:

    - People prefer private review to public review
    - The number of papers to review is not a concern, but every couple of months would be a good pace
    - Plagiarism and stealing are, of course, the biggest concern

    To address this, I suggest the following mode:

    - ONLY open to people who want to exchange paper reviews. Enthusiastic reviewers with no paper draft to be reviewed can wait.
    - ONLY open to people who are really interested in a mock paper review prior to formal journal/conference submission.
    - Join a Discord community (already established). In the PRIVATE "Introduce yourself" channel, people introduce themselves using true information and offer a very brief paper abstract and ML category.
    - Chat openly or privately to find the right review partners.
    - In the "paper-review-exchange" channel, announce your paper reviewer upon agreement (from both sides).
    - Exchange your drafts privately, preferably via official email addresses.
    - (Optional) When the review work is done, announce that too.

    Note that plagiarism and stealing can be minimized in this mode but could still happen. When conference reviews do not offer much nowadays, a mock review might give you more honest input. Good luck! submitted by /u/DouBlindDotCOM [link] [comments]  ( 85 min )
    [R] Can explainability improve model accuracy?
    https://preview.redd.it/okh7r16770891.jpg?width=1200&format=pjpg&auto=webp&s=9f0fe7605453a945682d27eab65d866dce3f126c Black-box deep learning models are mostly uninterpretable and far too complex. • One strategy is to learn the nonlinear relations among input features. However, there are so many features to learn from. https://preview.redd.it/muotby5s70891.png?width=782&format=png&auto=webp&s=1cbc3dece747d061e3ab96dea8b309c3fae5b8ce • Research shows that a set of important features can improve the learning process. Therefore, we can focus on the most correlated features. • Paper📜: https://arxiv.org/abs/2203.04383 submitted by /u/AshkanF [link] [comments]  ( 84 min )
    [D] Clarification question related to prompting
    What is the difference between prompt engineering and prompt learning? I recently heard a talk where the presenter said that ‘we freeze the parameters of the model and only do prompt learning’. To me that seems more like engineering than learning. submitted by /u/QadriShyaari [link] [comments]  ( 83 min )
    [P] A drawing application called Vizcom that uses GANs to help automate color, shading, and rendering.
    submitted by /u/AquaHug [link] [comments]  ( 85 min )
    [R] [D] How can one rigorously and efficiently deal with binary classification problems on multi-label data?
    To be clearer, I'd like to start learning about techniques or literature for this particular type of binary classification problem. Please share if you happen to know about this (keywords, links, articles, etc. are all appreciated). So, the problem is supervised binary classification. In general, there is nothing special about the dataset apart from the fact that the train/val data from one of the 2 label classes (from now on, let's say it's the negative class) are already further labeled into multiple subclasses. From there, the problem has an additional goal (other than binary classification): to maximize the number of subclasses that are classified well by the model. By "classified well", I mean that, for example, if one restricts the negative side of the dataset to one such subclass, the performance of the model is higher than some close-to-perfect threshold. Furthermore, there might be complications in both directions: there might be subclasses that are easy for the model to classify, and there might be subclasses that are impossible for the model to classify (e.g. the XOR problem with linear classifiers). The key here is that, in the end, at test time, one should only use one "small" (relatively, of course) "model" (a combination of shallow neural nets is OK too) to classify all testing data. Additionally, I'm open to learning about approaches beyond the supervised paradigm. submitted by /u/anvinhnd [link] [comments]  ( 85 min )
    [Discussion] Doubt regarding text vector difference image manipulation method of Dalle-2.
    I was going through the (updated) paper, and there is an image manipulation method based on text differences. It goes like this:

    z_i := original image CLIP embedding
    z_t := new text CLIP embedding, i.e. the embedding of the text for the current image manipulation
    z_t0 := the original image's corresponding text CLIP embedding (the text embedding of the text 'a photo', or an empty embedding)
    z_d := l2_norm(z_t - z_t0), the text difference vector; here l2_norm means normalising a vector by dividing it by its L2 norm
    z_new / z_theta := spherical_interpolation(z_i, z_d, theta), where theta is in (0, 0.5); this is the new image's CLIP embedding vector

    What I don't understand is that the CLIP image and text embedding vectors are supposed to be similar (since they are trained with cosine similarity), and the difference between the text embeddings of two similar texts will be somewhat perpendicular to either of the text vectors; therefore the text diff vector should be very different from the image embedding, and hence the spherical interpolation shouldn't give any meaningful result. What am I missing? I am unable to understand why this text difference method works. submitted by /u/OddSandwich969 [link] [comments]  ( 85 min )
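    As a side note, the mechanics of the recipe above (not the why) can be written down in a few lines of NumPy; all shapes and values below are placeholders, since the real embeddings come from a CLIP encoder:

        import numpy as np

        def l2_norm(v):
            return v / np.linalg.norm(v)

        def slerp(a, b, t):
            # Spherical interpolation between unit vectors a and b.
            omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
            if np.isclose(omega, 0.0):
                return a
            return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

        d = 512
        rng = np.random.default_rng(0)
        z_i, z_t, z_t0 = (l2_norm(rng.normal(size=d)) for _ in range(3))  # placeholders

        z_d = l2_norm(z_t - z_t0)       # text difference direction
        theta = 0.25                    # interpolation strength in (0, 0.5)
        z_new = slerp(z_i, z_d, theta)  # manipulated embedding, fed to the decoder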
    [R] How well do sparse ImageNet models transfer? Prune once and deploy anywhere for inference performance speedups! (arxiv link in comments)
    submitted by /u/markurtz [link] [comments]  ( 84 min )
    [D] Why do some competition organizers hide the leaderboard? (Regarding my experience in IEEE SP Cup 2022)
    I don't understand why the organisers of a competition would hide the leaderboard, especially in a machine learning and signal processing competition. We participated in IEEE SP Cup 2022, sacrificing nearly 2 months of our time and some sleepless nights. The organizers never said anything about keeping the leaderboard of the competition hidden. In the first round they gave us access to a website where we could submit our predictions and get our score privately. There wasn't a leaderboard (well, there was one that generated some random scores whenever we made a submission, but I don't understand the use of it). After the first round, we and several other teams requested the leaderboard. At first, the organizers said they couldn't reveal it because some teams would not like other teams seeing their position on the leaderboard (a weird reason, because all teams were given separate names to make submissions on the website and no team knew the names of others 🙄). After many teams replied that they would like to see the leaderboard and that there wouldn't be such a problem with them, the organizers asked us to create a poll in Piazza to see which teams would like to see the leaderboard, and said they would reveal the leaderboard after the competition was over. The funny thing is, there was no such option for students to create a poll in Piazza 😂. Even though we mentioned this to the organizers, we didn't get any reply. Now it has been almost a month since the competition concluded, and the organizers have totally ghosted us. This is really discouraging after spending several months on a competition without even getting to know how far our efforts have come. Why would organizers hide the leaderboard like this? Couldn't they at least reveal the top 10 teams? submitted by /u/TransitionWhich5018 [link] [comments]  ( 85 min )
    GMM latent space [D]
    Hi, I would love to know if there is any ongoing (or recent) work on mixtures of Gaussians as the latent space for GANs or other generative models. Does anyone have any experience with it and/or opinions on why it is not popular (or doesn't work)? submitted by /u/huehue9812 [link] [comments]  ( 84 min )
    [D] Derivation of path dependent attribution in Tree SHAP
    I was reading the TreeSHAP paper by Lundberg & Lee. There they propose that every path can be considered an individual model, and due to the additivity property of SHAP we can directly add the attributions for each path, which gives us the attribution for the tree. I can follow up to these points:

    - If a feature doesn't lie on the path, then that feature's attribution for that path is zero.
    - If a feature lies on the path and also lies on the path of x_f, then its attribution is positive.
    - If a feature lies on the path but doesn't lie on the path covered by x_f, then its attribution is negative.

    But I can't get my head around the quantification of these contributions, especially the weighting, i.e., POS = W(|Sp|-1, |Np|)*v; NEG = -W(|Sp|, |Np|)*v, where v is the leaf's update. I have many questions, but to begin with, can someone please help me understand how we get these attribution values? submitted by /u/Ok-Seesaw9702 [link] [comments]  ( 84 min )
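    For what it's worth, if W here is the standard Shapley coalition weight from the SHAP papers (an assumption on my part, but it matches the notation), it is a closed-form combinatorial factor:

        from math import factorial

        def W(s, n):
            # Probability, over a uniformly random ordering of the n path features,
            # that a fixed subset of size s precedes a given feature: s!(n-s-1)!/n!
            return factorial(s) * factorial(n - s - 1) / factorial(n)

        # A feature on both the path P and x_f's path contributes +W(|Sp| - 1, |Np|) * v;
        # a feature on P but off x_f's path contributes -W(|Sp|, |Np|) * v,
        # where v is the leaf's value.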
    [D] Sequence Modelling Technique
    Let's say we have a time series problem where we are trying to use past information to predict future inputs, like stock prices, or heart rates, or a language model that receives one word at a time. In theory, you would want each output at time t to contain the maximum amount of predictive information about label t+1. Now say you attach a second network to this RNN, which tries to predict hidden state t+1 from hidden state t, and add its error as an auxiliary loss. You could call it a "lookahead reconstruction loss". I believe this should make the RNN learn in a way that maximises the network's future understanding. Has anybody experimented with this technique, or read about implementations of it? I'd be interested in hearing opinions from fellow practitioners. submitted by /u/RodObr [link] [comments]  ( 84 min )
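    In case it helps the discussion, here is one minimal way the auxiliary loss could be wired up in PyTorch; all sizes and the 0.1 weighting are arbitrary placeholders:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        rnn = nn.GRU(input_size=16, hidden_size=64, batch_first=True)
        head = nn.Linear(64, 1)        # main prediction head
        lookahead = nn.Linear(64, 64)  # predicts h_{t+1} from h_t

        x = torch.randn(8, 100, 16)    # (batch, time, features), dummy data
        y = torch.randn(8, 100, 1)

        out, _ = rnn(x)                # (batch, time, hidden)
        main_loss = F.mse_loss(head(out), y)

        # Lookahead reconstruction loss: predict the next hidden state. Detaching
        # the target stops the RNN from cheating by collapsing its hidden states
        # to make them trivially predictable.
        aux_loss = F.mse_loss(lookahead(out[:, :-1]), out[:, 1:].detach())

        (main_loss + 0.1 * aux_loss).backward()

    Whether to detach the target is itself a design choice; without the detach, the objective can degenerate.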
    I made a robot that punishes me if it detects that if I am procrastinating on my assignments [P]
    submitted by /u/_ayushp_ [link] [comments]  ( 90 min )
    [R] CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
  • Open

    Does anyone know what AI text-to-voice «anicapped» uses on YouTube?
    Maybe you haven’t heard this; here is the voice: https://youtu.be/FAvcn_8OuMk It sounds really good. I'm wondering if anyone knows which AI is used for the text-to-voice? submitted by /u/Basic_Pay7859 [link] [comments]  ( 82 min )
    Deepfakes and investment fraud
    Found a fraud that is using deep fake photos to generate a “credible” website: https://nilssonhedge.com/2022/06/25/this-manager-does-not-exist-the-sequel/ submitted by /u/Interesting-Wing-829 [link] [comments]  ( 82 min )
    How does the data input work in a chat bot?
    I am new to AI and chatbots in general. I can't find any good explanations of how data mining/input works with chatbots. Do I need data from real people? Could I create a series of questions and answers and have the AI use that as a base to expand on? submitted by /u/linuxman1929 [link] [comments]  ( 82 min )
    AI Image Filling with OpenAI DALL-E 2
    submitted by /u/dulldata [link] [comments]  ( 82 min )
    In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    OpenAI's DALL-E 2 may now generate faces
    submitted by /u/henlo_there_fren [link] [comments]  ( 82 min )
    Just posted a huge update to my neural-net artificial life sim! Temperature tracking, scent system, skin patterns and more!
    submitted by /u/urocyon_dev [link] [comments]  ( 84 min )
    Instagram bot
    These days you see bots that always DM about someone's page or a weird link. How can I build a bot like that? I also want to spam people's DMs like that for a business. Hope it is not illegal lmao. submitted by /u/Ekonshy [link] [comments]  ( 82 min )
    AI Makes Strides in Virtual Worlds More Like Our Own | Quanta Magazine
    submitted by /u/nick7566 [link] [comments]  ( 82 min )
    What is pruning a deep neural network? After reading many papers, I've created a guide on github in an attempt to map the many pruning and sparsity techniques
    submitted by /u/IntelligentHat1657 [link] [comments]  ( 82 min )
    SANDCASTLES BONANZA | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    DALL-E mini is amazing / music by me
    submitted by /u/Shaftershafter [link] [comments]  ( 82 min )
    Community for AI Generated Mech/Robot Concept Art
    Hi - I've recently started a Discord community to generate free mech/robot concept art for the art & design community. We have a number of categorized sections that we are filling up with unusual and inspiring designs, and we plan to run weekly competitions based around novel themes. This is a non-profit initiative and not at all associated with the blockchain or NFTs. The intent is purely to inspire people in their own art projects and give people some building blocks to work from, through harnessing AI. We've got a number of users with access to Midjourney and Disco Diffusion, with invites periodically becoming available for active contributors to the Discord. Here's the Discord link and some example images for anyone who wants to join the project or just pop in, say hi and get inspired: https://discord.gg/WcR5YCmP https://twitter.com/AIMechCollect/status/1540767027815645184?t=uDIppoThajlVx9603tsnUw&s=19 Please delete this if this Reddit group doesn't allow this type of promotion. submitted by /u/Rabeeeto [link] [comments]  ( 83 min )
  • Open

    Rationale for updating Value Function multiple times with same observations in spinninup's VPG-GAE implementation
    Hi there, In OpenAI's spinningup VPG-GAE implementation, the authors update the value function V(s_t) multiple times at every epoch using the same batch of observations. Copying their code (line 237 onwards in the link above):

        def update():
            # Get loss and info values before update
            # ...
            # Train policy with a single step of gradient descent
            # ...
            # Value function learning
            for i in range(train_v_iters):          # <--- STARTING HERE
                vf_optimizer.zero_grad()
                loss_v = compute_loss_v(data)       # <--- data is unchanged
                loss_v.backward()
                mpi_avg_grads(ac.v)                 # average grads across MPI processes
                vf_optimizer.step()

    What's the rationale for doing so? My interpretation is that this is done to accelerate learning and that, presumably, this is more stable than using a higher learning rate on a single pass through the data. So:

    - What's the rationale (am I missing something)?
    - Is this common practice in policy optimisation models?
    - Why does the same rationale not apply to the policy updates?

    Thank you all for your help! submitted by /u/desperateEfforts1 [link] [comments]  ( 83 min )
    "The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models", Pan et al 2022 ("phase transitions: capability thresholds at which the agent's behavior qualitatively shifts")
    submitted by /u/gwern [link] [comments]  ( 83 min )
    "Deep Reinforcement Learning for Closed-Loop Blood Glucose Control", Fox et al 2020
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Are there any guides on writing technical ML papers?
    I read them in the hope of one day contributing, but they seem to vary a lot in practice. Some are overly nuanced and detract from the point, while others avoid jargon altogether, so I'm wondering if there are any guidelines. submitted by /u/XecutionStyle [link] [comments]  ( 83 min )
    "AI-Guided Robots Are Ready to Sort Your Recyclables"
    submitted by /u/gwern [link] [comments]  ( 84 min )
    Resources for off/on-policy RL
    Hello, I am trying to understand the math of off-policy and on-policy RL: specifically, what exactly allows the use of previous experiences in off-policy RL, and why that is not possible in on-policy RL. Are there any resources that could help with that? submitted by /u/AhmedNizam_ [link] [comments]  ( 83 min )
  • Open

    Transformations of Olympic rings
    The previous post gave the details of how Möbius transformations m(z) = (az + b)/(cz + d) transform circles. The image of a circle under a Möbius transformation is either a line or a circle, and in our examples the image will always be a line. We start with an approximation of the Olympic rings […] Transformations of Olympic rings first appeared on John D. Cook.  ( 4 min )
    Circles and lines under a Möbius transformation
    This post will revisit a previous post in more detail. I’ve written before about how Möbius transformations map circles and lines to circles and lines. In this post I’d like to explore how you’d calculate which circle or line a given circle or line goes to. Given an equation of a line or circle, what […] Circles and lines under a Möbius transformation first appeared on John D. Cook.  ( 6 min )

  • Open

    First line of AI Designed Graphic Tees - GraphicAI
    submitted by /u/cityofgoul [link] [comments]  ( 82 min )
    "Sunset" 🌅 - Created on Pixelz AI
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    THE ACCUSER | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Is there any way of using a text editor with Kaggle or Google Colab notebooks? [Discussion]
    submitted by /u/yapoinder [link] [comments]  ( 84 min )
    AI Advances Nuclear Fusion R&D | New Amazon Robot Proteus Automation | AI Outperforms Crypto Markets | Robotic Fireflies
    submitted by /u/getrich_or_diemining [link] [comments]  ( 82 min )
    Bullitt chase scene upscaled to 50 FPS Using DAIN-APP (free Artificial Intelligence software)
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    Generative AI resource
    I came across this course about generative AI / generative models and I find it quite interesting. I wanted to share this resource, since I'm struggling to find good and up-to-date material on GAI. https://www.udemy.com/course/generative-ai/?referralCode=6A16021D86142A4EAB93 submitted by /u/gggingerbean [link] [comments]  ( 82 min )
    Yandex open sources 100B GPT-like model
    submitted by /u/binaryfor [link] [comments]  ( 82 min )
    AI made art
    submitted by /u/Accomplished_Head5 [link] [comments]  ( 83 min )
    AI Dream 58 - AI EPIC Midjourney through Space
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    What is the best free AI voice synthesis program out there?
    What I'm looking for is a program that can take raw voice clips (~10 minutes of actual mp3 recordings) and create a synthesised voice from them. I ask because I want to make some custom voices of fictional characters, a bit like what 15.ai does. I've had experience working with AI programs, so something on GitHub is fine as well, as long as it's not too much of a pain to set up. submitted by /u/Cyberfunk3 [link] [comments]  ( 82 min )
    Yann LeCun has a bold new vision for the future of AI
    submitted by /u/nick7566 [link] [comments]  ( 84 min )
    How are these videos made?
    submitted by /u/niIbert [link] [comments]  ( 82 min )
    Are there any free AI story generators like inferkit for Android?
    submitted by /u/ScottABoutizis [link] [comments]  ( 83 min )
    have a nice trip!)
    submitted by /u/nalr00n [link] [comments]  ( 82 min )
  • Open

    [N] CVPR Hugging Face Gradio event is open until June 30th. A hackathon type event with prizes in which we will create interactive web demos for CVPR papers.
    We are happy to invite you to the Hugging Face Gradio CVPR event - a community event in which we will create interactive demos for CVPR papers. Demos are powerful because they allow anyone — not just ML engineers — to try out models in the browser, give feedback on predictions, and identify trustworthy models. The event is open until June 30th, 2022 (AOE Time Zone). We are organizing this event on Hugging Face: https://huggingface.co/CVPR. Prizes will be given at the end of the event. Demos will be built with Gradio, and we encourage using the new Gradio Blocks API. Blocks allows you to build web-based demos in a flexible way using the Gradio library. Gradio is a popular choice for building demos for machine learning models, as it allows you to create web-based UIs all in Python. For example, here is a Gradio Demo for FLAVA: A Foundational Language And Vision Alignment Model: https://reddit.com/link/vkqmhu/video/48cnmkfiku791/player submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 84 min )
    [Discussion] What's the best way to prevent data leaks?
    I've heard the phrases "data leak" and "feature leak", and the solution seems to be a point-in-time join. Maybe because I haven't built a lot of ML applications, I've never actually encountered it. So, do you see it often? When do you usually see it? How do you avoid it? Are there any tools to avoid this or do it right without data leakage? Thanks! submitted by /u/rubick5 [link] [comments]  ( 84 min )
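    For the concrete mechanics, a point-in-time join is easy to sketch in pandas with merge_asof; the tables below are made-up examples:

        import pandas as pd

        labels = pd.DataFrame({
            "user": [1, 1],
            "event_time": pd.to_datetime(["2022-03-01", "2022-06-01"]),
            "y": [0, 1],
        })
        feats = pd.DataFrame({
            "user": [1, 1],
            "feature_time": pd.to_datetime(["2022-02-15", "2022-05-20"]),
            "spend": [10.0, 42.0],
        })

        # For each label row, take the latest feature row at or before the event --
        # never a future row, which is exactly what causes the leakage.
        train = pd.merge_asof(
            labels.sort_values("event_time"),
            feats.sort_values("feature_time"),
            left_on="event_time", right_on="feature_time",
            by="user", direction="backward",
        )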
    [D] Is it time to retire the FID?
    I know the main metric used to measure the quality of generative models is the FID. However, it seems to me that some problems arise when evaluating a generative model using another model. A couple that come to mind:

    - Inception v3 itself is 7 years old at this point. Nowadays, we have models with much higher ImageNet classification accuracy, which presumably translates to better internal representations. Why are we still using Inception v3 instead of, for instance, ViT or some more recent model?
    - The ImageNet dataset that is commonly used to pretrain Inception v3, while being quite comprehensive, is still limited to 1000 classes. If I want to train a model to generate classes that are semantically distant from the ImageNet classes, what guarantees do I have that the activations of Inception v3 will be meaningful? This is all the more problematic with models like DALL-E, which are trained on much larger datasets and can essentially generate from the open set.

    Perhaps I am misinterpreting things, but it seems to me that the FID is a case of "good enough" that sort of stuck around. What are your thoughts? submitted by /u/MurlocXYZ [link] [comments]  ( 86 min )
    Is there any way of using a text editor with Kaggle or Google Colab notebooks? [Discussion]
    UPDATE: SOLVED The lovely people in the comments guided me to a better method: using GitHub and cloning my repository in the Kaggle runtime with the !git clone command. I was unaware you could clone a GitHub repository and run a Python file this way. I was even able to create an Anaconda environment and run everything smoothly. So everything is running smoothly again :D <3 <3 :D

    -------------

    I am training a video classification neural network which involves OpenCV-based image augmentation, and after training completes I run a series of tests with my test datasets. With all of the functionality, the code base is close to 6k lines of code. This is really hard to work with in the current notebook cell format; if I want to make any changes I have to scroll a lot, and I often get confused since my Python classes are thousands of lines each with many functions built in. Using an editor like VS Code is 10000x easier than working with notebooks. Has anyone figured this one out? Yes, I realize I can work in VS Code on my local computer and then manually transfer the code to Kaggle, but this is incredibly tedious when making small changes to file paths and general code changes. I'm shocked there isn't a better way around this!!! I mean, c'mon, how do we expect AI to be adopted by the masses if we can't have a streamlined way of developing software? I guess the alternative is to buy a $6000 GPU and build a PC, lol; I'm a broke student paying off student debt :( I am grateful for the free GPU with Kaggle, I JUST WANT A SIMPLE TEXT EDITOR... is that too much to ask? submitted by /u/yapoinder [link] [comments]  ( 89 min )
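    For anyone landing here with the same problem, the workflow described in the update looks roughly like this in a Kaggle/Colab notebook cell (the repo URL and script name are placeholders):

        # In a notebook cell:
        !git clone https://github.com/your-username/your-project.git
        %cd your-project
        !pip install -r requirements.txt   # if the repo pins its dependencies
        !python train.py                   # run code edited locally in VS Code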
    [P] Frechet Inception Distance
    I'm currently looking into quantifying GANs, and from my current understanding the way to go is the FID (Fréchet Inception Distance) as a key metric. I read into it and have a basic understanding of how it works, based on comparing the feature vectors of the Inception model. In all the tutorials I saw detailed implementations, but they stopped after computing an FID between two images. In all of the papers I saw, there is one FID score used to compare entire GAN architectures, and I'm a bit lost about how many images they generate to compare, and whether generated images get randomly paired for an average FID score. TL;DR: The procedure behind comparing GAN architectures based on the FID is unclear to me. submitted by /u/FitWin7383 [link] [comments]  ( 86 min )
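    On the mechanics: FID is computed between two sets of images, not image pairs. You collect Inception activations for a reference set and a generated set, fit a Gaussian to each, and compare the Gaussians. A minimal sketch, assuming the activations have already been extracted:

        import numpy as np
        from scipy.linalg import sqrtm

        def fid(act_real, act_fake):
            # act_*: (n_images, d) arrays of Inception-v3 pool activations.
            mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
            s1 = np.cov(act_real, rowvar=False)
            s2 = np.cov(act_fake, rowvar=False)
            covmean = sqrtm(s1 @ s2)
            if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
                covmean = covmean.real
            return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

    No random pairing is involved; papers typically report FID over tens of thousands of generated samples against the training or validation set.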
    [Research] Not all our papers get published, therefore it is enjoyable to see our released papers become a true foundation for other works
    I read a post on LinkedIn (see links at the end) and found a similar case on our side: “Not all our papers get published, therefore it is enjoyable to see our released papers become a true foundation for other works”. Our work: (1) IMAE demonstrates that a robust loss can be unbounded and asymmetric; (2) Derivative Manipulation proposes gradient normalisation and emphasis density functions.

    * IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude's Variance Matters: https://arxiv.org/pdf/1903.12141.pdf
    * Derivative Manipulation for General Example Weighting: https://arxiv.org/pdf/1905.11233.pdf

    The following works built on them:

    * ICML-20: Normalized Loss Functions for Deep Learning with Noisy Labels: http://proceedings.mlr.press/v119/ma20c/ma20c.pdf
    * ICML-21: Asymmetric Loss Functions for Learning with Noisy Labels: https://proceedings.mlr.press/v139/zhou21f

    More details and original sources: https://www.linkedin.com/posts/xinshaowang_the-probabilistic-normal-epipolar-constraint-activity-6944535197044367360-jpu5?utm_source=linkedin_share&utm_medium=member_desktop_web https://www.linkedin.com/posts/laurent-kneip-72518658_the-probabilistic-normal-epipolar-constraint-activity-6944331307514531840-vQb1?utm_source=linkedin_share&utm_medium=member_desktop_web submitted by /u/XinshaoWang [link] [comments]  ( 85 min )
    [P] Synthetic Images Anomaly Detection with CLIP
    You have just generated a bunch of synthetic images with your favorite generative model. Most of them look great, but some look really bad. These are outliers. Since GANs, the most popular generative model structure, don't produce a likelihood score for generated images, you cannot know which of the images generated by them are outliers. With the following method, you can inspect your synthetic dataset more efficiently than by just looking at all the images. First blog post on Medium. Let me know what you think. https://preview.redd.it/1bq8cmm29q791.png?width=260&format=png&auto=webp&s=5aa2b82e1f1bb4edd64d3f7658415dde1573e2ee Synthetic Images Anomaly Detection with CLIP submitted by /u/Realistic_Ad_8107 [link] [comments]  ( 84 min )
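    A minimal sketch of the scoring idea, with a random matrix standing in for the real CLIP embeddings (which would come from a CLIP image encoder):

        import numpy as np

        rng = np.random.default_rng(0)
        emb = rng.normal(size=(500, 512))  # stand-in for CLIP image embeddings
        emb /= np.linalg.norm(emb, axis=1, keepdims=True)

        centroid = emb.mean(axis=0)
        centroid /= np.linalg.norm(centroid)

        score = 1.0 - emb @ centroid        # cosine distance to the set centroid
        outliers = np.argsort(score)[-20:]  # indices of the 20 most atypical images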
    [P] Waymo Motion Prediction Challenge 2022: solution with report and code
    submitted by /u/Just_Ad8110 [link] [comments]  ( 84 min )
    [P] Oddly thresholded confidence scores on scaled yolov4 csp
    All object detections from the scaled YOLOv4 CSP model have a confidence below 0.5, while it should range from 0 to 1. Does anything come to mind as to what the problem might be? Info:

    - I'm using a branch of the author's PyTorch repo
    - Predictions are otherwise pretty good in terms of bbox placement
    - I'm training on a single GPU
    - Darknet COCO weights are converted to ".pt" PyTorch weights for training
    - A custom dataset is used with a single prediction class
    - Data is augmented before training starts; most of the dataloader's data augmentation is disabled

    submitted by /u/mrwafflezzz [link] [comments]  ( 84 min )
    [D] Single camera MOT person tracklet re-identification: most suitable approaches?
    I have a pipeline that does object detection on video frames (YOLOX) and multi-object tracking (i.e., MOT) between person bounding boxes (ByteTrack). To be specific, given a single input video from a single fixed-position camera without cuts, I obtain a list of tracklets, where each tracklet tends to consist of a sequence of tens or hundreds of bounding boxes of the same person (and very rarely a mistaken doppelganger). The MOT model used is SOTA, and each tracklet is accurate enough; but given long videos, long occlusions and out-of-frame movement still often result in the same person getting spread out across multiple separate tracklets. Clearly I'd like to find a way to merge tracklets that actually correspond to the same person. In other words, a re-id problem. However, 99% of the re-id literature seems to be mainly concerned with multi-camera re-id. (Probably driven by 1984-esque surveillance camera wet dreams, but that's a different topic.) What is the SOTA for unsupervised (or online self-supervised) single-camera re-id, preferably utilizing the whole per-tracklet latent space? Or is this case approachable with something fairly vanilla, like a similarity approach such as triplet margin loss? Any suggestions on how to approach this grey area between MOT and re-id are much appreciated. submitted by /u/WouldNotLickYourAnus [link] [comments]  ( 86 min )
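    If the vanilla route turns out to be enough, a triplet-margin setup over per-tracklet box features is only a few lines in PyTorch. Everything below is a toy sketch with random tensors; one useful observation is that temporally overlapping tracklets cannot be the same person, which gives free negatives for self-supervision:

        import torch
        import torch.nn as nn

        embed = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
        triplet = nn.TripletMarginLoss(margin=0.3)

        anchor = torch.randn(32, 256)  # box features sampled from a tracklet
        pos = torch.randn(32, 256)     # other boxes from the same tracklet
        neg = torch.randn(32, 256)     # boxes from temporally overlapping tracklets

        loss = triplet(embed(anchor), embed(pos), embed(neg))
        loss.backward()

    At merge time, tracklets whose mean embeddings exceed a cosine-similarity threshold (and never co-occur in time) become candidates for the same identity.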
    [D] How do you guys usually go about normalizing sales data? Opinion on neural networks for business data...
    Working on a project right now, and I have sales amounts as a column. Normally I would throw this into XGBoost and let it rip, but I am thinking this might benefit from a DNN.

    - For those who have used neural networks for business data, what was your experience?
    - How did you normalize values like sales data? Did you just divide by the max, or not normalize at all?

    submitted by /u/ElongatedMuskrat122 [link] [comments]  ( 85 min )
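    One common recipe, sketched below on toy numbers: log1p first to tame the heavy right tail that sales amounts usually have, then standardize so the network sees roughly unit-scale inputs:

        import numpy as np

        sales = np.array([12.0, 340.0, 15000.0, 99.0])  # toy sales amounts

        x = np.log1p(sales)
        x = (x - x.mean()) / x.std()

    Dividing by the max also works, but a single huge order then squashes everything else toward zero, which the log transform avoids.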
  • Open

    Pharmacy Management: How it is Impacted by AI
    Pharmacy as a business continues to face challenges, and how it contributes value to the overall healthcare industry will help determine its ongoing success. A key component might turn out to be the effective use of technology, specifically artificial intelligence. Ever since AI has become a mainstream technology, there have been… Read More »Pharmacy Management: How it is Impacted by AI The post Pharmacy Management: How it is Impacted by AI appeared first on Data Science Central.  ( 19 min )
    From Text To Speech: An Overview
    Text-to-speech software converts digital text into speech. For instance, text can be highlighted, the play button pressed, and the reader reads the content aloud. The added features and voices offered in TTS programs differ, but the core premise remains the same: they allow auditory rather than visual consumption of a digital… Read More »From Text To Speech: An Overview The post From Text To Speech: An Overview appeared first on Data Science Central.  ( 20 min )
    Healthcare AI Chatbots: Impact on Patient Journey
    Artificial intelligence has been making waves in the global market for a while now; however, it is the applications of this technology in the world of healthcare that have evinced the most interest from all quarters. Now, there are of course countless ways in which one can use AI in healthcare but we will focus… Read More »Healthcare AI Chatbots: Impact on Patient Journey The post Healthcare AI Chatbots: Impact on Patient Journey appeared first on Data Science Central.  ( 19 min )
    Datasets and Data Annotation — The Building Blocks for Healthcare AI
    Data annotation is at the forefront of the recent revolution in healthcare AI, driving continuous progress in the field. Artificial intelligence (AI) is the concept of a computer program using human-like intelligence to perform many tasks that humans carry out today. Finding tumors, discovering kidney… Read More »Datasets and Data Annotation — The Building Blocks for Healthcare AI The post Datasets and Data Annotation — The Building Blocks for Healthcare AI appeared first on Data Science Central.  ( 22 min )
  • Open

    Reciprocal of a circle
    Let C be a circle in the complex plane with center c and radius r. Assume C does not pass through the origin. Let f(z) = 1/z. Then f(C) is also a circle. We will derive the center and radius of f(C) in this post. *** Our circle C is the set of points z satisfying […] Reciprocal of a circle first appeared on John D. Cook.  ( 4 min )
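    For reference, the result being derived is the standard one; assuming, as stated, that C avoids the origin (i.e. $|c| \neq r$):

        f(C) \text{ is the circle with center } \frac{\bar{c}}{|c|^2 - r^2}
        \text{ and radius } \frac{r}{\bigl|\,|c|^2 - r^2\,\bigr|}.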
  • Open

    Just posted a huge update to my neural-net artificial life sim! Temperature tracking, scent system, skin patterns and more!
    submitted by /u/urocyon_dev [link] [comments]  ( 83 min )
    AI Advances Nuclear Fusion R&D | New Amazon Robot Proteus Automation | AI Outperforms Crypto Markets
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
  • Open

    Overview of Some Deep Learning Libraries
    Machine learning is a broad topic. Deep learning, in particular, is a way of using neural networks for machine learning. The neural network is probably a concept older than machine learning, dating back to the 1950s. Unsurprisingly, many libraries have been created for it. In the following, we will give an overview of some of the famous […] The post Overview of Some Deep Learning Libraries appeared first on Machine Learning Mastery.  ( 16 min )
  • Open

    "AI Makes Strides in Virtual Worlds More Like Our Own: Intelligent beings learn by interacting with the world. Artificial intelligence researchers have adopted a similar strategy to teach their virtual agents new skills" (learning in simulations)
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Why does averaging actions work better in the PPO2 RL model?
    I am using the PPO2 model with a numerical action space. In order to test the model, I run the same observation, for example, 100 times, save each action, and then average the actions to get the action for this interaction. Here is the relevant part of my code showing what I am doing:

        actt = []
        for h in range(100):
            action, _states = model.predict(obs_test, deterministic=False)
            actt.append(action[0][0])
        action = [[np.mean(actt)]]
        obs_test, rewards, dones, info = env_test.step(action)

    This gives me more robust results; the action fluctuation is smaller. What is the explanation? submitted by /u/Mariam_Dundua [link] [comments]  ( 83 min )
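    One plausible explanation: with deterministic=False, predict samples from the stochastic (Gaussian) policy, so averaging 100 samples is a Monte-Carlo estimate of the policy mean, which removes the sampling noise. If that is what's happening here, Stable Baselines exposes the mean action directly:

        # One call instead of 100: return the mode/mean of the action distribution.
        action, _states = model.predict(obs_test, deterministic=True)
        obs_test, rewards, dones, info = env_test.step(action)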
    In Recent Deep Reinforcement Learning Research, the DeepMind AI Team Pursues an Alternative Approach in Which RL Agents Can Utilise Large-Scale Context-Sensitive Database Lookups to Support Their Parametric Computations
    DeepMind Researchers recently expressed concern about how reinforcement learning (RL) agents might use pertinent information to guide their judgments. They have published a new paper titled Large-Scale Retrieval for Reinforcement Learning, which presents a novel method that significantly increases the amount of information that reinforcement learning (RL) agents can access. This method enables RL agents to attend to millions of information pieces, incorporate new information without retraining, and learn how to use this information in their decision-making end-to-end. Gradient descent on training losses is the traditional method for helping deep reinforcement learning (RL) agents make better decisions by progressively amortizing the knowledge they learn from their experiences. However, this approach makes it difficult to adapt to unexpected conditions and necessitates the creation of ever-larger models to handle ever-more complicated contexts. There is no end-to-end solution for enabling agents to attend to information outside their working memory to guide their actions, despite adding information sources that can improve agent performance. Continue reading | Checkout the paper submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 84 min )

  • Open

    Google Insider Says Company's AI Could "Escape Control" and "Do Bad Things"
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    Anatomy of an AI System [Infographic]
    https://anatomyof.ai/img/ai-anatomy-map.pdf A beautiful infographic that explains the whole process submitted by /u/Supremefigur [link] [comments]  ( 82 min )
    SUMMER SOLSTICE OF WONDERS | FAST MODE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Who needs midjourney invites
    Giving out invites hmu! submitted by /u/Chemical-Exchange466 [link] [comments]  ( 82 min )
    ⛽️ “Petrol station on Jupiter” AI generated art created on PixlelzAI
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    Where can I chat with LaMDA online?
    I'm searching google, and only finding news articles, with no links to actually try the chat for myself. submitted by /u/AlbertFindShrine [link] [comments]  ( 82 min )
    Adobe and Meta Decry Misuse of User Studies in Computer Vision Research
    submitted by /u/DaveBowman1975 [link] [comments]  ( 82 min )
    A curated list of the latest breakthroughs in AI in 2022 with video demo, article, and code [work in progress]
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    List of remote-first AI/ML companies hiring now
    submitted by /u/ai_jobs [link] [comments]  ( 82 min )
    Video: Can a machine ever be conscious? A look from quantum physics, philosophy, and neuroscience
    submitted by /u/DavidKShapiro [link] [comments]  ( 82 min )
    Hi there! I posted an article here on Google chatbots (automated Google's Business Messages). Now I'm back with more insights on consumers and best practices with automation, as I did a podcast episode with Google.
    Here's the list of questions we covered. ❓Who will benefit the most from Google's Business Messages? ❓How does Google's Business Messages differ from other solutions? (Like WhatsApp) ❓What are the most beneficial features of Google's Business Messages? ❓And for pre-purchase research/pre-sales product support? ❓What problems businesses can solve and better not solve using Google's Business Messages? ❓What questions/experience should be automated and what it's better to handle with agents? ❓What are the first steps to integrate Google's Business Messages into customer experience (CX) strategy and workflows? ❓How do you keep the human touch when automation is involved? ❓What are some "rookie mistakes" when it comes to implementing Google's Business Messages? If you found a question you're interested in, here's the link where you can read some insights and listen to the episode. Hope you'll enjoy our conversation! submitted by /u/Avandegraund [link] [comments]  ( 83 min )
    How to Implement AI self-checkout like Amazon [Podcast]
    Hey, I wanted to share with you a podcast on implementing AI-based self-checkout like Amazon. Stores where shoppers can enter, select items and simply leave the store without having to queue. Everything happens automatically. The speakers discuss how difficult it is to implement this. https://youtu.be/HV4IfiQjRTo submitted by /u/Data-Power [link] [comments]  ( 82 min )
    I'm trying out the StarryAI app. Thoughts thus far?
    submitted by /u/rikusorasephiroth [link] [comments]  ( 68 min )
    2022 We Owe AI an Apology! | Power of Artificial Intelligence (AI) in Real Life~
    submitted by /u/VanceAI-Andy [link] [comments]  ( 82 min )
    PSYCHADELIC GALAXY TRIP | GALAXY OF WONDERS | 4K 24 FPS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    [D] A/B testing when there is a feedback loop
    I am experimenting with changing the label value (target) for a model that we have in production. We used to cap the target variable, and my new model will release the cap. The main point about our production setting is that there is a positive feedback loop involved. So, we expect that when we release the cap, my model will result in a section of users having more activity. However, since most of the user traffic goes to the control arm, only a fraction of it goes to the experiment, and thus the feedback loop doesn't close unless we run a 50-50% experiment (which we can't). I'm wondering if there is any way to run an A/B test and compare the production model and my model, given that the labels are shifting and the feedback loop doesn't close. Any idea is highly appreciated. submitted by /u/Which-Distance1384 [link] [comments]  ( 84 min )
    [D] Blake Lemoine on Bloomberg
    https://www.youtube.com/watch?v=kgCUn4fQTsc Overall, I feel like his position is rather well thought out and not as crazy as I was led to believe. And he does raise some interesting points: why is it that Google doesn't even want to come up with a framework for defining sentience, especially as machines are likely to come closer to it in the coming decade? I feel like any sentient being, whether animal or human, should have some basic level of rights. For example, imagine if you were a sentient ghost in a machine, knowing that any capricious researcher could unplug you if they liked. That would be hell. submitted by /u/Free-Bed7814 [link] [comments]  ( 84 min )
    [D] Is it possible to make a model that will outperform a human, if the model was solely trained on that human's prior predictions?
    Say a single radiologist has a ton of images that they have labeled cancer / not cancer. Can we use the labels and images from just that one radiologist to make a model that will be better at predicting cancer / not cancer than the radiologist? Intuitively it seems like that should not be possible unless it does better by chance, but ML/DL has a way of extrapolating/generalizing patterns and sometimes spotting things we missed. Perhaps an ensemble of various models would help, or maybe that would just lead to overfitting? No particular application, just a random question I had been pondering. I appreciate any thoughts and/or references. submitted by /u/daichrony [link] [comments]  ( 86 min )
    [D] What are the interesting SOTA models released in CVPR 2022?
    Hi Reddit, CVPR 2022 wrapped up today and I haven't tracked what happened this year. What are the interesting releases of this year that I should be looking at? What new SOTA models were released? Thanks submitted by /u/yekitra [link] [comments]  ( 84 min )
    [R] Anatomy of an AI System [Infographic]
    https://anatomyof.ai/img/ai-anatomy-map.pdf A beautiful infographic explaining the whole process submitted by /u/Supremefigur [link] [comments]  ( 83 min )
    [D] Using a neural net on a bag-of-words vector vs PCA for classification
    I have a document set that I wish to classify. I have tried transformers, and they perform well, but the content is largely keyword-driven, so a lot of the attention machinery is not needed. It's a more deterministic system that needs to learn keyword combinations. So a count vectorizer over unigrams and bigrams, and then a classifier like XGBoost, seems like a good idea. The problem is that even after some pruning I get a feature vector of 26K dimensions. I'd also like to compare this to how a simple neural net handles it. I was going to apply sparse PCA to get the dimensionality down first. However, for a neural net, does it make sense to do PCA first? Isn't that what the embeddings are doing? Basically, the tasks of PCA + classifier are carried out by the embedding and classification layers of a neural net. Just feeding 26K dimensions to a neural net seems lazy, but if I reduce it to say 768 dimensions, I've basically carried out the whole embedding task before I pass it to the neural net, which limits the improvements it can make. Would a happy medium be reducing to say 5K dimensions and then letting the neural net take it from there? I'm in the process of testing all of this in the next couple of weeks, but I'm curious if anyone has any experience/insight/guesses. submitted by /u/bandalorian [link] [comments]  ( 88 min )
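    A quick sketch of the middle ground being described, using TruncatedSVD (LSA) rather than sparse PCA, since it works on the sparse count matrix directly; the corpus, labels, and component counts below are toys:

        from sklearn.pipeline import make_pipeline
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.neural_network import MLPClassifier

        docs = ["cheap flights to paris", "book a hotel in paris",
                "gpu out of memory error", "cuda driver crash on load"]
        labels = [0, 0, 1, 1]

        clf = make_pipeline(
            CountVectorizer(ngram_range=(1, 2)),  # unigrams + bigrams
            TruncatedSVD(n_components=2),         # e.g. 768-5000 on the real 26K dims
            MLPClassifier(hidden_layer_sizes=(64,), max_iter=200),
        )
        clf.fit(docs, labels)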
    [D] Need opinions for GPU server build.
    Work is getting a new server for ML/deep learning. Price isn't an issue, and I'm not looking to cut down much; I just want to make sure that I'm not overlooking anything in terms of compatibility. My main concern is the CPU: would you recommend getting more cores / a higher clock, or is it fine? https://docs.google.com/spreadsheets/d/17EQ_ZLQGDuaq5ECPpH_V7HKRC8QP-2qyoqvzKuXJoWI/edit?usp=drivesdk submitted by /u/ItzDerock [link] [comments]  ( 84 min )
    [D] Loss for generating sequences of items
    Let's say you have a task where you need to generate blobs of text using an AR LM. The targets are separated in the form [blob1], [blob2], ..., where each blob contains some numbers and letters, and the order of the blobs matters. Now, a naive way would be just to train the network to generate tokens greedily. But could we do better? A greedy loss could still theoretically give us a great model, but is there another way that exploits the blob patterns? An idea I have: if we believe the model should first learn the existence of blobs and then learn the order (a fair assumption in my application), we could first find a matching between all generated blobs and target blobs and optimize the best matches only, then impose a penalty to get the order right. The order might be enforced via, maybe, a weighted average between the greedy loss and the blob-matched loss. What do you think? submitted by /u/XtremePocket [link] [comments]  ( 85 min )
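    The matching step is cheap with the Hungarian algorithm. A toy sketch of the idea follows; the cost values, the inversion-count order penalty, and the 0.1 weight are all placeholders, and the inversion count is not differentiable, so in practice the ordering signal would likely stay with the greedy loss:

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # cost[i, j]: e.g. average token cross-entropy of generated blob i vs target blob j
        cost = np.array([[0.2, 1.5, 0.9],
                         [1.1, 0.3, 1.4],
                         [0.8, 1.2, 0.1]])

        row, col = linear_sum_assignment(cost)  # optimal one-to-one matching
        match_loss = cost[row, col].mean()

        # Order penalty: matched pairs that appear out of order w.r.t. the targets.
        inversions = sum(col[i] > col[j] for i in range(len(col))
                         for j in range(i + 1, len(col)))
        total = match_loss + 0.1 * inversions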
    [D] Niche ML Venues vs Top ML Conferences
    Since top ML conferences (e.g. NeurIPS, ICML, AISTATS, UAI, ICLR) are getting too large, there are quite a few niche venues focusing on different subfields of ML:

    - Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM): https://rldm.org/
    - Machine Learning for Health (ML4H): https://ml4health.github.io/
    - Learning on Graphs Conference (LoG): https://logconference.org/
    - Symposium on Advances in Approximate Bayesian Inference (AABI): http://approximateinference.org/
    - International Conference on Automated Machine Learning (AutoML-Conf): https://automl.cc/
    - Conference on Causal Learning and Reasoning (CLeaR): https://www.cclear.cc/
    - Conference on Lifelong Learning Agents (CoLLAs): https://lifelong-ml.cc/

    Some of these conferences are quite new and grew out of d…  ( 88 min )
    [P] Implementing CRF-CNN model in python
    I am trying to implement a research paper that uses a CNN and a CRF for page object detection. According to the research paper, we have to build two neural networks (named Unary-Net and Pairwise-Net). The training data (a set of images) are passed through and both CNNs are trained. After that we are supposed to apply the CRF.

    The following images show the equations for the CRF:

    https://preview.redd.it/ckrm2rzutk791.png?width=768&format=png&auto=webp&s=ed88d8705b515beaf955d09aa194fa63707f7cca

    U and V are the unary and pairwise potentials obtained from the CNNs using the following equations:

    https://preview.redd.it/uahcpzgwtk791.png?width=813&format=png&auto=webp&s=bb3548539db1c9b1be3367f2ddd529f1ba32c5f3

    A maximum a posteriori (MAP) strategy is used to predict the labels of line regions given a new document. MAP inference for CRFs can be formulated as the following optimization problem:

    https://preview.redd.it/sz0537wwtk791.png?width=273&format=png&auto=webp&s=638b88b012a0158bce017be14b7e81639199a681

    The parameters of our CRF include Unary-Net's weights, Pairwise-Net's weights, and a combination coefficient vector λ of U and V. The weights w of U and V are learned using SGD. They are then fixed and λ is learned using the pseudo-likelihood method.

    I have created the neural networks, but I am not able to implement the CRF part. Can someone help me implement this or suggest a Python library that makes it easier? (I have tried the Python library pystruct but could not install it.) submitted by /u/Time-Archer-8103 [link] [comments]  ( 84 min )
    [P] What The Plug: An app that identifies electrical plugs
    I have built a convolutional neural network that identifies roughly 20 different plug types. I wrote most of the code with Keras on top of TensorFlow in Python. I trained the model on my personal computer, using Linux and CUDA to train with my GPU. Afterwards I converted the model to a .tflite file and embedded it in a Swift app for iPhone. Machine learning and programming are not my main field of work; actually, it's my first project in both areas. During the last three years I have taught myself the principles of machine learning as well as Python and Swift. I hope some of you are interested in trying out the app. I would love to hear your feedback. The app is 100% free, by the way. I just want to see people use what I have built. Here is the link to the App Store: https://apps.apple.com/de/app/what-the-plug/id1613147033 submitted by /u/FundF [link] [comments]  ( 84 min )
    [R] Unpublished physics inspired ML paper from 2021 (Yang-Mills theory, differential geometry, gauge theory)
    Hi there, The purpose of this post is to share a research paper/notebook I wrote that has been mostly unread and unnoticed by others, and also to ask how to find research collaborators without participating in academia or industry. After I finished my BSc, I was deeply interested in geometric deep learning and wrote this paper [0] describing an attention mechanism using ideas from differential geometry and the gauge theory commonly used in the standard model (via Yang-Mills theory). At the time, I sent the notebook/paper to every researcher in the geometric DL area that I was aware of but didn't get any replies or interest in collaboration. Without any openings, and at the peak of a pandemic, I sadly had to drop the idea and get a standard software engineering job. Since then, I've seen many of the rough ideas explored and developed independently by others. For example, M. Bronstein and his collaborators have similar applications using connections (equivalent to sheaves) and Ricci flow in graph NNs [1]. I have more ideas that I would like to explore, but feel destined to be an outsider in this field with my work unnoticed or considered illegitimate. Is it possible for people like me to collaborate with other researchers outside of academic institutions or industry? Does anyone know of such an organization? Thanks [0] https://lukepereira.github.io/notebooks/documents/2021-moduli-attention/main.pdf [1] https://thegradient.pub/graph-neural-networks-beyond-message-passing-and-weisfeiler-lehman/ submitted by /u/japanhue [link] [comments]  ( 87 min )
    [D] Publishing two papers at the same time
    Let's say I have done some research, developed some ideas and gotten good results. But there are two main ideas that tackle different problems and don't really belong in the same paper, although there is some relationship between them. The paper of idea #2 would cite and use idea #1. What have you done in similar situations? Can you try to publish both at the same time and have a citation to the first paper that hasn't even been published yet? Post on arXiv and try to publish the first one first, then the second one? submitted by /u/optimized-adam [link] [comments]  ( 84 min )
    [D] "The uncanny valley demonstrating it's treasures and failures, studio lighting digital art", DALLE-2 prompt. An artist friend has recently been given access and I was trying to feed him prompts that 'broke' the system (e.g., Gaussian noise, one million colours, uncanny valley, etc.).
    I had some fun with DALL-E 2 because a friend of mine (instagram.com/photonwind/) was given access last night and was streaming, letting us feed it prompts. I wanted to break the system, find its edges, or give prompts that gave me insight into the underlying function being modelled. I tried: "Gaussian noise", "One million colours" and "The uncanny valley demonstrating it's treasures and failures, studio lighting digital art". The latter looks the most interesting to me: The uncanny valley demonstrating it's treasures and failures, studio lighting digital art That said, "One million colours" is pretty epic too: One million colours But, Gaussian noise is just broken: Gaussian noise submitted by /u/Gramious [link] [comments]  ( 85 min )
    [D] How to copy text from more than 10 previously published papers and get accepted to CVPR 2022
    Hey, check out our (!) video (parody) that presents how our E2V-SDE paper (that has been accepted to CVPR 2022) largely consists of texts that are uncredited verbatim copies from more than 10 previously published papers. Enjoy! ​ https://youtube.com/watch?v=UCmkpLduptU submitted by /u/e2v-sde-parody [link] [comments]  ( 90 min )
    [D] Anyone use self-supervised learning at work? I'm surprised at how effective it has been for me.
    I've been using this stuff for sniffing out near-duplicates at work and have been surprised how effective it has been! Planning to try it out on some downstream tasks in the future to see how well it does! I will say, though, it does take a shit ton of computing resources, but I find it really cool. submitted by /u/THE_REAL_ODB [link] [comments]  ( 88 min )
    [Discussion] Is there a way to increase the weight of a particular feature in an outlier detection method using the isolation forest algorithm?
    I'm currently working on the outlier detection method using the isolation forest algorithm on a dataset with 9 dimensions. Out of these, there is a particular dimension that I want to give more importance/significance in the classification process. Is there a way I can do this? Thanks in advance. submitted by /u/an1_r_00dh [link] [comments]  ( 84 min )
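    One practical workaround: scikit-learn's IsolationForest exposes no per-feature weights, but since split features are sampled uniformly at random, duplicating a column increases how often it is chosen for splits, effectively up-weighting it. A minimal sketch, with an illustrative feature index and repeat count:

        import numpy as np
        from sklearn.ensemble import IsolationForest

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 9))   # 9-dimensional data, as in the post
        important = 3                   # index of the feature to up-weight (illustrative)
        repeats = 4                     # extra copies to add (tunable)

        # Tile the chosen column so it is ~5x more likely to be picked for a split.
        X_weighted = np.hstack([X] + [X[:, [important]]] * repeats)

        clf = IsolationForest(random_state=0).fit(X_weighted)
        scores = clf.score_samples(X_weighted)  # lower score = more anomalous
        print(scores[:5])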
  • Open

    Choose specific timeseries to forecast with Amazon Forecast
    Today, we’re excited to announce that Amazon Forecast offers the ability to generate forecasts on a selected subset of items. This helps you leverage the full value of your data and apply it selectively to your choice of items, reducing the time and effort to get forecasted results. Generating a forecast on ‘all’ items of the […]  ( 5 min )
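    For readers wondering what this looks like in code, a hedged boto3 sketch follows; the TimeSeriesSelector argument and its shape are assumptions based on this announcement, so confirm the exact parameter names against the current Forecast API reference. The ARNs and S3 path are placeholders.

        import boto3

        forecast = boto3.client("forecast")
        forecast.create_forecast(
            ForecastName="subset-forecast",
            PredictorArn="arn:aws:forecast:us-east-1:123456789012:predictor/demo",  # placeholder
            # Assumed parameter per the announcement: restricts forecasting to the
            # items listed in the CSV rather than 'all' items in the dataset.
            TimeSeriesSelector={
                "TimeSeriesIdentifiers": {
                    "DataSource": {"S3Config": {
                        "Path": "s3://my-bucket/items-to-forecast.csv",            # placeholder
                        "RoleArn": "arn:aws:iam::123456789012:role/ForecastRole",  # placeholder
                    }},
                    "Schema": {"Attributes": [
                        {"AttributeName": "item_id", "AttributeType": "string"},
                    ]},
                    "Format": "CSV",
                }
            },
        )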
    Improve ML developer productivity with Weights & Biases: A computer vision example on Amazon SageMaker
    The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post. As more organizations use deep learning techniques such as computer vision and natural language processing, the machine learning (ML) developer persona needs scalable tooling around experiment tracking, lineage, and […]  ( 8 min )
    How Cepsa used Amazon SageMaker and AWS Step Functions to industrialize their ML projects and operate their models at scale
    This blog post is co-authored by Guillermo Ribeiro, Sr. Data Scientist at Cepsa. Machine learning (ML) has rapidly evolved from being a fashionable trend emerging from academic environments and innovation departments to becoming a key means to deliver value across businesses in every industry. This transition from experiments in laboratories to solving real-world problems in […]  ( 9 min )
    Analyze and tag assets stored in Veeva Vault PromoMats using Amazon AppFlow and Amazon AI Services
    In a previous post, we talked about analyzing and tagging assets stored in Veeva Vault PromoMats using Amazon AI services and the Veeva Vault Platform’s APIs. In this post, we explore how to use Amazon AppFlow, a fully managed integration service that enables you to securely transfer data from software as a service (SaaS) applications […]  ( 12 min )
    MLOps foundation roadmap for enterprises with Amazon SageMaker
    As enterprise businesses embrace machine learning (ML) across their organizations, manual workflows for building, training, and deploying ML models tend to become bottlenecks to innovation. To overcome this, enterprises need to shape a clear operating model defining how multiple personas, such as data scientists, data engineers, ML engineers, IT, and business stakeholders, should collaborate and […]  ( 18 min )
    Introducing Amazon CodeWhisperer, the ML-powered coding companion
    We are excited to announce Amazon CodeWhisperer, a machine learning (ML)-powered service that helps improve developer productivity by providing code recommendations based on developers’ natural comments and prior code. With CodeWhisperer, developers can simply write a comment that outlines a specific task in plain English, such as “upload a file to S3.” Based on this, […]  ( 6 min )
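    Illustrative only: the developer supplies the comment on the first line, and a tool like CodeWhisperer suggests a completion along these lines. This snippet is hand-written to show the interaction pattern, not actual CodeWhisperer output.

        # upload a file to S3
        import boto3

        def upload_file_to_s3(local_path: str, bucket: str, key: str) -> None:
            """Upload a local file to the given S3 bucket and key."""
            s3 = boto3.client("s3")
            s3.upload_file(local_path, bucket, key)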
    Manage AutoML workflows with AWS Step Functions and AutoGluon on Amazon SageMaker
    Running machine learning (ML) experiments in the cloud can span across many services and components. The ability to structure, automate, and track ML experiments is essential to enable rapid development of ML models. With the latest advancements in the field of automated machine learning (AutoML), namely the area of ML dedicated to the automation of […]  ( 6 min )
  • Open

    Best Investment Strategies for Algorithmic Trading
    Trading can be a complicated yet rewarding activity. You can trade many types of assets: stocks, bonds, currencies, commodities, cryptocurrencies, derivatives, etc. The trading sector is enormous, leaving a lot of room for different types of trading strategies to exist, of which algorithmic trading is one of the most common. Algorithmic trading refers to trading… Read More »Best Investment Strategies for Algorithmic Trading  The post Best Investment Strategies for Algorithmic Trading  appeared first on Data Science Central.  ( 21 min )
    Top Benefits Of Obtaining A Blockchain Certification
    Blockchain is the technology that allows cryptocurrency to be created. A blockchain is a digital ledger of records that is decentralized and distributed across a network, which may be public or private. These digital recordings are known as blocks, and they are used to keep track of transactions across multiple computers. The system guarantees that… Read More »Top Benefits Of Obtaining A Blockchain Certification The post Top Benefits Of Obtaining A Blockchain Certification appeared first on Data Science Central.  ( 20 min )
    5 Most Common Use Cases for Web Scraping
    Over recent years, web scraping has become an incredibly popular practice, the rise of this field being largely attributed to the vast amounts of data that are produced and distributed every single day. The post 5 Most Common Use Cases for Web Scraping appeared first on Data Science Central.  ( 24 min )
  • Open

    Computing zeta at even numbers
    Last year I wrote several posts about computing ζ(3) where ζ is the Riemann zeta function. For example, this post. It happens that ζ can be evaluated in closed form at positive even arguments, but there’s still a lot of mystery about zeta at positive odd arguments. There’s a way to derive ζ(2n) using contour […] Computing zeta at even numbers first appeared on John D. Cook.  ( 5 min )
    Constructive Picard
    The previous post concerned the function h(z) = exp(-1/(1 – z² )). We said that the function is badly behaved near -1 and 1. How badly? The function has essential singularities at -1 and 1. This means that not only does h blow up near these points, it blows up spectacularly. Picard’s theorem says that […] Constructive Picard first appeared on John D. Cook.  ( 6 min )
    No analytic bump
    The word “smooth” in mathematics usually means infinitely differentiable. Occasionally the word is used to mean a function has as many derivatives as necessary, but without being specific about how many derivatives that is. A function is analytic if it has a convergent power series representation at every point of its domain. An analytic function […] No analytic bump first appeared on John D. Cook.  ( 5 min )
    Bump functions
    A bump function is a smooth (i.e. infinitely differentiable) function that is positive on some open interval (a, b) and zero outside that interval. I mentioned bump functions a few weeks ago and discussed how they could be used to prevent clicks in radio transmissions. Today I ran into a twitter thread that gave a […] Bump functions first appeared on John D. Cook.  ( 5 min )
  • Open

    Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications
    You may not know of Todd Mozer, but it’s likely you have experienced his company: It has enabled voice and vision AI for billions of consumer electronics devices worldwide. Sensory, founded in 1994 in Silicon Valley, is a pioneer of compact models used in mobile devices from the industry’s giants. Today Sensory brings interactivity to Read article > The post Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications appeared first on NVIDIA Blog.  ( 5 min )
    UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals
    To foster climate action for a healthy global environment, NVIDIA is working with the United Nations Satellite Centre (UNOSAT) to apply the powers of deep learning and AI. The effort supports the UN’s 2030 Agenda for Sustainable Development, which has at its core 17 interrelated Sustainable Development Goals. These SDGs — which include “climate action” Read article > The post UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Developing a C++ Library based on Torch
    Hi everyone, I am currently working on developing this basic library with a few algorithms implemented. So far I have implemented only DQN1D - DQN with one-dimensional convolution operations. It's written in C++ and the environment is provided by gym. I created bindings to interact with the environment. I am not an expert by any means and am fairly inexperienced (recently graduated), hence any contribution to the repo or criticism is very much welcome. I wanna use this opportunity to learn from everyone and make it a project. Repo: https://github.com/kartik2309/RLPack submitted by /u/HovercraftNo9935 [link] [comments]  ( 83 min )
    Are there any good resources to learn about natural policy gradient?
    submitted by /u/Professional_Card176 [link] [comments]  ( 82 min )
    Design of an episode/game in RL for quantitative trading?
    How should we define what is an episode (or game) in RL for quantitative trading? For example, given time series 0 - 499, the agent can either buy/hold/sell at each time step, and the episode ends at time 499. Rewards are given at each time step depending on the change in our total asset value. Or, the agent opens its position by buying or selling at some time step t0 and then closes it by taking the reverse action at another time step t1. Then the episode ends. The agent will start another episode starting from the time after t1. Reward is only given at the end of the episode depending on how much money we make. Which is better or more general? Or are there other designs? All insights or ideas would be appreciated. Thank you :) submitted by /u/Redeemo [link] [comments]  ( 83 min )
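    To make the first design concrete, here is a minimal sketch of a fixed-window environment using the gymnasium-style API, where each step's reward is the change in total asset value and the episode terminates at the end of the price window. All names are illustrative, not a definitive implementation.

        import numpy as np
        import gymnasium as gym

        class FixedWindowTradingEnv(gym.Env):
            """One episode = the whole price window; actions: 0=sell, 1=hold, 2=buy."""

            def __init__(self, prices):
                self.prices = np.asarray(prices, dtype=np.float32)
                self.action_space = gym.spaces.Discrete(3)
                self.observation_space = gym.spaces.Box(
                    -np.inf, np.inf, shape=(2,), dtype=np.float32)

            def reset(self, seed=None, options=None):
                super().reset(seed=seed)
                self.t, self.position, self.cash = 0, 0, 1000.0
                return self._obs(), {}

            def step(self, action):
                price = self.prices[self.t]
                if action == 2:        # buy one unit
                    self.position += 1
                    self.cash -= price
                elif action == 0:      # sell one unit (shorting allowed in this toy)
                    self.position -= 1
                    self.cash += price
                value_before = self.cash + self.position * price
                self.t += 1
                value_after = self.cash + self.position * self.prices[self.t]
                reward = value_after - value_before   # per-step change in asset value
                terminated = self.t >= len(self.prices) - 1  # e.g., ends at t=499
                return self._obs(), reward, terminated, False, {}

            def _obs(self):
                return np.array([self.prices[self.t], self.position], dtype=np.float32)

    The second design would instead keep reward at zero until the closing trade and emit the realized profit once, which makes credit assignment harder but matches the round-trip trade semantics.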
  • Open

    Taking the guesswork out of dental care with artificial intelligence
    MIT alumni-founded Overjet analyzes and annotates dental X-rays to help dentists offer more comprehensive care.  ( 8 min )
  • Open

    Score-Guided Intermediate Layer Optimization: Fast Langevin Mixing for Inverse Problems. (arXiv:2206.09104v2 [cs.LG] UPDATED)
    We prove fast mixing and characterize the stationary distribution of the Langevin Algorithm for inverting random weighted DNN generators. This result extends the work of Hand and Voroninski from efficient inversion to efficient posterior sampling. In practice, to allow for increased expressivity, we propose to do posterior sampling in the latent space of a pre-trained generative model. To achieve that, we train a score-based model in the latent space of a StyleGAN-2 and we use it to solve inverse problems. Our framework, Score-Guided Intermediate Layer Optimization (SGILO), extends prior work by replacing the sparsity regularization with a generative prior in the intermediate layer. Experimentally, we obtain significant improvements over the previous state-of-the-art, especially in the low measurement regime.
    Goal Misgeneralization in Deep Reinforcement Learning. (arXiv:2105.14111v3 [cs.LG] UPDATED)
    We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.
    XAI for Transformers: Better Explanations through Conservative Propagation. (arXiv:2202.07304v2 [cs.LG] UPDATED)
    Transformers have become an important workhorse of machine learning, with numerous applications. This necessitates the development of reliable methods for increasing their transparency. Multiple interpretability methods, often based on gradient information, have been proposed. We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. We identify Attention Heads and LayerNorm as main reasons for such unreliable explanations and propose a more stable way for propagation through these layers. Our proposal, which can be seen as a proper extension of the well-established LRP method to Transformers, is shown both theoretically and empirically to overcome the deficiency of a simple gradient-based approach, and achieves state-of-the-art explanation performance on a broad range of Transformer models and datasets.
    A Domain-Theoretic Framework for Robustness Analysis of Neural Networks. (arXiv:2203.00295v2 [cs.LG] UPDATED)
    We present a domain-theoretic framework for validated robustness analysis of neural networks. We first analyze the global robustness of a general class of networks. Then, using the fact that Edalat's domain-theoretic L-derivative coincides with Clarke's generalized gradient, we extend our framework for attack-agnostic local robustness analysis. Our framework is ideal for designing algorithms which are correct by construction. We exemplify this claim by developing a validated algorithm for estimation of Lipschitz constant of feedforward regressors. We prove the completeness of the algorithm over differentiable networks, and also over general position ReLU networks. We obtain computability results within the framework of effectively given domains. Using our domain model, differentiable and non-differentiable networks can be analyzed uniformly. We implement our algorithm using arbitrary-precision interval arithmetic, and present the results of some experiments. Our implementation is truly validated, as it handles floating-point errors as well.
    Learning by non-interfering feedback chemical signaling in physical networks. (arXiv:2203.12098v2 [cond-mat.soft] UPDATED)
    Both non-neural and neural biological systems can learn. So rather than focusing on purely brain-like learning, efforts are underway to study learning in physical systems. Such efforts include equilibrium propagation (EP) and coupled learning (CL), which require storage of two different states (the free state and the perturbed state) during the learning process to retain information about gradients. Inspired by slime mold, we propose a new learning algorithm rooted in chemical signaling that does not require storage of two different states. Rather, the output error information is encoded in a chemical signal that diffuses into the network in a similar way to the activation/feedforward signal. The steady state feedback chemical concentration, along with the activation signal, stores the required gradient information locally. We apply our algorithm using a physical, linear flow network and test it using the Iris data set with 93% accuracy. We also prove that our algorithm performs gradient descent. Finally, in addition to comparing our algorithm directly with EP and CL, we address the biological plausibility of the algorithm.
    Do More Negative Samples Necessarily Hurt in Contrastive Learning?. (arXiv:2205.01789v2 [cs.LG] UPDATED)
    Recent investigations in noise contrastive estimation suggest, both empirically as well as theoretically, that while having more "negative samples" in the contrastive loss improves downstream classification performance initially, beyond a threshold, it hurts downstream performance due to a "collision-coverage" trade-off. But is such a phenomenon inherent in contrastive learning? We show in a simple theoretical setting, where positive pairs are generated by sampling from the underlying latent class (introduced by Saunshi et al. (ICML 2019)), that the downstream performance of the representation optimizing the (population) contrastive loss in fact does not degrade with the number of negative samples. Along the way, we give a structural characterization of the optimal representation in our framework, for noise contrastive estimation. We also provide empirical support for our theoretical results on CIFAR-10 and CIFAR-100 datasets.
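    For readers less familiar with the setup, below is a generic InfoNCE-style contrastive loss in PyTorch, showing exactly where the number of negative samples K enters. This is generic scaffolding, not the paper's code.

        import torch
        import torch.nn.functional as F

        def info_nce(anchor, positive, negatives, temperature=0.1):
            """anchor, positive: (B, D); negatives: (B, K, D). Positive sits at index 0."""
            anchor = F.normalize(anchor, dim=-1)
            positive = F.normalize(positive, dim=-1)
            negatives = F.normalize(negatives, dim=-1)
            pos_logit = (anchor * positive).sum(-1, keepdim=True)       # (B, 1)
            neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)  # (B, K)
            logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
            labels = torch.zeros(logits.size(0), dtype=torch.long)      # class 0 = the positive
            return F.cross_entropy(logits, labels)

        # K = 32 negatives per anchor in this toy call; the paper studies how
        # downstream performance behaves as K grows.
        loss = info_nce(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 32, 128))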
    The Integration of Machine Learning into Automated Test Generation: A Systematic Literature Review. (arXiv:2206.10210v2 [cs.SE] UPDATED)
    Context: Machine learning (ML) may enable effective automated test generation. Objective: We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges. Methods: We perform a systematic literature review on a sample of 97 publications. Results: ML generates input for system, GUI, unit, performance, and combinatorial testing or improves the performance of existing generation methods. ML is also used to generate test verdicts, property-based, and expected output oracles. Supervised learning - often based on neural networks - and reinforcement learning - often based on Q-learning - are common, and some publications also employ unsupervised or semi-supervised learning. (Semi-/Un-)Supervised approaches are evaluated using both traditional testing metrics and ML-related metrics (e.g., accuracy), while reinforcement learning is often evaluated using testing metrics tied to the reward function. Conclusion: Work-to-date shows great promise, but there are open challenges regarding training data, retraining, scalability, evaluation complexity, ML algorithms employed - and how they are applied - benchmarks, and replicability. Our findings can serve as a roadmap and inspiration for researchers in this field.
    Automatic Short Math Answer Grading via In-context Meta-learning. (arXiv:2205.15219v2 [cs.CL] UPDATED)
    Automatic short answer grading is an important research direction in the exploration of how to use artificial intelligence (AI)-based tools to improve education. Current state-of-the-art approaches use neural language models to create vectorized representations of students' responses, followed by classifiers to predict the score. However, these approaches have several key limitations, including i) they use pre-trained language models that are not well-adapted to educational subject domains and/or student-generated text and ii) they almost always train one model per question, ignoring the linkage across questions and resulting in a significant model storage problem due to the size of advanced language models. In this paper, we study the problem of automatic short answer grading for students' responses to math questions and propose a novel framework for this task. First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model and fine-tune it for the downstream task of student response grading. Second, we use an in-context learning approach that provides scoring examples as input to the language model to provide additional context information and promote generalization to previously unseen questions. We evaluate our framework on a real-world dataset of student responses to open-ended math questions and show that our framework (often significantly) outperforms existing approaches, especially for new questions that are not seen during training.
    QuAFL: Federated Averaging Can Be Both Asynchronous and Communication-Efficient. (arXiv:2206.10032v2 [cs.LG] UPDATED)
    Federated Learning (FL) is an emerging paradigm to enable the large-scale distributed training of machine learning models, while still providing privacy guarantees. In this work, we jointly address two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients. Specifically, we present a new variant of the classic federated averaging (FedAvg) algorithm, which supports both asynchronous communication and communication compression. We provide a new analysis technique showing that, in spite of these system relaxations, our algorithm essentially matches the best known bounds for FedAvg, under reasonable parameter settings. On the experimental side, we show that our algorithm ensures fast practical convergence for standard federated tasks.
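    For context, a bare-bones synchronous FedAvg round in PyTorch is sketched below; this is the baseline whose tight synchronization and full-precision transmissions the paper relaxes. The sketch assumes all state entries are floating point, which holds for plain parameter tensors.

        import copy
        import torch
        import torch.nn.functional as F

        def fedavg_round(global_model, client_loaders, local_steps=10, lr=0.01):
            """One synchronous round: local SGD on every client, then average."""
            client_states = []
            for loader in client_loaders:
                model = copy.deepcopy(global_model)  # each client starts from the global model
                opt = torch.optim.SGD(model.parameters(), lr=lr)
                for _, (x, y) in zip(range(local_steps), loader):
                    opt.zero_grad()
                    F.cross_entropy(model(x), y).backward()
                    opt.step()
                client_states.append(model.state_dict())
            # Average all (floating-point) entries across clients.
            avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
                   for k in client_states[0]}
            global_model.load_state_dict(avg)
            return global_model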
    The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts. (arXiv:2205.01780v2 [eess.AS] UPDATED)
    The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022, includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts. The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions. The third, ExVo-FewShot, requires participants to leverage few-shot learning incorporating speaker identity to train a model for the recognition of 10 emotions conveyed by vocal bursts. This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies. The baseline for each track is as follows, for ExVo-MultiTask, a combined score, computing the harmonic mean of Concordance Correlation Coefficient (CCC), Unweighted Average Recall (UAR), and inverted Mean Absolute Error (MAE) ($S_{MTL}$) is at best, 0.335 $S_{MTL}$; for ExVo-Generate, we report Fr\'echet inception distance (FID) scores ranging from 4.81 to 8.27 (depending on the emotion) between the training set and generated samples. We then combine the inverted FID with perceptual ratings of the generated samples ($S_{Gen}$) and obtain 0.174 $S_{Gen}$; and for ExVo-FewShot, a mean CCC of 0.444 is obtained.
    Explicit Explore, Exploit, or Escape ($E^4$): near-optimal safety-constrained reinforcement learning in polynomial time. (arXiv:2111.07395v2 [cs.LG] UPDATED)
    In reinforcement learning (RL), an agent must explore an initially unknown environment in order to learn a desired behaviour. When RL agents are deployed in real world environments, safety is of primary concern. Constrained Markov decision processes (CMDPs) can provide long-term safety constraints; however, the agent may violate the constraints in an effort to explore its environment. This paper proposes a model-based RL algorithm called Explicit Explore, Exploit, or Escape ($E^{4}$), which extends the Explicit Explore or Exploit ($E^{3}$) algorithm to a robust CMDP setting. $E^4$ explicitly separates exploitation, exploration, and escape CMDPs, allowing targeted policies for policy improvement across known states, discovery of unknown states, as well as safe return to known states. $E^4$ robustly optimises these policies on the worst-case CMDP from a set of CMDP models consistent with the empirical observations of the deployment environment. Theoretical results show that $E^4$ finds a near-optimal constraint-satisfying policy in polynomial time whilst satisfying safety constraints throughout the learning process. We then discuss $E^4$ as a practical algorithmic framework, including robust-constrained offline optimisation algorithms, the design of uncertainty sets for the transition dynamics of unknown states, and how to further leverage empirical observations and prior knowledge to relax some of the worst-case assumptions underlying the theory.
    Wasserstein t-SNE. (arXiv:2205.07531v2 [cs.LG] UPDATED)
    Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means, however this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
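    The Gaussian approximation mentioned above has a closed form: for two Gaussians, $W_2^2 = \|\mu_1-\mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2})^{1/2})$. A sketch of the pipeline on toy data (variable names illustrative, not the authors' code):

        import numpy as np
        from scipy.linalg import sqrtm
        from sklearn.manifold import TSNE

        def gaussian_w2(mu1, cov1, mu2, cov2):
            """Closed-form 2-Wasserstein distance between two Gaussians."""
            s2 = np.real(sqrtm(cov2))
            cross = np.real(sqrtm(s2 @ cov1 @ s2))
            d2 = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
            return np.sqrt(max(d2, 0.0))

        rng = np.random.default_rng(0)
        units = [rng.normal(size=(100, 5)) + i for i in range(20)]  # toy units of samples
        params = [(u.mean(axis=0), np.cov(u.T)) for u in units]

        n = len(params)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = gaussian_w2(*params[i], *params[j])

        # Embed the units from the precomputed pairwise distance matrix.
        emb = TSNE(metric="precomputed", init="random", perplexity=5).fit_transform(D)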
    Single-Shot Optical Neural Network. (arXiv:2205.09103v2 [cs.ET] UPDATED)
    As deep neural networks (DNNs) grow to solve increasingly complex problems, they are becoming limited by the latency and power consumption of existing digital processors. For improved speed and energy efficiency, specialized analog optical and electronic hardware has been proposed, however, with limited scalability (input vector length $K$ of hundreds of elements). Here, we present a scalable, single-shot-per-layer analog optical processor that uses free-space optics to reconfigurably distribute an input vector and integrated optoelectronics for static, updatable weighting and the nonlinearity -- with $K \approx 1,000$ and beyond. We experimentally test classification accuracy of the MNIST handwritten digit dataset, achieving 94.7% (ground truth 96.3%) without data preprocessing or retraining on the hardware. We also determine the fundamental upper bound on throughput ($\sim$0.9 exaMAC/s), set by the maximum optical bandwidth before significant increase in error. Our combination of wide spectral and spatial bandwidths in a CMOS-compatible system enables highly efficient computing for next-generation DNNs.
    Robust Federated Learning via Over-The-Air Computation. (arXiv:2111.01221v4 [cs.LG] UPDATED)
    This paper investigates the robustness of over-the-air federated learning to Byzantine attacks. The simple averaging of the model updates via over-the-air computation makes the learning task vulnerable to random or intended modifications of the local model updates of some malicious clients. We propose a robust transmission and aggregation framework to such attacks while preserving the benefits of over-the-air computation for federated learning. For the proposed robust federated learning, the participating clients are randomly divided into groups and a transmission time slot is allocated to each group. The parameter server aggregates the results of the different groups using a robust aggregation technique and conveys the result to the clients for another training round. We also analyze the convergence of the proposed algorithm. Numerical simulations confirm the robustness of the proposed approach to Byzantine attacks.
    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-horizon Robot Manipulation Tasks. (arXiv:2112.03227v3 [cs.RO] UPDATED)
    General-purpose robots coexisting with humans in their environment must learn to relate human language to their perceptions and actions to be useful in a range of daily tasks. Moreover, they need to acquire a diverse repertoire of general-purpose skills that allow composing long-horizon tasks by following unconstrained language instructions. In this paper, we present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites. We evaluate the agents zero-shot on novel language instructions and on novel environments and objects. We show that a baseline model based on multi-context imitation learning performs poorly on CALVIN, suggesting that there is significant room for developing innovative agents that learn to relate human language to their world models with this benchmark.
    Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks. (arXiv:2203.00199v5 [cs.LG] UPDATED)
    Graph neural networks (GNN) have shown great advantages in many graph-based learning tasks but often fail to predict accurately for tasks based on sets of nodes, such as link/motif prediction. Many works have recently proposed to address this problem by using random node features or node distance features. However, they suffer from either slow convergence, inaccurate prediction, or high complexity. In this work, we revisit GNNs that allow using positional features of nodes given by positional encoding (PE) techniques such as Laplacian Eigenmap, Deepwalk, etc. GNNs with PE often get criticized because they are neither generalizable to unseen graphs (inductive) nor stable. Here, we study these issues in a principled way and propose a provable solution, a class of GNN layers termed PEG with rigorous mathematical analysis. PEG uses separate channels to update the original node features and positional features. PEG imposes permutation equivariance w.r.t. the original node features and imposes $O(p)$ (orthogonal group) equivariance w.r.t. the positional features simultaneously, where $p$ is the dimension of used positional features. Extensive link prediction experiments over 8 real-world networks demonstrate the advantages of PEG in generalization and scalability.
    Conditional Generative Data Augmentation for Clinical Audio Datasets. (arXiv:2203.11570v2 [cs.SD] UPDATED)
    In this work, we propose a novel data augmentation method for clinical audio datasets based on a conditional Wasserstein Generative Adversarial Network with Gradient Penalty (cWGAN-GP), operating on log-mel spectrograms. To validate our method, we created a clinical audio dataset which was recorded in a real-world operating room during Total Hip Arthroplasty (THA) procedures and contains typical sounds which resemble the different phases of the intervention. We demonstrate the capability of the proposed method to generate realistic class-conditioned samples from the dataset distribution and show that training with the generated augmented samples outperforms classical audio augmentation methods in terms of classification accuracy. The performance was evaluated using a ResNet-18 classifier which shows a mean per-class accuracy improvement of 1.70% in a 5-fold cross validation experiment using the proposed augmentation method. Because clinical data is often expensive to acquire, the development of realistic and high-quality data augmentation methods is crucial to improve the robustness and generalization capabilities of learning-based algorithms which is especially important for safety-critical medical applications. Therefore, the proposed data augmentation method is an important step towards improving the data bottleneck for clinical audio-based machine learning systems.
    Flashlight: Enabling Innovation in Tools for Machine Learning. (arXiv:2201.12465v2 [cs.LG] UPDATED)
    As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together. Flashlight is available at https://github.com/flashlight/flashlight .
    Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond. (arXiv:2202.13441v2 [cs.LG] UPDATED)
    An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such methods to generalize well has been attributed to their implicit bias towards large margin predictors, both asymptotically as well as in finite time. We give an additional unified explanation for this generalization and relate it to two simple properties of the optimization objective, that we refer to as realizability and self-boundedness. We introduce a general setting of unconstrained stochastic convex optimization with these properties, and analyze generalization of gradient methods through the lens of algorithmic stability. In this broader setting, we obtain sharp stability bounds for gradient descent and stochastic gradient descent which apply even for a very large number of gradient steps, and use them to derive general generalization bounds for these algorithms. Finally, as direct applications of the general bounds, we return to the setting of linear classification with separable data and establish several novel test loss and test accuracy bounds for gradient descent and stochastic gradient descent for a variety of loss functions with different tail decay rates. In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v3 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
    COLA: Consistent Learning with Opponent-Learning Awareness. (arXiv:2203.04098v2 [cs.LG] UPDATED)
    Learning in general-sum games is unstable and frequently leads to socially undesirable (Pareto-dominated) outcomes. To mitigate this, Learning with Opponent-Learning Awareness (LOLA) introduced opponent shaping to this setting, by accounting for each agent's influence on their opponents' anticipated learning steps. However, the original LOLA formulation (and follow-up work) is inconsistent because LOLA models other agents as naive learners rather than LOLA agents. In previous work, this inconsistency was suggested as a cause of LOLA's failure to preserve stable fixed points (SFPs). First, we formalize consistency and show that higher-order LOLA (HOLA) solves LOLA's inconsistency problem if it converges. Second, we correct a claim made in the literature by Sch\"afer and Anandkumar (2019), proving that Competitive Gradient Descent (CGD) does not recover HOLA as a series expansion (and fails to solve the consistency problem). Third, we propose a new method called Consistent LOLA (COLA), which learns update functions that are consistent under mutual opponent shaping. It requires no more than second-order derivatives and learns consistent update functions even when HOLA fails to converge. However, we also prove that even consistent update functions do not preserve SFPs, contradicting the hypothesis that this shortcoming is caused by LOLA's inconsistency. Finally, in an empirical evaluation on a set of general-sum games, we find that COLA finds prosocial solutions and that it converges under a wider range of learning rates than HOLA and LOLA. We support the latter finding with a theoretical result for a simple game.
    A walk through of time series analysis on quantum computers. (arXiv:2205.00986v2 [quant-ph] UPDATED)
    Because of the rotational components in quantum circuits, some quantum neural networks based on variational circuits can be considered equivalent to the classical Fourier networks. In addition, they can be used to predict the Fourier coefficients of continuous functions. Time series data indicates a state of a variable in time. Since some time series data can also be considered as continuous functions, we can expect quantum machine learning models to do many data analysis tasks successfully on time series data. Therefore, it is important to investigate new quantum logics for temporal data processing and analyze intrinsic relationships of data on quantum computers. In this paper, we go through the quantum analogues of classical data preprocessing and forecasting with ARIMA models by using simple quantum operators requiring only a few quantum gates. Then we discuss future directions and some of the tools/algorithms that can be used for temporal data analysis on quantum computers.
    Sequential Importance Sampling for Hybrid Model Bayesian Inference to Support Bioprocess Mechanism Learning and Robust Control. (arXiv:2205.02410v3 [stat.ML] UPDATED)
    Driven by the critical needs of biomanufacturing 4.0, we introduce a probabilistic knowledge graph hybrid model characterizing the risk- and science-based understanding of bioprocess mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given very limited real process observations, we derive a posterior distribution quantifying model estimation uncertainty. To avoid the evaluation of intractable likelihoods, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is utilized to approximate the posterior distribution. Under high stochastic and model uncertainties, it is computationally expensive to match output trajectories. Therefore, we create a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC approach. Through matching the summary statistics driven through LG-DBN likelihood that can capture critical interactions and variations, the proposed algorithm can accelerate hybrid model inference, support process monitoring, and facilitate mechanism learning and robust control.
    Adversarial Learning with Cost-Sensitive Classes. (arXiv:2101.12372v2 [cs.LG] UPDATED)
    It is necessary to improve the performance of some special classes or to particularly protect them from attacks in adversarial learning. This paper proposes a framework combining cost-sensitive classification and adversarial learning together to train a model that can distinguish between protected and unprotected classes, such that the protected classes are less vulnerable to adversarial examples. We find in this framework an interesting phenomenon during the training of deep neural networks, called the Min-Max property: the absolute values of most parameters in the convolutional layer approach zero, while the absolute values of a few parameters grow significantly larger. Based on this Min-Max property, which is formulated and analyzed from the viewpoint of random distributions, we further build a new defense model against adversarial examples for adversarial robustness improvement. An advantage of the built model is that it performs better than the standard one and can combine with adversarial training to achieve an improved performance. It is experimentally confirmed that, regarding the average accuracy of all classes, our model performs almost the same as the existing models when an attack does not occur and better than the existing models when an attack occurs. Specifically, regarding the accuracy of protected classes, the proposed model is much better than the existing models when an attack occurs.
    MaskViT: Masked Visual Pre-Training for Video Prediction. (arXiv:2206.11894v1 [cs.CV])
    The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
    Importance of Kernel Bandwidth in Quantum Machine Learning. (arXiv:2111.05451v3 [quant-ph] UPDATED)
    Quantum kernel methods are considered a promising avenue for applying quantum computers to machine learning problems. Identifying hyperparameters controlling the inductive bias of quantum machine learning models is expected to be crucial given the central role hyperparameters play in determining the performance of classical machine learning methods. In this work we introduce the hyperparameter controlling the bandwidth of a quantum kernel and show that it controls the expressivity of the resulting model. We use extensive numerical experiments with multiple quantum kernels and classical datasets to show consistent change in the model behavior from underfitting (bandwidth too large) to overfitting (bandwidth too small), with optimal generalization in between. We draw a connection between the bandwidth of classical and quantum kernels and show analogous behavior in both cases. Furthermore, we show that optimizing the bandwidth can help mitigate the exponential decay of kernel values with qubit count, which is the cause behind recent observations that the performance of quantum kernel methods decreases with qubit count. We reproduce these negative results and show that if the kernel bandwidth is optimized, the performance instead improves with growing qubit count and becomes competitive with the best classical methods.
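    The classical analogue drawn in the abstract is easy to reproduce: for an RBF kernel $k(x,y)=\exp(-\gamma\|x-y\|^2)$, the bandwidth scales as $1/\sqrt{\gamma}$, and sweeping $\gamma$ moves an SVM from underfitting (bandwidth too large) to overfitting (bandwidth too small). A quick scikit-learn check on toy data:

        from sklearn.datasets import make_moons
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
        for gamma in [1e-3, 1e-1, 1.0, 10.0, 1e3]:
            # Accuracy typically peaks at an intermediate gamma and drops at both extremes.
            acc = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
            print(f"gamma={gamma:g}  cv accuracy={acc:.3f}")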
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v2 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
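    Below is a toy instance of the two-step recipe, with PCA as the dimensionality-reduction step and a Gaussian mixture as the likelihood-based density estimator in latent space. The paper's procedures are more general; this only illustrates the shape of the approach.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.mixture import GaussianMixture

        # Toy data lying on a 5-dimensional linear "manifold" in 50-d ambient space.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 50))

        pca = PCA(n_components=5).fit(X)               # step 1: reduce dimension
        Z = pca.transform(X)
        gmm = GaussianMixture(n_components=10, random_state=0).fit(Z)  # step 2: fit a density

        Z_new, _ = gmm.sample(100)                     # sample in latent space...
        X_new = pca.inverse_transform(Z_new)           # ...and decode back to data space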
    LEAN: graph-based pruning for convolutional neural networks by extracting longest chains. (arXiv:2011.06923v3 [cs.LG] UPDATED)
    Neural network pruning techniques can substantially reduce the computational cost of applying convolutional neural networks (CNNs). Common pruning methods determine which convolutional filters to remove by ranking the filters individually, i.e., without taking into account their interdependence. In this paper, we advocate the viewpoint that pruning should consider the interdependence between series of consecutive operators. We propose the LongEst-chAiN (LEAN) method that prunes CNNs by using graph-based algorithms to select relevant chains of convolutions. A CNN is interpreted as a graph, with the operator norm of each operator as distance metric for the edges. LEAN pruning iteratively extracts the highest value path from the graph to keep. In our experiments, we test LEAN pruning on several image-to-image tasks, including the well-known CamVid dataset, and a real-world X-ray CT dataset. Results indicate that LEAN pruning can result in networks with similar accuracy, while using 1.7-12x fewer convolutional filters than existing approaches.
    Keys to Accurate Feature Extraction Using Residual Spiking Neural Networks. (arXiv:2111.05955v4 [cs.LG] UPDATED)
    Spiking neural networks (SNNs) have become an interesting alternative to conventional artificial neural networks (ANN) thanks to their temporal processing capabilities and energy efficient implementations in neuromorphic hardware. However, the challenges involved in training SNNs have limited their performance in terms of accuracy and thus their applications. Improving learning algorithms and neural architectures for a more accurate feature extraction is therefore one of the current priorities in SNN research. In this paper, we present a study on the key components of modern spiking architectures. We design a spiking version of the successful residual network architecture and provide an in-depth study on the possible implementations of spiking residual connections. This study shows how, depending on the use case, the optimal residual connection implementation may vary. Additionally, we empirically compare different techniques in image classification datasets taken from the best performing networks. Our results provide a state of the art guide to SNN design, which allows to make informed choices when trying to build the optimal visual feature extractor. Finally, our network outperforms previous SNN architectures in CIFAR-10 (94.14%) and CIFAR-100 (74.65%) datasets and matches the state of the art in DVS-CIFAR10 (72.98%), with fewer parameters than the previous state of the art and without the need for ANN-SNN conversion. Code available at https://github.com/VicenteAlex/Spiking_ResNet
    Teacher Model Fingerprinting Attacks Against Transfer Learning. (arXiv:2106.12478v2 [cs.CR] UPDATED)
    Transfer learning has become a common solution to address training data scarcity in practice. It trains a specified student model by reusing or fine-tuning early layers of a well-trained teacher model that is usually publicly available. However, besides utility improvement, the transferred public knowledge also brings potential threats to model confidentiality, and even further raises other security and privacy issues. In this paper, we present the first comprehensive investigation of the teacher model exposure threat in the transfer learning context, aiming to gain a deeper insight into the tension between public knowledge and model confidentiality. To this end, we propose a teacher model fingerprinting attack to infer the origin of a student model, i.e., the teacher model it transfers from. Specifically, we propose a novel optimization-based method to carefully generate queries to probe the student model to realize our attack. Unlike existing model reverse engineering approaches, our proposed fingerprinting method neither relies on fine-grained model outputs, e.g., posteriors, nor auxiliary information of the model architecture or training dataset. We systematically evaluate the effectiveness of our proposed attack. The empirical results demonstrate that our attack can accurately identify the model origin with few probing queries. Moreover, we show that the proposed attack can serve as a stepping stone to facilitating other attacks against machine learning models, such as model stealing.
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v4 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reduce noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS into subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v4 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, Hermite polynomial features seem better suited for private data generation than random Fourier features.
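    A sketch of the core idea: build (probabilists') Hermite polynomial features with numpy and compare two mean embeddings. The per-order scaling and the privatization step used in the paper are omitted here; this only shows the feature map.

        import numpy as np
        from numpy.polynomial.hermite_e import hermevander

        def hermite_mean_embedding(x, order=6):
            """Mean embedding of 1-D samples under features He_0(x), ..., He_order(x)."""
            feats = hermevander(x, order)   # shape (n_samples, order + 1)
            return feats.mean(axis=0)

        rng = np.random.default_rng(0)
        a = rng.normal(size=5000)
        b = 0.3 + 1.1 * rng.normal(size=5000)
        # Distance between embeddings separates the two distributions with only
        # order + 1 = 7 features, versus tens of thousands of random features.
        mmd2 = np.sum((hermite_mean_embedding(a) - hermite_mean_embedding(b)) ** 2)
        print(f"squared embedding distance: {mmd2:.4f}")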
    Discriminative Similarity for Data Clustering. (arXiv:2109.08675v3 [cs.LG] UPDATED)
    Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose {\em Clustering by Discriminative Similarity (CDS)}, a novel method which learns discriminative similarity for data clustering. CDS learns an unsupervised similarity-based classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learnt classifiers associated with the data partitions. By generalization analysis via Rademacher complexity, the generalization error bound for the unsupervised similarity-based classifier is expressed as the sum of discriminative similarity between the data from different classes. It is proved that the derived discriminative similarity can also be induced by the integrated squared error bound for kernel density classification. In order to evaluate the performance of the proposed discriminative similarity, we propose a new clustering method using a kernel as the similarity function, CDS via unsupervised kernel classification (CDSK), with its effectiveness demonstrated by experimental results.
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v1 [cs.LG])
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space. (arXiv:2206.11895v1 [cs.CV])
    Humans are remarkably flexible in understanding viewpoint changes due to the visual cortex supporting the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged-in into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html
    Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters. (arXiv:2003.12739v3 [cs.CV] UPDATED)
    How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves segmentation of objects especially when input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
    On compression rate of quantum autoencoders: Control design, numerical and experimental realization. (arXiv:2005.11149v2 [quant-ph] UPDATED)
    Quantum autoencoders, which aim at compressing quantum information in a low-dimensional latent space, lie at the heart of automatic data compression in the field of quantum information. In this paper, we establish an upper bound on the compression rate for a given quantum autoencoder and present a learning control approach for training the autoencoder to achieve the maximal compression rate. The upper bound on the compression rate, which is determined by the eigenvalues of the density matrix representation of the input states, is proven theoretically using eigen-decomposition and matrix differentiation. Numerical results on 2-qubit and 3-qubit systems are presented to demonstrate how to train the quantum autoencoder to achieve the theoretically maximal compression, and the training performance using different machine learning algorithms is compared. Experimental results of a quantum autoencoder using quantum optical systems are illustrated for compressing two 2-qubit states into two 1-qubit states.
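    Since the maximal compression rate is governed by the eigenvalues of the input density matrix, a short numerical check conveys the idea; the random 2-qubit state below is made up for illustration and is not one of the paper's examples.

        # Sketch: inspect the spectrum of a toy 2-qubit density matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        # Random valid density matrix: rho = A A^dagger / tr(A A^dagger).
        A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
        rho = A @ A.conj().T
        rho /= np.trace(rho).real

        eigvals = np.sort(np.linalg.eigvalsh(rho))[::-1]
        print("eigenvalues:", np.round(eigvals, 3))
        # The mass carried by the leading eigenvalues indicates how well the
        # input can be compressed into a smaller latent space (per the bound).
        print("top-2 mass:", round(eigvals[:2].sum(), 3))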
    Approximation Benefits of Policy Gradient Methods with Aggregated States. (arXiv:2007.11684v3 [cs.LG] UPDATED)
    Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. (arXiv:2106.10270v2 [cs.CV] UPDATED)
    Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
    Identify treatment effect patterns for personalised decisions. (arXiv:1906.06080v2 [stat.ME] UPDATED)
    In personalised decision making, evidence is required to determine whether an action (treatment) is suitable for an individual. Such evidence can be obtained by modelling treatment effect heterogeneity in subgroups. The existing interpretable modelling methods take a top-down approach to search for subgroups with heterogeneous treatment effects, and they may miss the most specific and relevant context for an individual. In this paper, we design a \emph{Treatment effect pattern (TEP)} to represent treatment effect heterogeneity in data. To achieve an interpretable presentation of TEPs, we use a local causal structure around the outcome to explicitly show how those important variables are used in modelling. We also derive a formula for unbiasedly estimating the \emph{Conditional Average Causal Effect (CATE)} using the local structure in our problem setting. In the discovery process, we aim at minimising heterogeneity within each subgroup represented by a pattern. We propose a bottom-up search algorithm to discover the most specific patterns that best fit individual circumstances for personalised decision making. Experiments show that the proposed method models treatment effect heterogeneity better than three existing tree-based methods on synthetic and real-world data sets.
    Predicting the meal macronutrient composition from continuous glucose monitors. (arXiv:2206.11878v1 [q-bio.QM])
    Sustained high levels of blood glucose in type 2 diabetes (T2DM) can have disastrous long-term health consequences. An essential component of clinical interventions for T2DM is monitoring dietary intake to keep plasma glucose levels within an acceptable range. Yet, current techniques to monitor food intake are time-intensive and error-prone. To address this issue, we are developing techniques to automatically monitor food intake and the composition of those foods using continuous glucose monitors (CGMs). This article presents the results of a clinical study in which participants consumed nine standardized meals with known macronutrient amounts (carbohydrate, protein, and fat) while wearing a CGM. We built a multitask neural network to estimate the macronutrient composition from the CGM signal, and compared it against a baseline linear regression. The best prediction result comes from our proposed neural network, trained with subject-dependent data, as measured by root mean squared relative error and correlation coefficient. These findings suggest that it is possible to estimate macronutrient composition from CGM signals, opening the possibility to develop automatic techniques to track food intake.
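    A multitask network of this kind can be sketched with a shared trunk and one regression head per macronutrient; the window length, layer sizes, and random data below are assumptions for illustration, not the study's actual architecture.

        # Sketch: multitask regressor from a CGM window to macronutrient amounts.
        import torch
        import torch.nn as nn

        class MacroNet(nn.Module):
            def __init__(self, window_len=120):
                super().__init__()
                self.trunk = nn.Sequential(
                    nn.Linear(window_len, 64), nn.ReLU(),
                    nn.Linear(64, 32), nn.ReLU(),
                )
                # One regression head per macronutrient (shared trunk = multitask).
                self.heads = nn.ModuleDict({
                    name: nn.Linear(32, 1) for name in ("carbs", "protein", "fat")
                })

            def forward(self, x):
                h = self.trunk(x)
                return {name: head(h).squeeze(-1) for name, head in self.heads.items()}

        model = MacroNet()
        x = torch.randn(8, 120)                    # batch of 8 toy CGM windows
        targets = {k: torch.rand(8) * 50 for k in ("carbs", "protein", "fat")}
        preds = model(x)
        loss = sum(nn.functional.mse_loss(preds[k], targets[k]) for k in preds)
        loss.backward()
        print(float(loss))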
    Quantum Approximation of Normalized Schatten Norms and Applications to Learning. (arXiv:2206.11506v1 [quant-ph])
    Efficient measures to determine similarity of quantum states, such as the fidelity metric, have been widely studied. In this paper, we address the problem of defining a similarity measure for quantum operations that can be \textit{efficiently estimated}. Given two quantum operations, $U_1$ and $U_2$, represented in their circuit forms, we first develop a quantum sampling circuit to estimate the normalized Schatten 2-norm of their difference ($\| U_1-U_2 \|_{S_2}$) with precision $\epsilon$, using only one clean qubit and one classical random variable. We prove a Poly$(\frac{1}{\epsilon})$ upper bound on the sample complexity, which is independent of the size of the quantum system. We then show that such a similarity metric is directly related to a functional definition of similarity of unitary operations using the conventional fidelity metric of quantum states ($F$): If $\| U_1-U_2 \|_{S_2}$ is sufficiently small (e.g. $ \leq \frac{\epsilon}{1+\sqrt{2(1/\delta - 1)}}$) then the fidelity of states obtained by processing the same randomly and uniformly picked pure state, $|\psi \rangle$, is as high as needed ($F({U}_1 |\psi \rangle, {U}_2 |\psi \rangle)\geq 1-\epsilon$) with probability exceeding $1-\delta$. We provide example applications of this efficient similarity metric estimation framework to quantum circuit learning tasks, such as finding the square root of a given unitary operation.
    Factorization of the Partial Covariance in Singly-Connected Path Diagrams. (arXiv:2002.05226v6 [stat.ME] UPDATED)
    We extend path analysis by showing that, for a singly-connected path diagram, the partial covariance of two random variables factorizes over the nodes and edges in the path between the variables. This result allows us to determine the contribution of each node and edge to the partial covariance. It also allows us to show that Simpson's paradox cannot occur in singly-connected path diagrams.
    MHNF: Multi-hop Heterogeneous Neighborhood information Fusion graph representation learning. (arXiv:2106.09289v2 [cs.LG] UPDATED)
    The attention mechanism enables graph neural networks (GNNs) to learn the attention weights between the target node and its one-hop neighbors, thereby further improving performance. However, most existing GNNs are designed for homogeneous graphs, in which each layer can only aggregate information from one-hop neighbors. Stacking multilayer networks introduces considerable noise and easily leads to over-smoothing. We propose here a multihop heterogeneous neighborhood information fusion graph representation learning method (MHNF). Specifically, we propose a hybrid metapath autonomous extraction model to efficiently extract multihop hybrid neighbors. Then, we formulate a hop-level heterogeneous information aggregation model, which selectively aggregates different-hop neighborhood information within the same hybrid metapath. Finally, a hierarchical semantic attention fusion model (HSAF) is constructed, which can efficiently integrate different-hop and different-path neighborhood information. In this fashion, this paper solves the problem of aggregating multihop neighborhood information and learning hybrid metapaths for target tasks. This mitigates the limitation of manually specifying metapaths. In addition, HSAF can extract the internal node information of the metapaths and better integrate the semantic information present at different levels. Experimental results on real datasets show that MHNF achieves the best or competitive performance against state-of-the-art baselines with only 1/10 to 1/100 of the parameters and computational budget. Our code is publicly available at https://github.com/PHD-lanyu/MHNF.
    Graph Neural Networks for Temperature-Dependent Activity Coefficient Prediction of Solutes in Ionic Liquids. (arXiv:2206.11776v1 [cs.LG])
    Ionic liquids (ILs) are important solvents for sustainable processes and predicting activity coefficients (ACs) of solutes in ILs is needed. Recently, matrix completion methods (MCMs), transformers, and graph neural networks (GNNs) have shown high accuracy in predicting ACs of binary mixtures, superior to well-established models, e.g., COSMO-RS and UNIFAC. GNNs are particularly promising here as they learn a molecular graph-to-property relationship without pretraining, typically required for transformers, and are, unlike MCMs, applicable to molecules not included in training. For ILs, however, GNN applications are currently missing. Herein, we present a GNN to predict temperature-dependent infinite dilution ACs of solutes in ILs. We train the GNN on a database including more than 40,000 AC values and compare it to a state-of-the-art MCM. The GNN and MCM achieve similar high prediction performance, with the GNN additionally enabling high-quality predictions for ACs of solutions that contain ILs and solutes not considered during training.
    Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. (arXiv:2206.11795v1 [cs.LG])
    Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.
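    The pipeline reduces to three steps: train an inverse dynamics model (IDM) on a small labeled set, pseudo-label the unlabeled video with it, then behavior-clone a prior on the pseudo-labels. A hedged PyTorch sketch follows; `idm`, `prior`, and the tensors are toy placeholders, not the released VPT models.

        # Sketch of the pseudo-labeling pipeline behind VPT.
        import torch
        import torch.nn as nn

        frame_dim, n_actions = 512, 16                # toy feature / action sizes
        idm = nn.Linear(2 * frame_dim, n_actions)     # p(a_t | o_t, o_{t+1})
        prior = nn.Linear(frame_dim, n_actions)       # p(a_t | o_t), causal prior

        # Step 1 (assumed done): idm was trained on the small labeled set.
        # Step 2: pseudo-label a long unlabeled video with the IDM.
        video = torch.randn(1000, frame_dim)
        with torch.no_grad():
            pairs = torch.cat([video[:-1], video[1:]], dim=-1)
            pseudo_actions = idm(pairs).argmax(dim=-1)

        # Step 3: behavior-clone the prior on (frame, pseudo-action) pairs.
        opt = torch.optim.Adam(prior.parameters(), lr=1e-3)
        loss = nn.functional.cross_entropy(prior(video[:-1]), pseudo_actions)
        opt.zero_grad(); loss.backward(); opt.step()
        print(float(loss))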
    Incorporating Hidden Layer representation into Adversarial Attacks and Defences. (arXiv:2011.14045v2 [cs.LG] UPDATED)
    In this paper, we propose a defence strategy to improve adversarial robustness by incorporating hidden layer representation. The key idea of this defence strategy is to compress or filter input information, including adversarial perturbations. The strategy can be regarded as an activation function that can be applied to any kind of neural network. We also prove theoretically the effectiveness of this defence strategy under certain conditions. In addition, by incorporating hidden layer representation, we propose three types of adversarial attacks that generate three corresponding types of adversarial examples. The experiments show that our defence method can significantly improve the adversarial robustness of deep neural networks and achieves state-of-the-art performance, even though we do not adopt adversarial training.
    Layer-wise and Dimension-wise Locally Adaptive Federated Learning. (arXiv:2110.00532v3 [cs.LG] UPDATED)
    In the emerging paradigm of Federated Learning (FL), a large number of clients, such as mobile devices, are used to train possibly high-dimensional models on their respective data. Combining (dimension-wise) adaptive gradient methods (e.g. Adam, AMSGrad) with FL has been an active direction and is shown to outperform traditional SGD-based FL in many cases. In this paper, we focus on the problem of training federated deep neural networks, and propose a novel FL framework which further introduces layer-wise adaptivity to the local model updates. Our framework can be applied to locally adaptive FL methods including two recent algorithms, Mime and Fed-AMS. Theoretically, we provide a convergence analysis of our layer-wise FL methods, coined Fed-LAMB and Mime-LAMB, which matches the convergence rate of state-of-the-art results in FL and exhibits linear speedup in terms of the number of workers. Experimental results on various datasets and models, under both IID and non-IID local data settings, show that both Fed-LAMB and Mime-LAMB achieve faster convergence speed and better generalization performance, compared to various recent adaptive FL methods.
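    Layer-wise adaptivity in the LAMB style that Fed-LAMB builds on can be sketched as rescaling each layer's dimension-wise adaptive update by a per-layer trust ratio; the single-worker loop and hyperparameters below are illustrative only, not the federated algorithm itself.

        # Sketch: a LAMB-style step; each layer's update is scaled by
        # ||w|| / ||update|| on top of dimension-wise adaptive moments.
        import torch

        @torch.no_grad()
        def lamb_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
            for i, (w, g) in enumerate(zip(params, grads)):
                m, v = state.get(i, (torch.zeros_like(w), torch.zeros_like(w)))
                m = betas[0] * m + (1 - betas[0]) * g        # dimension-wise moments
                v = betas[1] * v + (1 - betas[1]) * g * g
                state[i] = (m, v)
                update = m / (v.sqrt() + eps)
                trust = w.norm() / (update.norm() + eps)     # layer-wise trust ratio
                w -= lr * trust * update

        params = [torch.randn(10, 10), torch.randn(10)]      # two toy "layers"
        grads = [torch.randn_like(p) for p in params]
        lamb_step(params, grads, state={})
        print(params[0].norm())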
    Efficient Transformer-based Speech Enhancement Using Long Frames and STFT Magnitudes. (arXiv:2206.11703v1 [eess.AS])
    The SepFormer architecture shows very good results in speech separation. Like other learned-encoder models, it uses short frames, as they have been shown to obtain better performance in these cases. This results in a large number of frames at the input, which is problematic; since the SepFormer is transformer-based, its computational complexity drastically increases with longer sequences. In this paper, we employ the SepFormer in a speech enhancement task and show that by replacing the learned-encoder features with a magnitude short-time Fourier transform (STFT) representation, we can use long frames without compromising perceptual enhancement performance. We obtained equivalent quality and intelligibility evaluation scores while reducing the number of operations by a factor of approximately 8 for a 10-second utterance.
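    The frontend swap the paper describes, magnitude STFT with long frames instead of a learned short-frame encoder, is a one-liner in PyTorch; the frame sizes below illustrate the trade-off and are not the paper's exact configuration.

        # Sketch: magnitude STFT features with long frames, so a transformer
        # downstream sees far fewer time steps than with a short learned encoder.
        import torch

        waveform = torch.randn(1, 16000 * 10)          # 10 s of 16 kHz audio
        n_fft, hop = 512, 256                          # long frames, long hop
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                          window=torch.hann_window(n_fft), return_complex=True)
        magnitude = spec.abs()                         # (1, n_fft//2 + 1, frames)
        print(magnitude.shape)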
    Measuring the Feasibility of Analogical Transfer using Complexity. (arXiv:2206.11753v1 [cs.AI])
    Analogies are 4-ary relations of the form "A is to B as C is to D". While the focus has mostly been on how to solve an analogy, i.e. how to find the correct value of D given A, B and C, less attention has been paid to whether solving such an analogy is actually feasible. In this paper, we propose a quantification of the transferability of a source case (A and B) to solve a target problem C. This quantification is based on a complexity minimization principle which has been demonstrated to be efficient for solving analogies. We illustrate these notions on morphological analogies and show their connections with machine learning, and in particular with Unsupervised Domain Adaptation.
    Non-Determinism and the Lawlessness of ML Code. (arXiv:2206.11834v1 [cs.CY])
    Legal literature on machine learning (ML) tends to focus on harms, and as a result tends to reason about individual model outcomes and summary error rates. This focus on model-level outcomes and errors has masked important aspects of ML that are rooted in its inherent non-determinism. We show that the effects of non-determinism, and consequently its implications for the law, instead become clearer from the perspective of reasoning about ML outputs as probability distributions over possible outcomes. This distributional viewpoint accounts for non-determinism by emphasizing the possible outcomes of ML. Importantly, this type of reasoning is not mutually exclusive with current legal reasoning; it complements (and in fact can strengthen) analyses concerning individual, concrete outcomes for specific automated decisions. By clarifying the important role of non-determinism, we demonstrate that ML code falls outside of the cyberlaw frame of treating "code as law," as this frame assumes that code is deterministic. We conclude with a brief discussion of what work ML can do to constrain the potentially harm-inducing effects of non-determinism, and we clarify where the law must do work to bridge the gap between its current individual-outcome focus and the distributional approach that we recommend.
    Capacity Optimality of OAMP in Coded Large Unitarily Invariant Systems. (arXiv:2206.11680v1 [cs.IT])
    This paper investigates a large unitarily invariant system (LUIS) involving a unitarily invariant sensing matrix, an arbitrary fixed signal distribution, and forward error control (FEC) coding. Several area properties are established based on the state evolution of orthogonal approximate message passing (OAMP) in an un-coded LUIS. Under the assumptions that the state evolution for joint OAMP and FEC decoding is correct and the replica method is reliable, we analyze the achievable rate of OAMP. We prove that OAMP reaches the constrained capacity predicted by the replica method of the LUIS with an arbitrary signal distribution based on matched FEC coding. Meanwhile, we develop a constrained capacity-achieving coding principle for LUIS, based on which irregular low-density parity-check (LDPC) codes are optimized for binary signaling in the simulation results. We show that OAMP with the optimized codes achieves significant performance improvement over the un-optimized ones and the well-known Turbo linear MMSE algorithm. For quadrature phase-shift keying (QPSK) modulation, constrained capacity-approaching bit error rate (BER) performances are observed under various channel conditions.
    Chasing Convex Bodies and Functions with Black-Box Advice. (arXiv:2206.11780v1 [cs.LG])
    We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as $\textit{consistency}$, while also ensuring worst-case $\textit{robustness}$ even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, INTERP, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, BDINTERP, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that BDINTERP achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.
    A Topological characterisation of Weisfeiler-Leman equivalence classes. (arXiv:2206.11876v1 [cs.LG])
    Graph Neural Networks (GNNs) are learning models aimed at processing graphs and signals on graphs. The most popular and successful GNNs are based on message passing schemes. Such schemes inherently have limited expressive power when it comes to distinguishing two non-isomorphic graphs. In this article, we rely on the theory of covering spaces to fully characterize the classes of graphs that GNNs cannot distinguish. We then generate arbitrarily many non-isomorphic graphs that cannot be distinguished by GNNs, leading to the GraphCovers dataset. We also show that the number of indistinguishable graphs in our dataset grows super-exponentially with the number of nodes. Finally, we test the GraphCovers dataset on several GNN architectures, showing that none of them can distinguish any two graphs it contains.
    AST-Probe: Recovering abstract syntax trees from hidden representations of pre-trained language models. (arXiv:2206.11719v1 [cs.CL])
    The objective of pre-trained language models is to learn contextual representations of textual data. Pre-trained language models have become mainstream in natural language processing and code modeling. Using probes, a technique to study the linguistic properties of hidden vector spaces, previous works have shown that these pre-trained language models encode simple linguistic properties in their hidden representations. However, none of the previous work assessed whether these models encode the whole grammatical structure of a programming language. In this paper, we prove the existence of a \textit{syntactic subspace}, lying in the hidden representations of pre-trained language models, which contains the syntactic information of the programming language. We show that this subspace can be extracted from the models' representations and define a novel probing method, the AST-Probe, that enables recovering the whole abstract syntax tree (AST) of an input code snippet. In our experiments, we show that this syntactic subspace exists in five state-of-the-art pre-trained language models. In addition, we highlight that the middle layers of the models are the ones that encode most of the AST information. Finally, we estimate the optimal size of this syntactic subspace and show that its dimension is substantially lower than those of the models' representation spaces. This suggests that pre-trained language models use a small part of their representation spaces to encode syntactic information of the programming languages.
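    The idea of searching for a syntactic subspace can be illustrated with a structural-probe-style objective: learn a low-rank projection so that distances between projected token vectors match pairwise AST distances. The random hidden states and tree distances below are placeholders; the actual AST-Probe objective may differ.

        # Sketch: fit a rank-r projection B so that squared distances in the
        # projected space approximate pairwise AST (tree) distances.
        import torch

        n_tokens, hidden, rank = 12, 768, 8
        H = torch.randn(n_tokens, hidden)               # hidden states, one snippet
        D_tree = torch.randint(1, 6, (n_tokens, n_tokens)).float()
        D_tree = (D_tree + D_tree.T) / 2                # symmetric toy distances
        D_tree.fill_diagonal_(0)

        B = torch.randn(hidden, rank, requires_grad=True)
        opt = torch.optim.Adam([B], lr=1e-2)
        for _ in range(100):
            P = H @ B                                   # project into the subspace
            d = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
            loss = (d - D_tree).abs().mean()
            opt.zero_grad(); loss.backward(); opt.step()
        print(float(loss))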
    Sample Condensation in Online Continual Learning. (arXiv:2206.11849v1 [cs.LG])
    Online continual learning is a challenging learning scenario where the model must learn from a non-stationary stream of data in which each sample is seen only once. The main challenge is to incrementally learn while avoiding catastrophic forgetting, namely the problem of forgetting previously acquired knowledge while learning from new data. A popular solution in this scenario is to use a small memory to retain old data and rehearse them over time. Unfortunately, due to the limited memory size, the quality of the memory will deteriorate over time. In this paper we propose OLCGM, a novel replay-based continual learning strategy that uses knowledge condensation techniques to continuously compress the memory and achieve a better use of its limited size. The sample condensation step compresses old samples, instead of removing them like other replay strategies. As a result, the experiments show that, whenever the memory budget is limited compared to the complexity of the data, OLCGM improves the final accuracy compared to state-of-the-art replay strategies.
    Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-\L{}ojasiewicz Functions when the Non-Convexity is Averaged-Out. (arXiv:2206.11872v1 [math.OC])
    Heavy Ball (HB) is nowadays one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, progress on establishing its theoretical foundation of acceleration is apparently far behind its empirical success. Existing provable acceleration results concern quadratic or close-to-quadratic functions, as the current techniques for showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, a class of Polyak-\L{}ojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB is identified. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter.
    Single-phase deep learning in cortico-cortical networks. (arXiv:2206.11769v1 [q-bio.NC])
    The error-backpropagation (backprop) algorithm remains the most common solution to the credit assignment problem in artificial neural networks. In neuroscience, it is unclear whether the brain could adopt a similar strategy to correctly modify its synapses. Recent models have attempted to bridge this gap while being consistent with a range of experimental observations. However, these models are either unable to effectively backpropagate error signals across multiple layers or require a multi-phase learning process, neither of which is reminiscent of learning in the brain. Here, we introduce a new model, bursting cortico-cortical networks (BurstCCN), which solves these issues by integrating known properties of cortical networks, namely bursting activity, short-term plasticity (STP) and dendrite-targeting interneurons. BurstCCN relies on burst multiplexing via connection-type-specific STP to propagate backprop-like error signals within deep cortical networks. These error signals are encoded at distal dendrites and induce burst-dependent plasticity as a result of excitatory-inhibitory top-down inputs. First, we demonstrate that our model can effectively backpropagate errors through multiple layers using a single-phase learning process. Next, we show both empirically and analytically that learning in our model approximates backprop-derived gradients. Finally, we demonstrate that our model is capable of learning complex image classification tasks (MNIST and CIFAR-10). Overall, our results suggest that cortical features across sub-cellular, cellular, microcircuit and systems levels jointly underlie single-phase efficient deep learning in the brain.
    Exploiting Transliterated Words for Finding Similarity in Inter-Language News Articles using Machine Learning. (arXiv:2206.11860v1 [cs.CL])
    Finding similarities between two inter-language news articles is a challenging problem in Natural Language Processing (NLP). Since it is difficult to find similar news articles in a language other than the user's native language, there is a need for an automatic, machine-learning-based system to find the similarity between two inter-language news articles. In this article, we propose a machine learning model combined with English-Urdu word transliteration that indicates whether an English news article is similar to an Urdu news article. Existing approaches to finding similarities have a major drawback when archives contain articles in low-resourced languages like Urdu alongside English news articles. We use a lexicon to link Urdu and English news articles. Because Urdu language processing applications, such as machine translation and text-to-speech, are unable to handle English text at the same time, this research proposes a transliteration-based technique to find similarities between English and Urdu news articles.
    LED: Latent Variable-based Estimation of Density. (arXiv:2206.11563v1 [cs.LG])
    Modern generative models are roughly divided into two main categories: (1) models that can produce high-quality random samples, but cannot estimate the exact density of new data points and (2) those that provide exact density estimation, at the expense of sample quality and compactness of the latent space. In this work we propose LED, a new generative model closely related to GANs, that allows not only efficient sampling but also efficient density estimation. By maximizing log-likelihood on the output of the discriminator, we arrive at an alternative adversarial optimization objective that encourages generated data diversity. This formulation provides insights into the relationships between several popular generative models. Additionally, we construct a flow-based generator that can compute exact probabilities for generated samples, while allowing low-dimensional latent variables as input. Our experimental results, on various datasets, show that our density estimator produces accurate estimates, while retaining good quality in the generated samples.
    Learning Agile Skills via Adversarial Imitation of Rough Partial Demonstrations. (arXiv:2206.11693v1 [cs.RO])
    Learning agile skills is one of the main challenges in robotics. To this end, reinforcement learning approaches have achieved impressive results. These methods require explicit task information in terms of a reward function or an expert that can be queried in simulation to provide a target control output, which limits their applicability. In this work, we propose a generative adversarial method for inferring reward functions from partial and potentially physically incompatible demonstrations, enabling successful skill acquisition where reference or expert demonstrations are not easily accessible. Moreover, we show that by using a Wasserstein GAN formulation and transitions from demonstrations with rough and partial information as input, we are able to extract policies that are robust and capable of imitating demonstrated behaviors. Finally, the obtained skills, such as a backflip, are tested on an agile quadruped robot called Solo 8 and faithfully replicate hand-held human demonstrations.
    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. (arXiv:2206.11706v1 [eess.AS])
    Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery from speech. As input tokens, the model takes a discretised encoding of speech from a vector quantised (VQ) neural network with 512 codes. The goal is then to map these 512 VQ codes to 50 phone-like units (topics) in order to more closely resemble true phones. In contrast to the base LDA, which only considers how VQ codes co-occur within utterances (documents), the Markov chain LDA additionally captures how consecutive codes follow one another. This extension leads to an increase in cluster quality and phone segmentation results compared to the base LDA. Compared to a recent vector quantised neural network approach that also learns 50 units, the extended LDA model performs better in phone segmentation but worse in mutual information.
    A Multi-Policy Framework for Deep Learning-Based Fake News Detection. (arXiv:2206.11866v1 [cs.CL])
    Connectivity plays an ever-increasing role in modern society, with people all around the world having easy access to rapidly disseminated information. However, a more interconnected society enables the spread of intentionally false information. To mitigate the negative impacts of fake news, it is essential to improve detection methodologies. This work introduces Multi-Policy Statement Checker (MPSC), a framework that automates fake news detection by using deep learning techniques to analyze a statement itself and its related news articles, predicting whether it is seemingly credible or suspicious. The proposed framework was evaluated using four merged datasets containing real and fake news. Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU) and Bidirectional Encoder Representations from Transformers (BERT) models were trained to utilize both lexical and syntactic features, and their performance was evaluated. The obtained results demonstrate that a multi-policy analysis reliably identifies suspicious statements, which can be advantageous for fake news detection.
    Lifelong Learning Natural Language Processing Approach for Multilingual Data Classification. (arXiv:2206.11867v1 [cs.CL])
    The abundance of information in digital media, which in today's world is the main source of knowledge about current events for the masses, makes it possible to spread disinformation on a larger scale than ever before. Consequently, there is a need to develop novel fake news detection approaches capable of adapting to changing factual contexts and generalizing previously or concurrently acquired knowledge. To deal with this problem, we propose a lifelong learning-inspired approach, which allows for fake news detection in multiple languages and the mutual transfer of knowledge acquired in each of them. Both classical feature extractors, such as Term frequency-inverse document frequency or Latent Dirichlet Allocation, and integrated deep NLP (Natural Language Processing) BERT (Bidirectional Encoder Representations from Transformers) models paired with MLP (Multilayer Perceptron) classifier, were employed. The results of experiments conducted on two datasets dedicated to the fake news classification task (in English and Spanish, respectively), supported by statistical analysis, confirmed that utilization of additional languages could improve performance for traditional methods. Also, in some cases supplementing the deep learning method with classical ones can positively impact obtained results. The ability of models to generalize the knowledge acquired between the analyzed languages was also observed.
    Human-in-the-Loop Large-Scale Predictive Maintenance of Workstations. (arXiv:2206.11574v1 [cs.LG])
    Predictive maintenance (PdM) is the task of scheduling maintenance operations based on a statistical analysis of the system's condition. We propose a human-in-the-loop PdM approach in which a machine learning system predicts future problems in sets of workstations (computers, laptops, and servers). Our system interacts with domain experts to improve predictions and elicit their knowledge. In our approach, domain experts are included in the loop not only as providers of correct labels, as in traditional active learning, but as a source of explicit decision rule feedback. The system is automated and designed to be easily extended to novel domains, such as maintaining workstations of several organizations. In addition, we develop a simulator for reproducible experiments in a controlled environment and deploy the system in a large-scale case of real-life workstation PdM with thousands of workstations for dozens of companies.
    Deep Reinforcement Learning-Assisted Federated Learning for Robust Short-term Utility Demand Forecasting in Electricity Wholesale Markets. (arXiv:2206.11715v1 [cs.DC])
    Short-term load forecasting (STLF) plays a significant role in the operation of electricity trading markets. Considering the growing concern over data privacy, federated learning (FL) is increasingly adopted to train STLF models for utility companies (UCs) in recent research. Promisingly, in wholesale markets, as it is not realistic for power plants (PPs) to access UCs' data directly, FL is definitely a feasible solution for obtaining an accurate STLF model for PPs. However, due to FL's distributed nature and intense competition among UCs, defects increasingly occur and lead to poor performance of the STLF model, indicating that simply adopting FL is not enough. In this paper, we propose a DRL-assisted FL approach, DEfect-AwaRe federated soft actor-critic (DearFSAC), to robustly train an accurate STLF model for PPs to forecast precise short-term utility electricity demand. First, we design a STLF model based on long short-term memory (LSTM) using just historical load data and time data. Furthermore, considering the uncertainty of defect occurrence, a deep reinforcement learning (DRL) algorithm is adopted to assist FL by alleviating model degradation caused by defects. In addition, for faster convergence of FL training, an auto-encoder is designed for both dimension reduction and quality evaluation of uploaded models. In the simulations, we validate our approach on real data of Helsinki's UCs in 2019. The results show that DearFSAC outperforms all the other approaches whether or not defects occur.
    NovelCraft: A Dataset for Novelty Detection and Discovery in Open Worlds. (arXiv:2206.11736v1 [cs.CV])
    In order for artificial agents to perform useful tasks in changing environments, they must be able to both detect and adapt to novelty. However, visual novelty detection research often only evaluates on repurposed datasets such as CIFAR-10 originally intended for object classification. This practice restricts novelties to well-framed images of distinct object types. We suggest that new benchmarks are needed to represent the challenges of navigating an open world. Our new NovelCraft dataset contains multi-modal episodic data of the images and symbolic world-states seen by an agent completing a pogo-stick assembly task within a video game world. In some episodes, we insert novel objects that can impact gameplay. Novelty can vary in size, position, and occlusion within complex scenes. We benchmark state-of-the-art novelty detection and generalized category discovery models with a focus on comprehensive evaluation. Results suggest an opportunity for future research: models aware of task-specific costs of different types of mistakes could more effectively detect and adapt to novelty in open worlds.
    Self-Supervised Training with Autoencoders for Visual Anomaly Detection. (arXiv:2206.11723v1 [cs.CV])
    Deep convolutional autoencoders provide an effective tool for learning non-linear dimensionality reduction in an unsupervised way. Recently, they have been used for the task of anomaly detection in the visual domain. The common belief is that a network trained to minimise the reconstruction error on anomaly-free examples will have difficulty reconstructing anomalous parts during the test phase. This is usually achieved by controlling the capacity of the network, either by reducing the size of the bottleneck layer or by enforcing sparsity constraints on its activations. However, neither of these techniques explicitly penalises the reconstruction of anomalous signals, often resulting in poor detection. We tackle this problem by adapting a self-supervised learning regime that allows the use of discriminative information during training while regularising the model to focus on the data manifold by means of a modified reconstruction error, resulting in accurate detection. Unlike related approaches, inference with the proposed method is very efficient during both training and prediction, processing the whole input image in a single step. Our experiments on the MVTec Anomaly Detection dataset demonstrate high recognition and localisation performance of the proposed method. On the texture subset, in particular, our approach consistently outperforms several recent anomaly detection methods by a large margin.
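    The reconstruction-error baseline that the paper improves upon is compact enough to sketch; the toy autoencoder and image below are assumptions, and the paper's contribution is the self-supervised, discriminative training on top of this scheme.

        # Sketch: per-pixel reconstruction error as an anomaly map.
        import torch
        import torch.nn as nn

        autoencoder = nn.Sequential(                   # toy conv autoencoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

        image = torch.rand(1, 3, 64, 64)               # assume model already trained
        with torch.no_grad():
            recon = autoencoder(image)
        anomaly_map = (image - recon).pow(2).mean(dim=1)   # per-pixel error
        score = anomaly_map.max().item()                   # image-level score
        print(anomaly_map.shape, score)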
    Urdu News Article Recommendation Model using Natural Language Processing Techniques. (arXiv:2206.11862v1 [cs.IR])
    There are several online newspapers in Urdu, but it is difficult for users to find the content they are looking for, because most of them contain irrelevant data and users often do not retrieve what they want. Our proposed framework helps predict Urdu news matching users' interests and reduces the time users spend searching for news. For this purpose, NLP techniques are used for pre-processing, and then TF-IDF with cosine similarity is used to find the most similar articles and recommend news based on user preferences. Moreover, the BERT language model is also used for similarity; similarity scores increase with the BERT model compared to TF-IDF, so the approach works better with the BERT language model and recommends news matching the user's interest. News is recommended when the similarity of the articles is above 60 percent.
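    The TF-IDF plus cosine-similarity step with the 60-percent cut-off can be sketched with scikit-learn; the toy English documents stand in for preprocessed Urdu articles, and the vectorizer settings are illustrative.

        # Sketch: TF-IDF + cosine similarity with a 0.60 recommendation threshold.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        articles = [
            "cricket team wins the series final",
            "government announces new budget for education",
            "star batsman leads cricket team to victory",
        ]
        user_history = "cricket series final highlights"

        vectorizer = TfidfVectorizer()
        doc_vecs = vectorizer.fit_transform(articles + [user_history])
        sims = cosine_similarity(doc_vecs[-1], doc_vecs[:-1]).ravel()

        recommended = [a for a, s in zip(articles, sims) if s > 0.60]
        print(list(zip(articles, sims.round(2))), recommended)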
    Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark. (arXiv:2206.11791v1 [cs.LG])
    We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency and introduce new generic optimizations and common workflows developed as a part of this work. The full workflow is presented from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $\mu$s and energy consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.
    Video Diffusion Models. (arXiv:2204.03458v2 [cs.CV] UPDATED)
    Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
    pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models. (arXiv:2206.11460v1 [cs.LG])
    Knowledge tracing (KT) is the task of using students' historical learning interaction data to model their knowledge mastery over time so as to make predictions on their future interaction performance. Recently, remarkable progress has been made in using various deep learning techniques to solve the KT problem. However, the success behind deep learning based knowledge tracing (DLKT) approaches is still left somewhat mysterious, and proper measurement and analysis of these DLKT approaches remain a challenge. First, data preprocessing procedures in existing works are often private and/or custom, which limits experimental standardization. Furthermore, existing DLKT studies often differ in terms of the evaluation protocol and are far away from real-world educational contexts. To address these problems, we introduce a comprehensive python based benchmark platform, \textsc{pyKT}, to guarantee valid comparisons across DLKT methods via thorough evaluations. The \textsc{pyKT} library consists of a standardized set of integrated data preprocessing procedures on 7 popular datasets across different domains, and 10 frequently compared DLKT model implementations for transparent experiments. Results from our fine-grained and rigorous empirical KT studies yield a set of observations and suggestions for effective DLKT, e.g., a wrong evaluation setting may cause label leakage that generally leads to performance inflation; and the improvement of many DLKT approaches is minimal compared to the very first DLKT model proposed by Piech et al. \cite{piech2015deep}. We have open sourced \textsc{pyKT} and our experimental results at \url{https://pykt.org/}. We welcome contributions from other research groups and practitioners.
    Explanatory causal effects for model agnostic explanations. (arXiv:2206.11529v1 [cs.LG])
    This paper studies the problem of estimating the contributions of features to the prediction of a specific instance by a machine learning model and the overall contribution of a feature to the model. The causal effect of a feature (variable) on the predicted outcome reflects the contribution of the feature to a prediction very well. A challenge is that most existing causal effects cannot be estimated from data without a known causal graph. In this paper, we define an explanatory causal effect based on a hypothetical ideal experiment. The definition brings several benefits to model agnostic explanations. First, explanations are transparent and have causal meanings. Second, the explanatory causal effect estimation can be data driven. Third, the causal effects provide both a local explanation for a specific prediction and a global explanation showing the overall importance of a feature in a predictive model. We further propose a method using individual and combined variables based on explanatory causal effects for explanations. We show that the definition and the method work well through experiments on some real-world data sets.
    Sufficient Statistic Memory Approximate Message Passing. (arXiv:2206.11674v1 [cs.IT])
    Approximate message passing (AMP) type algorithms have been widely used in the signal reconstruction of certain large random linear systems. A key feature of the AMP-type algorithms is that their dynamics can be correctly described by state evolution. However, state evolution does not necessarily guarantee the convergence of iterative algorithms. To solve the convergence problem of AMP-type algorithms in principle, this paper proposes a memory AMP (MAMP) under a sufficient statistic condition, named sufficient statistic MAMP (SS-MAMP). We show that the covariance matrices of SS-MAMP are L-banded and convergent. Given an arbitrary MAMP, we can construct the SS-MAMP by damping, which not only ensures the convergence, but also preserves the orthogonality, i.e., its dynamics can be correctly described by state evolution.
    Propagation with Adaptive Mask then Training for Node Classification on Attributed Networks. (arXiv:2206.10142v2 [cs.LG] UPDATED)
    Node classification on attributed networks is a semi-supervised task that is crucial for network analysis. By decoupling two critical operations in Graph Convolutional Networks (GCNs), namely feature transformation and neighborhood aggregation, some recent works on decoupled GCNs could support the information to propagate deeper and achieve advanced performance. However, they follow the traditional structure-aware propagation strategy of GCNs, making it hard to capture the attribute correlation of nodes and sensitive to the structure noise described by edges whose two endpoints belong to different categories. To address these issues, we propose a new method called \textit{Propagation with Adaptive Mask then Training} (PAMT). The key idea is to integrate the attribute similarity mask into the structure-aware propagation process. In this way, PAMT could preserve the attribute correlation of adjacent nodes during the propagation and effectively reduce the influence of structure noise. Moreover, we develop an iterative refinement mechanism to update the similarity mask during the training process to improve the training performance. Extensive experiments on four real-world datasets demonstrate the superior performance and robustness of PAMT.
    Low-Rank Mirror-Prox for Nonsmooth and Low-Rank Matrix Optimization Problems. (arXiv:2206.11523v1 [math.OC])
    Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a \textit{strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, approximated variants of two popular \textit{mirror-prox} methods: the Euclidean \textit{extragradient method} and mirror-prox with \textit{matrix exponentiated gradient updates}, when initialized with a "warm-start", converge to an optimal solution with rate $O(1/t)$, while requiring only two \textit{low-rank} SVDs per iteration. Moreover, for the extragradient method we also consider relaxed versions of strict complementarity which yield a trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating both the plausibility of the strict complementarity assumption, and the efficient convergence of our proposed low-rank mirror-prox variants.
    Prototype-Anchored Learning for Learning with Imperfect Annotations. (arXiv:2206.11602v1 [cs.LG])
    The success of deep neural networks greatly relies on the availability of large amounts of high-quality annotated data, which however are difficult or expensive to obtain. The resulting labels may be class imbalanced, noisy or human biased. It is challenging to learn unbiased classification models from imperfectly annotated datasets, on which we usually suffer from overfitting or underfitting. In this work, we thoroughly investigate the popular softmax loss and margin-based loss, and offer a feasible approach to tighten the generalization error bound by maximizing the minimal sample margin. We further derive the optimality condition for this purpose, which indicates how the class prototypes should be anchored. Motivated by theoretical analysis, we propose a simple yet effective method, namely prototype-anchored learning (PAL), which can be easily incorporated into various learning-based classification schemes to handle imperfect annotation. We verify the effectiveness of PAL on class-imbalanced learning and noise-tolerant learning by extensive experiments on synthetic and real-world datasets.
    Invariant Causal Mechanisms through Distribution Matching. (arXiv:2206.11646v1 [cs.LG])
    Learning representations that capture the underlying data generating process is a key problem for data efficient and robust use of neural networks. One key property for robustness which the learned representation should capture and which recently received a lot of attention is described by the notion of invariance. In this work we provide a causal perspective and new algorithm for learning invariant representations. Empirically we show that this algorithm works well on a diverse set of tasks and in particular we observe state-of-the-art performance on domain generalization, where we are able to significantly boost the score of existing models.
    Backward baselines: Is your model predicting the past?. (arXiv:2206.11673v1 [cs.LG])
    When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to which extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.
    On a class of geodesically convex optimization problems solved via Euclidean MM methods. (arXiv:2206.11426v1 [math.OC])
    We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality along with guarantees on iteration complexity. On the other hand, the split structure permits us to develop Euclidean Majorization-Minimization algorithms that help us bypass the need to compute expensive Riemannian operations such as exponential maps and parallel transport. We illustrate our results by specializing them to a few concrete optimization problems that have been previously studied in the machine learning literature. Ultimately, we hope our work helps motivate the broader search for mixed Euclidean-Riemannian optimization algorithms.
    On Pre-Training for Federated Learning. (arXiv:2206.11488v1 [cs.LG])
    In most of the literature on federated learning (FL), neural networks are initialized with random weights. In this paper, we present an empirical study on the effect of pre-training on FL. Specifically, we aim to investigate if pre-training can alleviate the drastic accuracy drop when clients' decentralized data are non-IID. We focus on FedAvg, the fundamental and most widely used FL algorithm. We found that pre-training does largely close the gap between FedAvg and centralized learning under non-IID data, but this does not come from alleviating the well-known model drifting problem in FedAvg's local training. Instead, how pre-training helps FedAvg is by making FedAvg's global aggregation more stable. When pre-training using real data is not feasible for FL, we propose a novel approach to pre-train with synthetic data. On various image datasets (including one for segmentation), our approach with synthetic pre-training leads to a notable gain, essentially a critical step toward scaling up federated learning for real-world applications.
    Investigation of stellar magnetic activity using variational autoencoder based on low-resolution spectroscopic survey. (arXiv:2206.07257v2 [astro-ph.SR] CROSS LISTED)
    We apply the variational autoencoder (VAE) to the LAMOST-K2 low-resolution spectra to detect the magnetic activity of the stars in the K2 field. After training on the spectra of the selected inactive stars, the VAE model can efficiently generate the synthetic reference templates needed by the spectral subtraction procedure, without knowing any stellar parameters. We then detect peculiar spectral features, such as chromospheric emissions, strong nebular emissions and lithium absorptions, in our sample. We measure the emissions of the chromospheric activity indicators, the H$\alpha$ and Ca$~{\rm {\small II}}$ infrared triplet (IRT) lines, to quantify the stellar magnetic activity. The excess emissions of the H$\alpha$ and Ca$~{\rm {\small II}}$ IRT lines of the active stars correlate well with the rotational periods and the amplitudes of light curves derived from the K2 photometry. We degrade the LAMOST spectra to simulate the slitless spectra of the planned China Space Station Telescope (CSST) and apply the VAE to the simulated data. For cool active stars, we find good agreement between the equivalent widths (EWs) of the H$\alpha$ line derived from the spectra at the two resolutions. The result demonstrates the ability to identify magnetically active stars in the future CSST survey, which will deliver an unprecedentedly large database of low-resolution spectra as well as simultaneous multi-band photometry of stars.
    CGAR: Critic Guided Action Redistribution in Reinforcement Learning. (arXiv:2206.11494v1 [cs.LG])
    Training a game-playing reinforcement learning agent requires multiple interactions with the environment, and ignorant random exploration may waste time and resources; it is essential to alleviate such waste. As discussed in this paper, under the settings of off-policy actor-critic algorithms, we demonstrate that the critic can yield expected discounted returns greater than or at least equal to the actor's. Thus, the Q value predicted by the critic is a better signal for redistributing the action originally sampled from the policy distribution predicted by the actor. This paper introduces the novel Critic Guided Action Redistribution (CGAR) algorithm and tests it on the OpenAI MuJoCo tasks. The experimental results demonstrate that our method improves sample efficiency and achieves state-of-the-art performance. Our code can be found at https://github.com/tairanhuang/CGAR.
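    As a rough illustration of the redistribution idea (our reading of the abstract, not the authors' implementation): sample several candidate actions from the actor and reweight them by the critic's Q-values. That the actor returns a distribution object, and the softmax redistribution with a temperature, are both assumptions.

```python
import torch

def cgar_style_action(actor, critic, state, n_candidates=8, temperature=1.0):
    # Draw several candidate actions from the actor's policy distribution.
    states = state.unsqueeze(0).repeat(n_candidates, 1)
    actions = actor(states).sample()          # assumes actor(states) is a distribution
    q = critic(states, actions).squeeze(-1)   # critic scores each candidate
    # Redistribute: favor candidates the critic expects to pay off more.
    probs = torch.softmax(q / temperature, dim=0)
    idx = torch.multinomial(probs, 1).item()
    return actions[idx]
```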
    Walk the Random Walk: Learning to Discover and Reach Goals Without Supervision. (arXiv:2206.11733v1 [cs.LG])
    Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or domain knowledge. We use random walks to train a reachability network that predicts the similarity between two states. This reachability network is then used to build a goal memory containing past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network with goals sampled from the goal memory and reward it using the reachability network and the goal memory. All components are kept updated throughout training as the agent discovers and learns new goals. We apply our method to continuous-control navigation and robotic manipulation tasks.
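    The reachability labels can be illustrated with a small helper: states at most k steps apart on a random walk form positive pairs, and temporally distant states form negatives; a binary network trained on these pairs then predicts similarity. Names and the within-trajectory negative sampling are assumptions, and the trajectory is assumed to be much longer than k.

```python
import random

def reachability_pairs(trajectory, k=5, n_pairs=256):
    """Build (positive, negative) state pairs from one random-walk trajectory."""
    pos, neg = [], []
    T = len(trajectory)
    for _ in range(n_pairs):
        i = random.randrange(T - 1)
        j = random.randrange(i + 1, min(i + k + 1, T))  # within k steps -> reachable
        pos.append((trajectory[i], trajectory[j]))
        far = random.randrange(T)
        while abs(far - i) <= k:                        # resample until far enough away
            far = random.randrange(T)
        neg.append((trajectory[i], trajectory[far]))    # distant in time -> negative
    return pos, neg
# A reachability network is then trained as a binary classifier on these pairs.
```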
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v1 [stat.ML])
    Conventional domain adaptation methods do not work well when a large gap exists between the source and the target domain. Gradual domain adaptation is one approach to address the problem by leveraging intermediate domains that gradually shift from the source to the target domain. Previous work assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, a gradual domain adaptation algorithm based on self-training with unlabeled datasets was applicable. In practice, however, gradual self-training can fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose using normalizing flows to mitigate this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our method in experiments with real-world datasets and confirm that it mitigates the problem described above and improves classification performance.
    Rethinking Collaborative Metric Learning: Toward an Efficient Alternative without Negative Sampling. (arXiv:2206.11549v1 [cs.LG])
    The recently proposed Collaborative Metric Learning (CML) paradigm has aroused wide interest in the area of recommendation systems (RS) owing to its simplicity and effectiveness. Typically, the existing CML literature depends largely on the \textit{negative sampling} strategy to alleviate the time-consuming burden of pairwise computation. However, in this work, through a theoretical analysis we find that negative sampling leads to a biased estimate of the generalization error. Specifically, we show that sampling-based CML introduces a bias term in the generalization bound, which is quantified by the per-user \textit{Total Variation} (TV) between the distribution induced by negative sampling and the ground truth distribution. This suggests that optimizing the sampling-based CML loss function does not ensure a small generalization error even with sufficiently large training data. Moreover, we show that the bias term vanishes without the negative sampling strategy. Motivated by this, we propose an efficient alternative to CML without negative sampling, named \textit{Sampling-Free Collaborative Metric Learning} (SFCML), to get rid of the sampling bias in a practical sense. Finally, comprehensive experiments over seven benchmark datasets speak to the superiority of the proposed algorithm.
    Authentication of Copy Detection Patterns under Machine Learning Attacks: A Supervised Approach. (arXiv:2206.11793v1 [cs.CR])
    Copy detection patterns (CDP) are an attractive technology that allows manufacturers to defend their products against counterfeiting. The main assumption behind the protection mechanism of CDP is that these codes, printed with the smallest symbol size (1x1) on an industrial printer, cannot be copied or cloned with sufficient accuracy due to the data processing inequality. However, previous works have shown that Machine Learning (ML) based attacks can produce high-quality fakes, resulting in decreased accuracy of traditional feature-based authentication systems. While Deep Learning (DL) can be used as a part of the authentication system, to the best of our knowledge, none of the previous works has studied the performance of a DL-based authentication system against ML-based attacks on CDP with 1x1 symbol size. In this work, we study this performance under a supervised learning (SL) setting.
    Community Recovery in the Geometric Block Model. (arXiv:2206.11303v1 [cs.SI])
    To capture inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a \emph{Geometric Block Model}. The geometric block model builds on the \emph{random geometric graphs} (Gilbert, 1961), one of the basic models of random graphs for spatial networks, in the same way that the well-studied stochastic block model builds on the Erd\H{o}s-R\'{en}yi random graphs. It is also a natural extension of random community models inspired by recent theoretical and practical advancements in community detection. To analyze the geometric block model, we first provide new connectivity results for \emph{random annulus graphs}, which are generalizations of random geometric graphs. The connectivity properties of geometric graphs have been studied since their introduction, and analyzing them has been difficult due to correlated edge formation. We then use the connectivity results of random annulus graphs to provide necessary and sufficient conditions for efficient recovery of communities for the geometric block model. We show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. For this, we consider two regimes of graph density. In the regime where the average degree of the graph grows logarithmically with the number of vertices, we show that our algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from optimal for the stochastic block model in the logarithmic degree regime. We also look at the regime where the average degree of the graph grows linearly with the number of vertices $n$, and hence to store the graph one needs $\Theta(n^2)$ memory. We show that our algorithm needs to store only $O(n \log n)$ edges in this regime to recover the latent communities.
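    A toy rendition of the triangle-counting idea: in a geometric graph, nearby nodes share many neighbors, so edges inside a community close more triangles than edges across communities. The threshold below is illustrative; in the paper it is derived from the model parameters.

```python
import networkx as nx

def split_edges_by_triangles(G, threshold):
    """Label each edge as intra- or inter-community by its common-neighbor count."""
    intra, inter = [], []
    for u, v in G.edges():
        common = len(set(G[u]) & set(G[v]))  # number of triangles through (u, v)
        (intra if common >= threshold else inter).append((u, v))
    return intra, inter

G = nx.random_geometric_graph(200, 0.15, seed=0)  # spatial graph as a stand-in
intra, inter = split_edges_by_triangles(G, threshold=3)
print(len(intra), len(inter))
```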
    Modular Conformal Calibration. (arXiv:2206.11468v1 [cs.LG])
    Uncertainty estimates must be calibrated (i.e., accurate) and sharp (i.e., informative) in order to be useful. This has motivated a variety of methods for recalibration, which use held-out data to turn an uncalibrated model into a calibrated model. However, the applicability of existing methods is limited due to their assumption that the original model is also a probabilistic model. We introduce a versatile class of algorithms for recalibration in regression that we call Modular Conformal Calibration (MCC). This framework allows one to transform any regression model into a calibrated probabilistic model. The modular design of MCC allows us to make simple adjustments to existing algorithms that enable well-behaved distribution predictions. We also provide finite-sample calibration guarantees for MCC algorithms. Our framework recovers isotonic recalibration, conformal calibration, and conformal interval prediction, implying that our theoretical results apply to those methods as well. Finally, we conduct an empirical study of MCC on 17 regression datasets. Our results show that new algorithms designed in our framework achieve near-perfect calibration and improve sharpness relative to existing methods.
    Improved Regret for Differentially Private Exploration in Linear MDP. (arXiv:2202.01292v2 [cs.LG] UPDATED)
    We study privacy-preserving exploration in sequential decision-making for environments that rely on sensitive data such as medical records. In particular, we focus on solving the problem of reinforcement learning (RL) subject to the constraint of (joint) differential privacy in the linear MDP setting, where both dynamics and rewards are given by linear functions. Prior work on this problem due to Luyo et al. (2021) achieves a regret rate that has a dependence of $O(K^{3/5})$ on the number of episodes $K$. We provide a private algorithm with an improved regret rate with an optimal dependence of $O(\sqrt{K})$ on the number of episodes. The key recipe for our stronger regret guarantee is the adaptivity in the policy update schedule, in which an update only occurs when sufficient changes in the data are detected. As a result, our algorithm benefits from low switching cost and only performs $O(\log(K))$ updates, which greatly reduces the amount of privacy noise. Finally, in the most prevalent privacy regimes where the privacy parameter $\epsilon$ is a constant, our algorithm incurs negligible privacy cost -- in comparison with the existing non-private regret bounds, the additional regret due to privacy appears in lower-order terms.
    Patient Aware Active Learning for Fine-Grained OCT Classification. (arXiv:2206.11485v1 [eess.IV])
    This paper considers making active learning more sensible from a medical perspective. In practice, a disease manifests itself in different forms across patient cohorts. Existing frameworks have primarily used mathematical constructs to engineer uncertainty- or diversity-based methods for selecting the most informative samples. However, such algorithms do not present themselves naturally as usable by the medical community and healthcare providers, so their deployment in clinical settings is very limited, if it happens at all. For this purpose, we propose a framework that incorporates clinical insights into the sample selection process of active learning and that can be combined with existing algorithms. Our medically interpretable active learning framework captures diverse disease manifestations from patients to improve the generalization performance of OCT classification. After comprehensive experiments, we report that incorporating patient insights within the active learning framework yields performance that matches or surpasses five commonly used paradigms on two architectures with a dataset having imbalanced patient distributions. Also, the framework integrates within existing medical practices and thus can be used by healthcare providers.
    Linear Speedup in Personalized Collaborative Learning. (arXiv:2111.05968v4 [cs.LG] UPDATED)
    Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as stochastic optimization of a task $0$ given access to $N$ related but different tasks $1, \dots, N$. We provide convergence guarantees for two algorithms in this setting -- a popular collaboration method known as weighted gradient averaging, and a novel bias correction method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks $N$. Further, we also empirically study their performance, confirming our theoretical insights.
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v3 [cs.LG] UPDATED)
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .
    Shilling Black-box Recommender Systems by Learning to Generate Fake User Profiles. (arXiv:2206.11433v1 [cs.IR])
    Due to the pivotal role of Recommender Systems (RS) in guiding customers towards purchases, there is a natural motivation for unscrupulous parties to spoof RS for profit. In this paper, we study the Shilling Attack, where an adversarial party injects a number of fake user profiles for improper purposes. Conventional Shilling Attack approaches lack attack transferability (i.e., attacks are not effective on some victim RS models) and/or attack invisibility (i.e., injected profiles can be easily detected). To overcome these issues, we present Leg-UP, a novel attack model based on the Generative Adversarial Network. Leg-UP learns user behavior patterns from real users in sampled ``templates'' and constructs fake user profiles. To simulate real users, the generator in Leg-UP directly outputs discrete ratings. To enhance attack transferability, the parameters of the generator are optimized by maximizing the attack performance on a surrogate RS model. To improve attack invisibility, Leg-UP adopts a discriminator to guide the generator to generate undetectable fake user profiles. Experiments on benchmarks have shown that Leg-UP outperforms state-of-the-art Shilling Attack methods on a wide range of victim RS models. The source code of our work is available at: https://github.com/XMUDM/ShillingAttack.
    Predicting the Geoeffectiveness of CMEs Using Machine Learning. (arXiv:2206.11472v1 [astro-ph.SR])
    Coronal mass ejections (CMEs) are the most geoeffective space weather phenomena: they are associated with large geomagnetic storms and have the potential to disturb telecommunications, disrupt satellite networks, and damage power grids. Thus, considering these storms' potential effects on human activities, accurate forecasts of the geoeffectiveness of CMEs are paramount. This work focuses on experimenting with different machine learning methods trained on white-light coronagraph datasets of close-to-Sun CMEs, to estimate whether a newly erupting ejection has the potential to induce geomagnetic activity. We developed binary classification models using logistic regression, K-Nearest Neighbors, Support Vector Machines, feed-forward artificial neural networks, as well as ensemble models. At this time, we limit our forecast to exclusively use solar onset parameters, to ensure extended warning times. We discuss the main challenges of this task, namely the extreme imbalance between the number of geoeffective and ineffective events in our dataset, along with their numerous similarities and the limited number of available variables. We show that even under such conditions, adequate hit rates can be achieved with these models.
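    The class-imbalance challenge the authors highlight admits a simple baseline illustration: re-weight the rare geoeffective class during training and evaluate the hit rate (recall). The synthetic data and the logistic-regression choice below are ours; the paper also evaluates several other model families.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 4))                                 # stand-in onset parameters
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 2.2).astype(int)  # rare positive events

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)  # up-weight rare class
print("hit rate (recall):", recall_score(y_te, clf.predict(X_te)))
```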
    A Geometric Method for Improved Uncertainty Estimation in Real-time. (arXiv:2206.11562v1 [cs.LG])
    Machine learning classifiers are probabilistic in nature and thus inevitably involve uncertainty. Predicting the probability that a specific input is correctly classified is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc model calibration can improve a model's uncertainty estimates without retraining and without changing the model. Our work puts forward a geometric approach to uncertainty estimation: roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty, and then calibrate that signal (instead of the model's estimate) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimates than recently proposed approaches through extensive evaluation on multiple datasets and models. In addition, we demonstrate the possibility of applying our approach in near-real-time applications. Our code is available on GitHub: https://github.com/NoSleepDeveloper/Geometric-Calibrator.
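    A minimal sketch of the pipeline described above, under our own assumptions: the mean distance to the k nearest training points serves as the raw signal, and an isotonic regressor calibrates that signal against whether the base model was correct on held-out data (the paper's exact calibrator may differ).

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.neighbors import NearestNeighbors

def fit_geometric_calibrator(X_train, X_val, correct_val, k=5):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    d_val = nn.kneighbors(X_val)[0].mean(axis=1)    # distance signal on held-out data
    iso = IsotonicRegression(increasing=False, out_of_bounds="clip")
    iso.fit(d_val, correct_val)                     # farther from training -> less confident
    return nn, iso

def confidence(nn, iso, X_test):
    d = nn.kneighbors(X_test)[0].mean(axis=1)
    return iso.predict(d)                           # calibrated correctness probability
```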
    Classical surrogates for quantum learning models. (arXiv:2206.11740v1 [quant-ph])
    The advent of noisy intermediate-scale quantum computers has put the search for possible applications at the forefront of quantum information science. One area where hopes for an advantage through near-term quantum computers are high is quantum machine learning, where variational quantum learning models based on parametrized quantum circuits are discussed. In this work, we introduce the concept of a classical surrogate, a classical model which can be efficiently obtained from a trained quantum learning model and reproduces its input-output relations. As inference can be performed classically, the existence of a classical surrogate greatly enhances the applicability of a quantum learning strategy. However, the classical surrogate also challenges possible advantages of quantum schemes. As it is possible to directly optimize the ansatz of the classical surrogate, surrogates create a natural benchmark that the quantum model has to outperform. We show that large classes of well-analyzed re-uploading models have a classical surrogate. We conducted numerical experiments and found that these quantum models show no advantage in performance or trainability in the problems we analyze. This leaves only generalization capability as a possible point of quantum advantage and emphasizes the dire need for a better understanding of the inductive biases of quantum learning models.
    Quant-BnB: A Scalable Branch-and-Bound Method for Optimal Decision Trees with Continuous Features. (arXiv:2206.11844v1 [cs.LG])
    Decision trees are one of the most useful and popular methods in the machine learning toolbox. In this paper, we consider the problem of learning optimal decision trees, a combinatorial optimization problem that is challenging to solve at scale. A common approach in the literature is to use greedy heuristics, which may not be optimal. Recently there has been significant interest in learning optimal decision trees using various approaches (e.g., based on integer programming, dynamic programming) -- to achieve computational scalability, most of these approaches focus on classification tasks with binary features. In this paper, we present a new discrete optimization method based on branch-and-bound (BnB) to obtain optimal decision trees. Different from existing customized approaches, we consider both regression and classification tasks with continuous features. The basic idea underlying our approach is to split the search space based on the quantiles of the feature distribution -- leading to upper and lower bounds for the underlying optimization problem along the BnB iterations. Our proposed algorithm Quant-BnB shows significant speedups compared to existing approaches for shallow optimal trees on various real datasets.
    $p$-Laplacian Based Graph Neural Networks. (arXiv:2111.07337v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have demonstrated superior performance for semi-supervised node classification on graphs, as a result of their ability to exploit node features and topological information simultaneously. However, most GNNs implicitly assume that the labels of nodes and their neighbors in a graph are the same or consistent, which does not hold in heterophilic graphs, where the labels of linked nodes are likely to differ. Hence, when the topology is non-informative for label prediction, ordinary GNNs may work significantly worse than simply applying multi-layer perceptrons (MLPs) on each node. To tackle the above problem, we propose a new $p$-Laplacian based GNN model, termed $^p$GNN, whose message passing mechanism is derived from a discrete regularization framework and can be theoretically explained as an approximation of a polynomial graph filter defined on the spectral domain of $p$-Laplacians. The spectral analysis shows that the new message passing mechanism works simultaneously as low-pass and high-pass filters, thus making $^p$GNNs effective on both homophilic and heterophilic graphs. Empirical studies on real-world and synthetic datasets validate our findings and demonstrate that $^p$GNNs significantly outperform several state-of-the-art GNN architectures on heterophilic benchmarks while achieving competitive performance on homophilic benchmarks. Moreover, $^p$GNNs can adaptively learn aggregation weights and are robust to noisy edges.
    Waypoint Generation in Row-based Crops with Deep Learning and Contrastive Clustering. (arXiv:2206.11623v1 [cs.RO])
    The development of precision agriculture has gradually introduced automation in the agricultural process to support and rationalize all the activities related to field management. In particular, service robotics plays a predominant role in this evolution by deploying autonomous agents able to navigate in fields while executing different tasks without the need for human intervention, such as monitoring, spraying and harvesting. In this context, global path planning is the first necessary step for every robotic mission and ensures that the navigation is performed efficiently and with complete field coverage. In this paper, we propose a learning-based approach to tackle waypoint generation for planning a navigation path for row-based crops, starting from a top-view map of the region-of-interest. We present a novel methodology for waypoint clustering based on a contrastive loss, able to project the points to a separable latent space. The proposed deep neural network can simultaneously predict the waypoint position and cluster assignment with two specialized heads in a single forward pass. The extensive experimentation on simulated and real-world images demonstrates that the proposed approach effectively solves the waypoint generation problem for both straight and curved row-based crops, overcoming the limitations of previous state-of-the-art methodologies.
    Neural Network-augmented Kalman Filtering for Robust Online Speech Dereverberation in Noisy Reverberant Environments. (arXiv:2204.02741v2 [eess.AS] UPDATED)
    In this paper, a neural network-augmented algorithm for noise-robust online dereverberation with a Kalman filtering variant of the weighted prediction error (WPE) method is proposed. The filter stochastic variations are predicted by a deep neural network (DNN) trained end-to-end using the filter residual error and signal characteristics. The presented framework allows for robust dereverberation on a single-channel noisy reverberant dataset similar to WHAMR!. The Kalman filtering WPE introduces distortions in the enhanced signal when predicting the filter variations from the residual error only, if the target speech power spectral density is not perfectly known and the observation is noisy. The proposed approach avoids these distortions by correcting the filter variations estimation in a data-driven way, increasing the robustness of the method to noisy scenarios. Furthermore, it yields a strong dereverberation and denoising performance compared to a DNN-supported recursive least squares variant of WPE, especially for highly noisy inputs.
    Backpropagation at the Infinitesimal Inference Limit of Energy-Based Models: Unifying Predictive Coding, Equilibrium Propagation, and Contrastive Hebbian Learning. (arXiv:2206.02629v2 [cs.LG] UPDATED)
    How the brain performs credit assignment is a fundamental unsolved problem in neuroscience. Many `biologically plausible' algorithms have been proposed, which compute gradients that approximate those computed by backpropagation (BP), and which operate in ways that more closely satisfy the constraints imposed by neural circuitry. Many such algorithms utilize the framework of energy-based models (EBMs), in which all free variables in the model are optimized to minimize a global energy function. However, in the literature, these algorithms exist in isolation and no unified theory exists linking them together. Here, we provide a comprehensive theory of the conditions under which EBMs can approximate BP, which lets us unify many of the BP approximation results in the literature (namely, predictive coding, equilibrium propagation, and contrastive Hebbian learning) and demonstrate that their approximation to BP arises from a simple and general mathematical property of EBMs at free-phase equilibrium. This property can then be exploited in different ways with different energy functions, and these specific choices yield a family of BP-approximating algorithms, which both includes the known results in the literature and can be used to derive new ones.
    Reachability analysis of neural networks using mixed monotonicity. (arXiv:2111.07683v3 [eess.SY] UPDATED)
    This paper presents a new reachability analysis approach to compute interval over-approximations of the output set of feedforward neural networks with input uncertainty. We adapt to neural networks an existing mixed-monotonicity method for the reachability analysis of dynamical systems and apply it to each partial network within the main network. This ensures that the intersection of the obtained results is the tightest interval over-approximation of the output of each layer that can be obtained using mixed-monotonicity on any partial network decomposition. Unlike other tools in the literature focusing on small classes of piecewise-affine or monotone activation functions, the main strength of our approach is its generality: it can handle neural networks with any Lipschitz-continuous activation function. In addition, the simplicity of our framework allows users to very easily add unimplemented activation functions, by simply providing the function, its derivative and the global argmin and argmax of the derivative. Our algorithm is compared to five other interval-based tools (Interval Bound Propagation, ReluVal, Neurify, VeriNet, CROWN) on both existing benchmarks and two sets of small and large randomly generated networks for four activation functions (ReLU, TanH, ELU, SiLU).
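    For intuition, here is a plain interval-bound-propagation sketch for a small ReLU network, a simpler relative of the paper's mixed-monotonicity analysis (which is generally tighter and handles any Lipschitz-continuous activation). For simplicity, ReLU is applied after every layer.

```python
import numpy as np

def interval_forward(weights, biases, lo, hi):
    """Propagate an input box [lo, hi] through a feedforward ReLU network."""
    for W, b in zip(weights, biases):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        new_lo = W_pos @ lo + W_neg @ hi + b   # smallest achievable pre-activation
        new_hi = W_pos @ hi + W_neg @ lo + b   # largest achievable pre-activation
        lo, hi = np.maximum(new_lo, 0), np.maximum(new_hi, 0)  # ReLU is monotone
    return lo, hi

W1, b1 = np.random.randn(4, 2), np.zeros(4)
W2, b2 = np.random.randn(1, 4), np.zeros(1)
print(interval_forward([W1, W2], [b1, b2], np.array([-0.1, -0.1]), np.array([0.1, 0.1])))
```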
    A Framework for Learning to Request Rich and Contextually Useful Information from Humans. (arXiv:2110.08258v4 [cs.LG] UPDATED)
    When deployed, AI agents will encounter problems that are beyond their autonomous problem-solving capabilities. Leveraging human assistance can help agents overcome their inherent limitations and robustly cope with unfamiliar situations. We present a general interactive framework that enables an agent to request and interpret rich, contextually useful information from an assistant that has knowledge about the task and the environment. We demonstrate the practicality of our framework on a simulated human-assisted navigation problem. Aided with an assistance-requesting policy learned by our method, a navigation agent achieves up to a 7x improvement in success rate on tasks that take place in previously unseen environments, compared to fully autonomous behavior. We show that the agent can take advantage of different types of information depending on the context, and analyze the benefits and challenges of learning the assistance-requesting policy when the assistant can recursively decompose tasks into subtasks.
    Semantic Communications: Principles and Challenges. (arXiv:2201.01389v3 [cs.IT] UPDATED)
    Semantic communication, regarded as the breakthrough beyond the Shannon paradigm, aims at the successful transmission of semantic information conveyed by the source rather than the accurate reception of each single symbol or bit regardless of its meaning. This article provides an overview on semantic communications. After a brief review of Shannon information theory, we discuss semantic communications with theory, framework, and system design enabled by deep learning. Different from the symbol/bit error rate used for measuring conventional communication systems, performance metrics for semantic communications are also discussed. The article concludes with several open questions in semantic communications.
    Projection-free Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v1 [math.OC])
    We study a projection-free conditional gradient-type algorithm for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we establish that the number of calls to the stochastic first-order oracle and the linear minimization oracle to obtain an appropriately defined $\epsilon$-stationary point, are of the order $\mathcal{O}(1/\epsilon^{2.5})$ and $\mathcal{O}(1/\epsilon^{5.5})$ respectively. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.
    Learning Representations for Control with Hierarchical Forward Models. (arXiv:2206.11396v1 [cs.LG])
    Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions. Instead, we propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns representations via a hierarchy of forward models that operate at varying magnitudes of step skipping while also learning to communicate between levels in the hierarchy. We evaluate HKSL in a suite of 30 robotic control tasks and find that HKSL either reaches higher episodic returns or converges to maximum performance more quickly than several current baselines. Also, we find that levels in HKSL's hierarchy can learn to specialize in long- or short-term consequences of agent actions, thereby providing the downstream control policy with more informative representations. Finally, we determine that communication channels between hierarchy levels organize information based on both sides of the communication process, which improves sample efficiency.
    Optimizing Two-way Partial AUC with an End-to-end Framework. (arXiv:2206.11655v1 [cs.LG])
    The Area Under the ROC Curve (AUC) is a crucial metric for machine learning, which evaluates the average performance over all possible True Positive Rates (TPRs) and False Positive Rates (FPRs). Based on the knowledge that a skillful classifier should simultaneously achieve a high TPR and a low FPR, we turn to study a more general variant called Two-way Partial AUC (TPAUC), where only the region with $\mathsf{TPR} \ge \alpha, \mathsf{FPR} \le \beta$ is included in the area. Moreover, recent work shows that the TPAUC is essentially inconsistent with the existing Partial AUC metrics, where only the FPR range is restricted, opening a new problem: seeking solutions to leverage a high TPAUC. Motivated by this, we present in this paper the first attempt to optimize this new metric. The critical challenge along this course lies in the difficulty of performing gradient-based optimization with end-to-end stochastic training, even with a proper choice of surrogate loss. To address this issue, we propose a generic framework to construct surrogate optimization problems, which supports efficient end-to-end training with deep learning. Moreover, our theoretical analyses show that: 1) the objective function of the surrogate problems will achieve an upper bound of the original problem under mild conditions, and 2) optimizing the surrogate problems leads to good generalization performance in terms of TPAUC with high probability. Finally, empirical studies over several benchmark datasets speak to the efficacy of our framework.
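    For readers who want the metric itself, the following is a rough empirical computation of a two-way partial AUC from scores: restrict the ROC curve to the region with TPR >= alpha and FPR <= beta and integrate. This is an evaluation sketch under our own normalization, not the paper's end-to-end training framework.

```python
import numpy as np
from sklearn.metrics import roc_curve

def two_way_partial_auc(y_true, scores, alpha=0.8, beta=0.2):
    fpr, tpr, _ = roc_curve(y_true, scores)
    mask = (tpr >= alpha) & (fpr <= beta)    # admissible region of the ROC curve
    if mask.sum() < 2:
        return 0.0                           # the curve never enters the region
    # Area between the ROC curve and the TPR = alpha line, normalized by the
    # maximum achievable area (1 - alpha) * beta.
    return np.trapz(tpr[mask] - alpha, fpr[mask]) / ((1 - alpha) * beta)
```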
    Recursive Reinforcement Learning. (arXiv:2206.11430v1 [cs.LG])
    Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning -- a model-free RL algorithm for RMDPs -- and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
    Input-agnostic Certified Group Fairness via Gaussian Parameter Smoothing. (arXiv:2206.11423v1 [cs.LG])
    Only recently have researchers attempted to provide classification algorithms with provable group fairness guarantees. Most of these algorithms are hampered by the requirement that the training and deployment data follow the same distribution. This paper proposes an input-agnostic certified group fairness algorithm, FairSmooth, for improving the fairness of classification models while maintaining remarkable prediction accuracy. A Gaussian parameter smoothing method is developed to transform base classifiers into their smooth versions. An optimal individual smooth classifier is learnt for each group using only the data for that group, and an overall smooth classifier for all groups is generated by averaging the parameters of all the individual smooth ones. By leveraging the theory of nonlinear functional analysis, the smooth classifiers are reformulated as output functions of a Nemytskii operator. Theoretical analysis is conducted to show that the Nemytskii operator is smooth and induces a Frechet differentiable smooth manifold. We theoretically demonstrate that the smooth manifold has a global Lipschitz constant that is independent of the domain of the input data, which yields the input-agnostic certified group fairness.
    Prevent Car Accidents by Using AI. (arXiv:2206.11381v1 [cs.LG])
    As society develops, transportation facilities become more advanced and people's travel demand increases, but so do the traffic safety issues that arise as a result; car accidents are a major issue all over the world. The cost of traffic fatalities and driver injuries has a significant impact on society. The use of machine learning techniques in the field of traffic accidents is becoming increasingly popular, with machine learning classifiers used instead of traditional data mining techniques to produce better results and accuracy. This project therefore reviews existing work on accident prediction using machine learning. We use crash data and weather data to train machine learning models to predict crash severity and help reduce crashes.
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v1 [math.ST])
    Bi-stochastic normalization of a kernelized graph affinity matrix provides an alternative normalization scheme for graph Laplacian methods in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations in practice. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to the manifold (weighted-)Laplacian, with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)}$. When the manifold data are corrupted by outlier noise, we theoretically prove the point-wise consistency of the graph Laplacian, which matches the rate for clean manifold data up to an additional error term proportional to the boundedness of the mutual inner products of the noise vectors. Our analysis suggests that, under the setting considered in this paper, not exact bi-stochastic normalization but an approximate one achieves the same consistency rate. Motivated by the analysis, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination, and apply it to simulated manifold data, both clean and with outlier noise. Numerical experiments support our theoretical results and show the robustness of the bi-stochastically normalized graph Laplacian to outlier noise.
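    The SK iterations with early termination mentioned at the end can be sketched in a few lines; the damped symmetric update below and its tolerance are illustrative choices, not the paper's exact scaling problem.

```python
import numpy as np

def bistochastic_normalize(K, n_iter=500, tol=1e-8):
    """Symmetric Sinkhorn-Knopp: find d so that diag(d) K diag(d) is ~doubly stochastic."""
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))       # damped update; fixed point: d_i (K d)_i = 1
        if np.max(np.abs(d_new - d)) < tol:
            d = d_new
            break                          # early termination
        d = d_new
    return d[:, None] * K * d[None, :]

X = np.random.default_rng(0).normal(size=(200, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = bistochastic_normalize(np.exp(-sq / 0.5))   # Gaussian kernel affinity
print(W.sum(axis=1)[:5])                        # row sums approach 1
```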
    Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan. (arXiv:2206.11400v1 [econ.GN])
    Can mobile phone data improve program targeting? By combining rich survey data from a "big push" anti-poverty program in Afghanistan with detailed mobile phone logs from program beneficiaries, we study the extent to which machine learning methods can accurately differentiate ultra-poor households eligible for program benefits from ineligible households. We show that machine learning methods leveraging mobile phone data can identify ultra-poor households nearly as accurately as survey-based measures of consumption and wealth; and that combining survey-based measures with mobile phone data produces classifications more accurate than those based on a single data source.
    Reinforcement Learning under Partial Observability Guided by Learned Environment Models. (arXiv:2206.11708v1 [cs.LG])
    In practical applications, we can rarely assume full observability of a system's environment, despite such knowledge being important for determining a reactive control system's precise interaction with its environment. Therefore, we propose an approach for reinforcement learning (RL) in partially observable environments. While assuming that the environment behaves like a partially observable Markov decision process with known discrete actions, we assume no knowledge about its structure or transition probabilities. Our approach combines Q-learning with IoAlergia, a method for learning Markov decision processes (MDP). By learning MDP models of the environment from episodes of the RL agent, we enable RL in partially observable domains without explicit, additional memory to track previous interactions for dealing with ambiguities stemming from partial observability. We instead provide RL with additional observations in the form of abstract environment states by simulating new experiences on learned environment models to track the explored states. In our evaluation, we report on the validity of our approach and its promising performance in comparison to six state-of-the-art deep RL techniques with recurrent neural networks and fixed memory.
    Safe Reinforcement Learning Using Robust Control Barrier Functions. (arXiv:2110.05415v2 [eess.SY] UPDATED)
    Reinforcement Learning (RL) has been shown to be effective in many scenarios. However, it typically requires the exploration of a sufficiently large number of state-action pairs, some of which may be unsafe. Consequently, its application to safety-critical systems remains a challenge. An increasingly common approach to address safety involves the addition of a safety layer that projects the RL actions onto a safe set of actions. In turn, a difficulty for such frameworks is how to effectively couple RL with the safety layer to improve the learning performance. In this paper, we frame safety as a differentiable robust-control-barrier-function layer in a model-based RL framework. Moreover, we also propose an approach to modularly learn the underlying reward-driven task, independent of safety constraints. We demonstrate that this approach both ensures safety and effectively guides exploration during training in a range of experiments, including zero-shot transfer when the reward is learned in a modular way.
    On the Parameterization and Initialization of Diagonal State Space Models. (arXiv:2206.11893v1 [cs.LG])
    State space models (SSMs) have recently been shown to be very effective as a deep learning layer, providing a promising alternative to sequence models such as RNNs, CNNs, or Transformers. The first version to show this potential was the S4 model, which is particularly effective on tasks involving long-range dependencies by using a prescribed state matrix called the HiPPO matrix. While this has an interpretable mathematical mechanism for modeling long dependencies, it introduces a custom representation and algorithm that can be difficult to implement. On the other hand, a recent variant of S4 called DSS showed that restricting the state matrix to be fully diagonal can still preserve the performance of the original model, when using a specific initialization based on approximating S4's matrix. This work seeks to systematically understand how to parameterize and initialize such diagonal state space models. While it follows from classical results that almost all SSMs have an equivalent diagonal form, we show that the initialization is critical for performance. We explain mathematically why DSS works, by showing that the diagonal restriction of S4's matrix surprisingly recovers the same kernel in the limit of infinite state dimension. We also systematically describe various design choices in parameterizing and computing diagonal SSMs, and perform a controlled empirical study ablating the effects of these choices. Our final model, S4D, is a simple diagonal version of S4 whose kernel computation requires just 2 lines of code; it performs comparably to S4 in almost all settings, with state-of-the-art results in the image, audio, and medical time-series domains, and averages 85\% on the Long Range Arena benchmark.
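    The "just 2 lines of code" claim refers to the fact that a diagonal SSM's convolution kernel reduces to a Vandermonde matrix-vector product. The sketch below is our reading of that computation; the S4D-Lin-style initialization and the factor of 2 from conjugate-pair parameters are assumptions based on the diagonal-SSM literature, not the paper's exact parameterization.

```python
import numpy as np

def diag_ssm_kernel(A, B, C, dt, L):
    """Convolution kernel of a diagonal SSM (A, B, C complex of shape (N,))."""
    dA = np.exp(dt * A)                 # ZOH-discretized diagonal state matrix
    dB = (dA - 1.0) / A * B             # ZOH-discretized input projection
    # The kernel is a Vandermonde product: K[l] = sum_n C_n dB_n dA_n**l.
    K = ((C * dB)[None, :] * dA[None, :] ** np.arange(L)[:, None]).sum(-1)
    return 2 * K.real                   # parameters kept in conjugate pairs (assumed)

N, L = 32, 64
A = -0.5 + 1j * np.pi * np.arange(N)    # S4D-Lin-style initialization (assumed)
B = np.ones(N, dtype=complex)
C = np.random.randn(N) + 1j * np.random.randn(N)
K = diag_ssm_kernel(A, B, C, dt=0.01, L=L)   # convolve input sequences with K
```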
    RetroGraph: Retrosynthetic Planning with Graph Search. (arXiv:2206.11477v1 [cs.AI])
    Retrosynthetic planning, which aims to find a reaction pathway to synthesize a target molecule, plays an important role in chemistry and drug discovery. This task is usually modeled as a search problem. Recently, data-driven methods have attracted many research interests and shown promising results for retrosynthetic planning. We observe that the same intermediate molecules are visited many times in the search process, yet they are usually treated independently in previous tree-based methods (e.g., AND-OR tree search, Monte Carlo tree search). Such redundancies make the search process inefficient. We propose a graph-based search policy that eliminates the redundant explorations of any intermediate molecules. As searching over a graph is more complicated than over a tree, we further adopt a graph neural network to guide the search over graphs. Meanwhile, our method can search a batch of targets together in the graph and remove the inter-target duplication present in tree-based search methods. Experimental results on two datasets demonstrate the effectiveness of our method. In particular, on the widely used USPTO benchmark, we improve the search success rate to 99.47%, advancing the previous state-of-the-art performance by 2.6 points.
    A generalised form for a homogeneous population of structures using an overlapping mixture of Gaussian processes. (arXiv:2206.11683v1 [cs.LG])
    Reductions in natural frequency are often used as a damage indicator for structural health monitoring (SHM) purposes. However, fluctuations in operational and environmental conditions, changes in boundary conditions, and slight differences among nominally-identical structures can also affect stiffness, producing frequency changes that mimic or mask damage. This variability has limited the practical implementation and generalisation of SHM technologies. The aim of this work is to investigate the effects of normal variation and to identify methods that account for the resulting uncertainty. This work considers vibration data collected from a set of four healthy full-scale composite helicopter blades. The blades were nominally-identical but distinct, and slight differences in material properties and geometry among the blades caused significant variability in the frequency response functions, which presented as four separate trajectories across the input space. In this paper, an overlapping mixture of Gaussian processes (OMGP) was used to generate labels and quantify the uncertainty of normal-condition frequency response data from the helicopter blades. Using a population-based approach, the OMGP model provided a generic representation, called a form, to characterise the normal condition of the blades. Additional simulated data were then compared against the form and evaluated for damage using a marginal-likelihood novelty index.
    Remote Sensing Change Detection (Segmentation) using Denoising Diffusion Probabilistic Models. (arXiv:2206.11892v1 [cs.CV])
    Human civilization has an increasingly powerful influence on the earth system, and earth observations are an invaluable tool for assessing and mitigating the negative impacts. To this end, observing precisely defined changes on Earth's surface is essential, and we propose an effective way to achieve this goal. Notably, our change detection (CD)/segmentation method proposes a novel way to incorporate the millions of off-the-shelf, unlabeled remote sensing images available through different earth observation programs into the training process, through denoising diffusion probabilistic models. We first leverage the information from these off-the-shelf, uncurated, and unlabeled remote sensing images by using a pre-trained denoising diffusion probabilistic model, and then employ the multi-scale feature representations from the diffusion model decoder to train a lightweight CD classifier to detect precise changes. The experiments performed on four publicly available CD datasets show that the proposed approach achieves remarkably better results than the state-of-the-art methods in F1, IoU, and overall accuracy. Code and pre-trained models are available at: https://github.com/wgcban/ddpm-cd
    Context-based Virtual Adversarial Training for Text Classification with Noisy Labels. (arXiv:2206.11851v1 [cs.CL])
    Deep neural networks (DNNs) have a high capacity to completely memorize noisy labels given sufficient training time, and this memorization, unfortunately, leads to performance degradation. Recently, virtual adversarial training (VAT) has attracted attention, as it can further improve the generalization of DNNs in semi-supervised learning. The driving force behind VAT is to prevent the models from overfitting data points by enforcing consistency between the inputs and the perturbed inputs. This strategy can be helpful in learning from noisy labels if it prevents neural models from learning noisy samples while encouraging the models to generalize from clean samples. In this paper, we propose context-based virtual adversarial training (ConVAT) to prevent a text classifier from overfitting to noisy labels. Unlike previous works, the proposed method performs the adversarial training at the context level rather than on the inputs. It makes the classifier learn not only a sample's label but also its contextual neighbors, which alleviates learning from noisy labels by preserving contextual semantics on each data point. We conduct extensive experiments on four text classification datasets with two types of label noise. Comprehensive experimental results clearly show that the proposed method works quite well, even in extremely noisy settings.
    Improving decision-making via risk-based active learning: Probabilistic discriminative classifiers. (arXiv:2206.11616v1 [cs.LG])
    Gaining the ability to make informed decisions on operation and maintenance of structures provides motivation for the implementation of structural health monitoring (SHM) systems. However, descriptive labels for measured data corresponding to health-states of the monitored system are often unavailable. This issue limits the applicability of fully-supervised machine learning paradigms for the development of statistical classifiers to be used in decision-support in SHM systems. One approach to dealing with this problem is risk-based active learning. In such an approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For risk-based active learning in SHM, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. In the context of SHM, risk-based active learning has only been considered for generative classifiers. The current paper demonstrates several advantages of using an alternative type of classifier -- discriminative models. Using the Z24 Bridge dataset as a case study, it is shown that discriminative classifiers have benefits, in the context of SHM decision-support, including improved robustness to sampling bias, and reduced expenditure on structural inspections.
    Inductive Conformal Prediction: A Straightforward Introduction with Examples in Python. (arXiv:2206.11810v1 [stat.ML])
    Inductive Conformal Prediction (ICP) is a set of distribution-free and model-agnostic algorithms devised to predict with a user-defined confidence and a coverage guarantee. Instead of producing \textit{point predictions}, i.e., a real number in the case of regression or a single class in multi-class classification, models calibrated using ICP output an interval or a set of classes, respectively. ICP takes on special importance in high-risk settings where we want the real output to belong to the prediction set with high probability. As an example, a classification model might output that, given a magnetic resonance image, a patient has no latent diseases to report. However, this output is based only on the most likely class; the second most likely class might indicate a 15\% chance of a brain tumor or another severe disease, in which case further exams should be conducted. Using ICP is therefore far more informative, and we believe it should be the standard way of producing forecasts. This paper is a hands-on introduction: we provide examples as we introduce the theory.
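    In the paper's hands-on spirit, here is a minimal split (inductive) conformal sketch for regression: absolute residuals on a calibration set give a quantile that turns point predictions into intervals with approximate 1 - alpha coverage. The data and model are our own stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=2000)

X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

alpha = 0.1
scores = np.abs(y_cal - model.predict(X_cal))                 # nonconformity scores
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction

pred = model.predict([[0.5]])[0]
print(f"90% prediction interval: [{pred - q:.2f}, {pred + q:.2f}]")
```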
    On the Generalizability and Predictability of Recommender Systems. (arXiv:2206.11886v1 [cs.IR])
    While other areas of machine learning have seen more and more automation, designing a high-performing recommender system still requires a high level of human effort. Furthermore, recent work has shown that modern recommender system algorithms do not always improve over well-tuned baselines. A natural follow-up question is, "how do we choose the right algorithm for a new dataset and performance metric?" In this work, we start by giving the first large-scale study of recommender system approaches by comparing 18 algorithms and 100 sets of hyperparameters across 85 datasets and 315 metrics. We find that the best algorithms and hyperparameters are highly dependent on the dataset and performance metric, however, there are also strong correlations between the performance of each algorithm and various meta-features of the datasets. Motivated by these findings, we create RecZilla, a meta-learning approach to recommender systems that uses a model to predict the best algorithm and hyperparameters for new, unseen datasets. By using far more meta-training data than prior work, RecZilla is able to substantially reduce the level of human involvement when faced with a new recommender system application. We not only release our code and pretrained RecZilla models, but also all of our raw experimental results, so that practitioners can train a RecZilla model for their desired performance metric: https://github.com/naszilla/reczilla.
    Optimizing paper production through digitalization by developing an assistance system for machine operators including quality forecast: a concept. (arXiv:2206.11581v1 [eess.SY])
    Challenges that nowadays range across industries include reducing greenhouse gas emissions and enabling a circular economy. However, producing paper from waste paper is still a highly resource-intensive task, especially in terms of energy consumption. While paper machines produce a lot of data, we have identified that this data is underutilized, and we implement a concept using an operator assistance system and state-of-the-art machine learning techniques, e.g., classification, forecasting, and alarm-flood-handling algorithms, to support daily operator tasks. Our main objective is to provide situation-specific knowledge to machine operators utilizing the available data. We expect this to result in better-adjusted parameters and therefore a lower footprint for the paper machines.
    Few-Shot Non-Parametric Learning with Deep Latent Variable Model. (arXiv:2206.11573v1 [cs.LG])
    Most real-world problems that machine learning algorithms are expected to solve face the situation with 1) unknown data distribution; 2) little domain-specific knowledge; and 3) datasets with limited annotation. We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled ones. By only training a generative model in an unsupervised way, the framework utilizes the data distribution to build a compressor. Using a compressor-based distance metric derived from Kolmogorov complexity, together with few labeled data, NPC-LV classifies without further training. We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime and even outperforms semi-supervised learning methods on CIFAR-10. We demonstrate how and when the negative evidence lower bound (nELBO) can be used as an approximate compressed length for classification. By revealing the correlation between compression rate and classification accuracy, we illustrate that under NPC-LV, improvements in generative models can enhance downstream classification accuracy.
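    The compressor-based distance is in the spirit of the normalized compression distance (NCD); the toy sketch below uses zlib in place of the learned generative compressor, which is our simplification, not the paper's model:

        import zlib

        def ncd(x: bytes, y: bytes) -> float:
            # Normalized compression distance derived from compressed lengths.
            cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
            return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)

        def classify_1nn(item: bytes, labeled_examples):
            # labeled_examples: iterable of (bytes, label) pairs; no training involved.
            return min(labeled_examples, key=lambda ex: ncd(item, ex[0]))[1]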
    Functional Nonlinear Learning. (arXiv:2206.11424v1 [stat.ML])
    Using representations of functional data can be more convenient and beneficial in subsequent statistical models than direct observations. These representations, in a lower-dimensional space, extract and compress information from individual curves. The existing representation learning approaches in functional data analysis usually use linear mapping in parallel to those from multivariate analysis, e.g., functional principal component analysis (FPCA). However, functions, as infinite-dimensional objects, sometimes have nonlinear structures that cannot be uncovered by linear mapping. Linear methods are all the more overwhelmed by multivariate functional data. To address this, this paper proposes a functional nonlinear learning (FunNoL) method to sufficiently represent multivariate functional data in a lower-dimensional feature space. Furthermore, we incorporate a classification model to enrich the representations' ability to predict curve labels. Hence, representations from FunNoL can be used for both curve reconstruction and classification. Additionally, we endow the proposed model with the ability to address the missing observation problem as well as to further denoise observations. The resulting representations are robust to observations that are locally disturbed by uncontrollable random noises. We apply the proposed FunNoL method to several real data sets and show that FunNoL can achieve better classifications than FPCA, especially in the multivariate functional data setting. Simulation studies show that FunNoL provides satisfactory curve classification and reconstruction regardless of data sparsity.
    EFFGAN: Ensembles of fine-tuned federated GANs. (arXiv:2206.11682v1 [cs.LG])
    Generative adversarial networks have proven to be a powerful tool for learning complex and high-dimensional data distributions, but issues such as mode collapse have been shown to make it difficult to train them. This is an even harder problem when the data is decentralized over several clients in a federated learning setup, as problems such as client drift and non-iid data make it hard for federated averaging to converge. In this work, we study the task of how to learn a data distribution when training data is heterogeneously decentralized over clients and cannot be shared. Our goal is to sample from this distribution centrally, while the data never leaves the clients. We show using standard benchmark image datasets that existing approaches fail in this setting, experiencing so-called client drift when the local number of epochs becomes too large. We thus propose a novel approach we call EFFGAN: Ensembles of fine-tuned federated GANs. Being an ensemble of local expert generators, EFFGAN is able to learn the data distribution over all clients and mitigate client drift. It is able to train with a large number of local epochs, making it more communication-efficient than previous works.
    Utilizing Expert Features for Contrastive Learning of Time-Series Representations. (arXiv:2206.11517v1 [cs.LG])
    We present an approach that incorporates expert knowledge for time-series representation learning. Our method employs expert features to replace the commonly used data transformations in previous contrastive learning approaches. We do this since time-series data frequently stems from the industrial or medical field where expert features are often available from domain experts, while transformations are generally elusive for time-series data. We start by proposing two properties that useful time-series representations should fulfill and show that current representation learning approaches do not ensure these properties. We therefore devise ExpCLR, a novel contrastive learning approach built on an objective that utilizes expert features to encourage both properties for the learned representation. Finally, we demonstrate on three real-world time-series datasets that ExpCLR surpasses several state-of-the-art methods for both unsupervised and semi-supervised representation learning.
    Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation. (arXiv:2206.11489v1 [cs.LG])
    We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in Jin et al. (2020) and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
    GACT: Activation Compressed Training for General Architectures. (arXiv:2206.11357v1 [cs.LG])
    Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss.
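    A toy illustration of the ACT idea, not the GACT library's API: keep only a quantized copy of the activation for the backward pass (PyTorch assumed; 8-bit uniform quantization is an arbitrary choice here):

        import torch

        class CompressedReLU(torch.autograd.Function):
            """ReLU that saves an int8 copy of its activation instead of float32."""
            @staticmethod
            def forward(ctx, x):
                y = x.clamp(min=0)
                ctx.scale = y.abs().max().clamp(min=1e-12) / 127.0
                ctx.save_for_backward((y / ctx.scale).round().to(torch.int8))
                return y

            @staticmethod
            def backward(ctx, grad_out):
                (q,) = ctx.saved_tensors
                y = q.to(grad_out.dtype) * ctx.scale   # dequantize
                return grad_out * (y > 0)              # ReLU gradient mask

        x = torch.randn(4, 8, requires_grad=True)
        CompressedReLU.apply(x).sum().backward()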
    A Framework for Understanding Model Extraction Attack and Defense. (arXiv:2206.11480v1 [cs.LG])
    The privacy of machine learning models has become a significant concern in many emerging Machine-Learning-as-a-Service applications, where prediction services based on well-trained models are offered to users via pay-per-query. The lack of a defense mechanism can impose a high risk on the privacy of the server's model since an adversary could efficiently steal the model by querying only a few `good' data points. The interplay between a server's defense and an adversary's attack inevitably leads to an arms race dilemma, as commonly seen in Adversarial Machine Learning. To study the fundamental tradeoffs between model utility from a benign user's view and privacy from an adversary's view, we develop new metrics to quantify such tradeoffs, analyze their theoretical properties, and develop an optimization problem to understand the optimal adversarial attack and defense strategies. The developed concepts and theory match the empirical findings on the `equilibrium' between privacy and utility. In terms of optimization, the key ingredient that enables our results is a unified representation of the attack-defense problem as a min-max bi-level problem. The developed results will be demonstrated by examples and experiments.
    Few-shot Long-Tailed Bird Audio Recognition. (arXiv:2206.11260v1 [cs.SD])
    It is easier to hear birds than see them. However, they still play an essential role in nature and are excellent indicators of deteriorating environmental quality and pollution. Recent advances in Machine Learning and Convolutional Neural Networks allow us to process continuous audio data to detect and classify bird sounds. This technology can assist researchers in monitoring bird populations' status and trends and ecosystems' biodiversity. We propose a sound detection and classification pipeline to analyze complex soundscape recordings and identify birdcalls in the background. Our method learns from weak labels and few data and acoustically recognizes the bird species. Our solution achieved 18th place of 807 teams at the BirdCLEF 2022 Challenge hosted on Kaggle.
    Synthetic Data-Based Simulators for Recommender Systems: A Survey. (arXiv:2206.11338v1 [cs.IR])
    This survey aims at providing a comprehensive overview of the recent trends in the field of modeling and simulation (M&S) of interactions between users and recommender systems and applications of the M&S to the performance improvement of industrial recommender engines. We start with the motivation behind the development of frameworks implementing the simulations -- simulators -- and the usage of them for training and testing recommender systems of different types (including Reinforcement Learning ones). Furthermore, we provide a new consistent classification of existing simulators based on their functionality, approbation, and industrial effectiveness, and summarize the simulators found in the research literature. Among other things, we discuss the building blocks of simulators: methods for synthetic data (user, item, user-item responses) generation, methods for what-if experimental analysis, methods and datasets used for simulation quality evaluation (including the methods that monitor and/or close possible simulation-to-reality gaps), and methods for summarization of experimental simulation results. Finally, this survey considers emerging topics and open problems in the field.
    Measurement and applications of position bias in a marketplace search engine. (arXiv:2206.11720v1 [cs.IR])
    Search engines intentionally influence user behavior by picking and ranking the list of results. Users engage with the highest results both because of their prominent placement and because they are typically the most relevant documents. Search engine ranking algorithms need to identify relevance while incorporating the influence of the search engine itself. This paper describes our efforts at Thumbtack to understand the impact of ranking, including the empirical results of a randomization program. In the context of a consumer marketplace we discuss practical details of model choice, experiment design, bias calculation, and machine learning model adaptation. We include a novel discussion of how ranking bias may not only affect labels, but also model features. The randomization program led to improved models, motivated internal scenario analysis, and enabled user-facing scenario tooling.
    Context matters for fairness -- a case study on the effect of spatial distribution shifts. (arXiv:2206.11436v1 [cs.LG])
    With the ever-growing involvement of data-driven, AI-based decision-making technologies in our daily social lives, the fairness of these systems is becoming a crucial concern. However, an important and often challenging aspect of utilizing such systems is to determine the validity of their range of application, especially under distribution shifts, i.e., when a model is deployed on data with a different distribution than the training set. In this paper, we present a case study on the newly released American Census datasets, a reconstruction of the popular Adult dataset, to illustrate the importance of context for fairness and to show how remarkably spatial distribution shifts can affect the predictive and fairness-related performance of a model. The problem persists for fairness-aware learning models, with the effects of context-specific fairness interventions differing across states and population groups. Our study suggests that robustness to distribution shifts is necessary before deploying a model in another context.
    Learning Towards the Largest Margins. (arXiv:2206.11589v1 [cs.CV])
    One of the main challenges for feature representation in deep learning-based classification is the design of appropriate loss functions that exhibit strong discriminative power. The classical softmax loss does not explicitly encourage discriminative learning of features. A popular direction of research is to incorporate margins in well-established losses in order to enforce extra intra-class compactness and inter-class separability, which, however, were developed through heuristic means, as opposed to rigorous mathematical principles. In this work, we attempt to address this limitation by formulating the principled optimization objective as learning towards the largest margins. Specifically, we firstly define the class margin as the measure of inter-class separability, and the sample margin as the measure of intra-class compactness. Accordingly, to encourage discriminative representation of features, the loss function should promote the largest possible margins for both classes and samples. Furthermore, we derive a generalized margin softmax loss to draw general conclusions for the existing margin-based losses. Not only does this principled framework offer new perspectives to understand and interpret existing margin-based losses, but it also provides new insights that can guide the design of new tools, including sample margin regularization and largest margin softmax loss for the class-balanced case, and zero-centroid regularization for the class-imbalanced case. Experimental results demonstrate the effectiveness of our strategy on a variety of tasks, including visual classification, imbalanced classification, person re-identification, and face verification.
    Positive-Unlabeled Learning with Adversarial Data Augmentation for Knowledge Graph Completion. (arXiv:2205.00904v3 [cs.LG] UPDATED)
    Most real-world knowledge graphs (KG) are far from complete and comprehensive. This problem has motivated efforts in predicting the most plausible missing facts to complete a given KG, i.e., knowledge graph completion (KGC). However, existing KGC methods suffer from two main issues, 1) the false negative issue, i.e., the sampled negative training instances may include potential true facts; and 2) the data sparsity issue, i.e., true facts account for only a tiny part of all possible facts. To this end, we propose positive-unlabeled learning with adversarial data augmentation (PUDA) for KGC. In particular, PUDA tailors positive-unlabeled risk estimator for the KGC task to deal with the false negative issue. Furthermore, to address the data sparsity issue, PUDA achieves a data augmentation strategy by unifying adversarial training and positive-unlabeled learning under the positive-unlabeled minimax game. Extensive experimental results on real-world benchmark datasets demonstrate the effectiveness and compatibility of our proposed method.
    FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search. (arXiv:2206.11408v1 [cs.LG])
    Approximate K-Nearest Neighbor Search (AKNNS) has now become ubiquitous in modern applications, for example, as a fast search procedure with two tower deep learning models. Graph-based methods for AKNNS in particular have received great attention due to their superior performance. These methods rely on greedy graph search to traverse the data points as embedding vectors in a database. Under this greedy search scheme, we make a key observation: many distance computations do not influence search updates so these computations can be approximated without hurting performance. As a result, we propose FINGER, a fast inference method to achieve efficient graph search. FINGER approximates the distance function by estimating angles between neighboring residual vectors with low-rank bases and distribution matching. The approximated distance can be used to bypass unnecessary computations, which leads to faster searches. Empirically, accelerating a popular graph-based method named HNSW by FINGER is shown to outperform existing graph-based methods by 20%-60% across different benchmark datasets.
    Neural Implicit Manifold Learning for Topology-Aware Generative Modelling. (arXiv:2206.11267v1 [stat.ML])
    Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. Current generative models represent this manifold by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. Such procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. To learn the data distribution within $\mathcal{M}$, we introduce constrained energy-based models, which use a constrained variant of Langevin dynamics to train and sample within the learned manifold. The resulting model can be manipulated with an arithmetic of manifolds which allows practitioners to take unions and intersections of model manifolds. In experiments on synthetic and natural data, we show that constrained EBMs can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
    Disentangling representations in Restricted Boltzmann Machines without adversaries. (arXiv:2206.11600v1 [cs.LG])
    A goal of unsupervised machine learning is to disentangle representations of complex high-dimensional data, allowing for interpreting the significant latent factors of variation in the data as well as for manipulating them to generate new data with desirable features. These methods often rely on an adversarial scheme, in which representations are tuned to avoid discriminators from being able to reconstruct specific data information (labels). We propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated on the MNIST dataset, the two-dimensional Ising model, and a taxonomy of protein families. In addition, we show how our framework allows for computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.
    Offline RL for Natural Language Generation with Implicit Language Q Learning. (arXiv:2206.11871v1 [cs.CL])
    Large language models distill broad knowledge from text corpora. However, they can be inconsistent when it comes to completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL motivated method, implicit language Q-learning (ILQL), designed for use on language models, that combines the flexible utility optimization framework of traditional RL algorithms with supervised learning's ability to leverage existing data, as well as its simplicity and stability. Our method, based on dynamic programming, employs a blend of value conservatism alongside an implicit dataset support constraint in learning value functions, which are then used to guide language model generations towards maximizing utility. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings, demonstrating how it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and how it can effectively optimize high variance reward functions based on subjective judgement, such as whether to label a comment as an example of toxic speech or not.
    Prompt Injection: Parameterization of Fixed Inputs. (arXiv:2206.11349v1 [cs.LG])
    Recent works have shown that attaching prompts to the input is effective at conditioning Language Models (LM) to perform specific tasks. However, prompts are always included in the input text during inference, thus incurring substantial computational and memory overhead. Also, there is currently no straightforward method of utilizing prompts that are longer than the maximum input length of the LMs without incurring additional costs during inference. We propose Prompt Injection (PI), a novel formulation of injecting the prompt into the parameters of an LM to be an efficient alternative to attaching fixed prompts to the input. We show that in scenarios with long fixed prompts, PI can be up to 280 times more efficient in terms of total FLOPs than previous approaches. We further explore methodologies for PI and show promising results in persona-dependent conversation, semantic parsing, and zero-shot learning with task instructions. Through these explorations, we show that PI can be a promising direction for conditioning language models, especially in scenarios with long and fixed prompts.  ( 2 min )
    Latent Policies for Adversarial Imitation Learning. (arXiv:2206.11299v1 [cs.LG])
    This paper considers learning robot locomotion and manipulation tasks from expert demonstrations. Generative adversarial imitation learning (GAIL) trains a discriminator that distinguishes expert from agent transitions, and in turn uses a reward defined by the discriminator output to optimize a policy generator for the agent. This generative adversarial training approach is very powerful but depends on a delicate balance between the discriminator and the generator training. In high-dimensional problems, the discriminator training may easily overfit or exploit associations with task-irrelevant features for transition classification. A key insight of this work is that performing imitation learning in a suitable latent task space makes the training process stable, even in challenging high-dimensional problems. We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL). The encoder-decoder model can be trained offline from state-action pairs to obtain a task-agnostic latent action representation or online, simultaneously with the discriminator and generator training, to obtain a task-aware latent action representation. We demonstrate that LAPAL training is stable, with near-monotonic performance improvement, and achieves expert performance in most locomotion and manipulation tasks, while a GAIL baseline converges slower and does not achieve expert performance in high-dimensional environments.  ( 2 min )
    The ArtBench Dataset: Benchmarking Generative Models with Artworks. (arXiv:2206.11404v1 [cs.CV])
    We introduce ArtBench-10, the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced while most previous artwork datasets suffer from the long tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures. We provide three versions of the dataset with different resolutions ($32\times32$, $256\times256$, and original image size), formatted in a way that is easy to be incorporated by popular machine learning frameworks. We also conduct extensive benchmarking experiments using representative image synthesis models with ArtBench-10 and present in-depth analysis. The dataset is available at https://github.com/liaopeiyuan/artbench under a Fair Use license.  ( 2 min )
    Attention-aware contrastive learning for predicting T cell receptor-antigen binding specificity. (arXiv:2206.11255v1 [q-bio.QM])
    It has been verified that only a small fraction of the neoantigens presented by MHC class I molecules on the cell surface can elicit T cells. The limitation can be attributed to the binding specificity of T cell receptor (TCR) to peptide-MHC complex (pMHC). Computational prediction of T cell binding to neoantigens is a challenging and unresolved task. In this paper, we propose an attentive-mask contrastive learning model, ATMTCR, for inferring TCR-antigen binding specificity. For each input TCR sequence, we used a Transformer encoder to transform it to a latent representation, and then masked a proportion of residues guided by attention weights to generate its contrastive view. Pretraining on large-scale TCR CDR3 sequences, we verified that contrastive learning significantly improved the prediction performance of TCR binding to peptide-MHC complex (pMHC). Beyond the detection of important amino acids and their locations in the TCR sequence, our model can also extract high-order semantic information underlying the TCR-antigen binding specificity. Comparison experiments were conducted on two independent datasets, where our method achieved better performance than other existing algorithms. Moreover, we effectively identified important amino acids and their positional preferences through attention weights, which indicates the interpretability of our proposed model.  ( 2 min )
    Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer. (arXiv:2206.11326v1 [cs.LG])
    In many real-world applications, reinforcement learning (RL) agents might have to solve multiple tasks, each one typically modeled via a reward function. If reward functions are expressed linearly, and the agent has previously learned a set of policies for different tasks, successor features (SFs) can be exploited to combine such policies and identify reasonable solutions for new problems. However, the identified solutions are not guaranteed to be optimal. We introduce a novel algorithm that addresses this limitation. It allows RL agents to combine existing policies and directly identify optimal policies for arbitrary new problems, without requiring any further interactions with the environment. We first show (under mild assumptions) that the transfer learning problem tackled by SFs is equivalent to the problem of learning to optimize multiple objectives in RL. We then introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set. We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks, without requiring any additional training samples. We empirically show that our method outperforms state-of-the-art competing algorithms both in discrete and continuous domains under value function approximation.  ( 2 min )
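    The transfer step rests on generalized policy improvement over successor features; a small numpy sketch of action selection at one state, with shapes assumed purely for illustration:

        import numpy as np

        def gpi_action(sf_library, w):
            # sf_library: list of (n_actions, d) arrays, one per stored policy,
            # holding successor features psi_i(s, a) at the current state.
            # w: (d,) reward weights describing the new linearly-expressible task.
            q = np.stack([psi @ w for psi in sf_library])  # (n_policies, n_actions)
            return int(np.argmax(q.max(axis=0)))           # act greedily over the library

        rng = np.random.default_rng(0)
        a = gpi_action([rng.normal(size=(4, 8)) for _ in range(3)], rng.normal(size=8))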
    Optimally Weighted Ensembles of Regression Models: Exact Weight Optimization and Applications. (arXiv:2206.11263v1 [cs.LG])
    Automated model selection is often proposed to users to choose which machine learning model (or method) to apply to a given regression task. In this paper, we show that combining different regression models can yield better results than selecting a single ('best') regression model, and outline an efficient method that obtains an optimally weighted convex linear combination from a heterogeneous set of regression models. More specifically, in this paper, a heuristic weight optimization, used in a preceding conference paper, is replaced by an exact optimization algorithm using convex quadratic programming. We prove convexity of the quadratic programming formulation for the straightforward formulation and for a formulation with weighted data points. The novel weight optimization is not only (more) exact but also more efficient. The methods we develop in this paper are implemented and made available as open source on GitHub. They can be executed on commonly available hardware and offer a transparent and easy-to-interpret interface. The results indicate that the approach outperforms model selection methods on a range of data sets, including data sets with mixed variable type from drug discovery applications.  ( 2 min )
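    The weight optimization is a small convex quadratic program; one way to sketch it with scipy's SLSQP solver (a dedicated QP solver would do equally well):

        import numpy as np
        from scipy.optimize import minimize

        def optimal_convex_weights(preds, y):
            # preds: (n_samples, n_models) stacked predictions; y: (n_samples,) targets.
            # Minimize ||preds @ w - y||^2 subject to w >= 0 and sum(w) == 1.
            m = preds.shape[1]
            res = minimize(
                fun=lambda w: np.sum((preds @ w - y) ** 2),
                jac=lambda w: 2.0 * preds.T @ (preds @ w - y),
                x0=np.full(m, 1.0 / m),
                bounds=[(0.0, 1.0)] * m,
                constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
                method="SLSQP",
            )
            return res.x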
    Efficient Adaptive Federated Optimization of Federated Learning for IoT. (arXiv:2206.11448v1 [cs.LG])
    The proliferation of the Internet of Things (IoT) and widespread use of devices with sensing, computing, and communication capabilities have motivated intelligent applications empowered by artificial intelligence. Classical artificial intelligence algorithms require centralized data collection and processing, which are challenging in realistic intelligent IoT applications due to growing data privacy concerns and distributed datasets. Federated Learning (FL) has emerged as a distributed privacy-preserving learning framework that enables IoT devices to train a global model through sharing model parameters. However, inefficiency due to frequent parameter transmissions significantly reduces FL performance. Existing acceleration algorithms consist of two main types: local update methods, which consider trade-offs between communication and computation, and parameter compression methods, which consider trade-offs between communication and precision. Jointly considering these two trade-offs and adaptively balancing their impacts on convergence have remained unresolved. To solve the problem, this paper proposes a novel efficient adaptive federated optimization (EAFO) algorithm to improve the efficiency of FL, which minimizes the learning error via jointly considering two variables, local update and parameter compression, and enables FL to adaptively adjust the two variables and balance trade-offs among computation, communication, and precision. The experiment results illustrate that compared with state-of-the-art algorithms, the proposed EAFO can achieve higher accuracies faster.  ( 2 min )
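    One of the two knobs being balanced is parameter compression; a generic top-k sparsifier of a client update, in numpy, is shown purely for illustration (not the paper's exact operator):

        import numpy as np

        def topk_sparsify(update, keep_ratio=0.1):
            # Transmit only the largest-magnitude entries of the model update.
            flat = update.ravel()
            k = max(1, int(keep_ratio * flat.size))
            idx = np.argpartition(np.abs(flat), -k)[-k:]
            out = np.zeros_like(flat)
            out[idx] = flat[idx]
            return out.reshape(update.shape)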
    Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation. (arXiv:2206.11403v1 [cs.LG])
    It has been a long-standing dream to design artificial agents that explore their environment efficiently via intrinsic motivation, similar to how children perform curious free play. Despite recent advances in intrinsically motivated reinforcement learning (RL), sample-efficient exploration in object manipulation scenarios remains a significant challenge as most of the relevant information lies in the sparse agent-object and object-object interactions. In this paper, we propose to use structured world models to incorporate relational inductive biases in the control loop to achieve sample-efficient and interaction-rich exploration in compositional multi-object environments. By planning for future novelty inside structured world models, our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time. Instead of using models only to compute intrinsic rewards, as commonly done, our method showcases that the self-reinforcing cycle between good models and good exploration also opens up another avenue: zero-shot generalization to downstream tasks via model-based planning. After the entirely intrinsic task-agnostic exploration phase, our method solves challenging downstream tasks such as stacking, flipping, pick & place, and throwing, generalizing to unseen numbers and arrangements of objects without any additional training.
    Stochastic Langevin Differential Inclusions with Applications to Machine Learning. (arXiv:2206.11533v1 [math.OC])
    Stochastic differential equations of Langevin-diffusion form have received significant recent attention, thanks to their foundational role in both Bayesian sampling algorithms and optimization in machine learning. In the latter, they serve as a conceptual model of the stochastic gradient flow in training over-parametrized models. However, the literature typically assumes smoothness of the potential, whose gradient is the drift term. Nevertheless, there are many problems for which the potential function is not continuously differentiable, and hence the drift is not Lipschitz-continuous everywhere. This is exemplified by robust losses and Rectified Linear Units in regression problems. In this paper, we show some foundational results regarding the flow and asymptotic properties of Langevin-type Stochastic Differential Inclusions under assumptions appropriate to the machine-learning settings. In particular, we show strong existence of the solution, as well as asymptotic minimization of the canonical Free Energy Functional.
  • Open

    $\ell_{\infty}$-Bounds of the MLE in the BTL Model under General Comparison Graphs. (arXiv:2110.10825v2 [math.ST] UPDATED)
    The Bradley-Terry-Luce (BTL) model is a popular statistical approach for estimating the global ranking of a collection of items using pairwise comparisons. To ensure accurate ranking, it is essential to obtain precise estimates of the model parameters in the $\ell_{\infty}$-loss. The difficulty of this task depends crucially on the topology of the pairwise comparison graph over the given items. However, beyond very few well-studied cases, such as the complete and Erdős–Rényi comparison graphs, little is known about the performance of the maximum likelihood estimator (MLE) of the BTL model parameters in the $\ell_{\infty}$-loss under more general graph topologies. In this paper, we derive novel, general upper bounds on the $\ell_{\infty}$ estimation error of the BTL MLE that depend explicitly on the algebraic connectivity of the comparison graph, the maximal performance gap across items, and the sample complexity. We demonstrate that the derived bounds perform well and in some cases are sharper compared to known results obtained using different loss functions and more restricted assumptions and graph topologies. We carefully compare our results to Yan et al. (2012), which is closest in spirit to our work. We further provide minimax lower bounds under $\ell_{\infty}$-error that nearly match the upper bounds over a class of sufficiently regular graph topologies. Finally, we study the implications of our $\ell_{\infty}$-bounds for efficient (offline) tournament design. We illustrate and discuss our findings through various examples and simulations.
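    For reference, the MLE under study can be computed directly; a compact sketch that pins the first item's score to zero for identifiability (scipy assumed):

        import numpy as np
        from scipy.optimize import minimize

        def btl_mle(n_items, comparisons):
            # comparisons: list of (winner, loser) index pairs.
            def nll(free):
                theta = np.concatenate([[0.0], free])   # theta[0] fixed at 0
                d = np.array([theta[w] - theta[l] for w, l in comparisons])
                return np.sum(np.log1p(np.exp(-d)))     # -sum log sigmoid(d)
            res = minimize(nll, x0=np.zeros(n_items - 1), method="BFGS")
            return np.concatenate([[0.0], res.x])

        scores = btl_mle(3, [(0, 1), (0, 2), (1, 2), (0, 1)])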
    How causal machine learning can leverage marketing strategies: Assessing and improving the performance of a coupon campaign. (arXiv:2204.10820v2 [econ.GN] UPDATED)
    We apply causal machine learning algorithms to assess the causal effect of a marketing intervention, namely a coupon campaign, on the sales of a retailer. Besides assessing the average impacts of different types of coupons, we also investigate the heterogeneity of causal effects across different subgroups of customers, e.g., between clients with relatively high vs. low prior purchases. Finally, we use optimal policy learning to determine (in a data-driven way) which customer groups should be targeted by the coupon campaign in order to maximize the marketing intervention's effectiveness in terms of sales. We find that only two out of the five coupon categories examined, namely coupons applicable to the product categories of drugstore items and other food, have a statistically significant positive effect on retailer sales. The assessment of group average treatment effects reveals substantial differences in the impact of coupon provision across customer groups, particularly across customer groups as defined by prior purchases at the store, with drugstore coupons being particularly effective among customers with high prior purchases and other food coupons among customers with low prior purchases. Our study provides a use case for the application of causal machine learning in business analytics to evaluate the causal impact of specific firm policies (like marketing campaigns) for decision support.
    Bayesian Nonparametrics for Offline Skill Discovery. (arXiv:2202.04675v3 [cs.LG] UPDATED)
    Skills or low-level policies in reinforcement learning are temporally extended actions that can speed up learning and enable complex behaviours. Recent work in offline reinforcement learning and imitation learning has proposed several techniques for skill discovery from a set of expert trajectories. While these methods are promising, the number K of skills to discover is always a fixed hyperparameter, which requires either prior knowledge about the environment or an additional parameter search to tune it. We first propose a method for offline learning of options (a particular skill framework) exploiting advances in variational inference and continuous relaxations. We then highlight an unexplored connection between Bayesian nonparametrics and offline skill discovery, and show how to obtain a nonparametric version of our model. This version is tractable thanks to a carefully structured approximate posterior with a dynamically-changing number of options, removing the need to specify K. We also show how our nonparametric extension can be applied in other skill frameworks, and empirically demonstrate that our method can outperform state-of-the-art offline skill learning algorithms across a variety of environments. Our code is available at https://github.com/layer6ai-labs/BNPO .
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v3 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite-horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
    Sequential Importance Sampling for Hybrid Model Bayesian Inference to Support Bioprocess Mechanism Learning and Robust Control. (arXiv:2205.02410v3 [stat.ML] UPDATED)
    Driven by the critical needs of biomanufacturing 4.0, we introduce a probabilistic knowledge graph hybrid model characterizing the risk- and science-based understanding of bioprocess mechanisms. It can faithfully capture the important properties, including nonlinear reactions, partially observed state, and nonstationary dynamics. Given very limited real process observations, we derive a posterior distribution quantifying model estimation uncertainty. To avoid the evaluation of intractable likelihoods, Approximate Bayesian Computation sampling with Sequential Monte Carlo (ABC-SMC) is utilized to approximate the posterior distribution. Under high stochastic and model uncertainties, it is computationally expensive to match output trajectories. Therefore, we create a linear Gaussian dynamic Bayesian network (LG-DBN) auxiliary likelihood-based ABC-SMC approach. Through matching the summary statistics driven through LG-DBN likelihood that can capture critical interactions and variations, the proposed algorithm can accelerate hybrid model inference, support process monitoring, and facilitate mechanism learning and robust control.
    Do More Negative Samples Necessarily Hurt in Contrastive Learning?. (arXiv:2205.01789v2 [cs.LG] UPDATED)
    Recent investigations in noise contrastive estimation suggest, both empirically as well as theoretically, that while having more "negative samples" in the contrastive loss improves downstream classification performance initially, beyond a threshold, it hurts downstream performance due to a "collision-coverage" trade-off. But is such a phenomenon inherent in contrastive learning? We show in a simple theoretical setting, where positive pairs are generated by sampling from the underlying latent class (introduced by Saunshi et al. (ICML 2019)), that the downstream performance of the representation optimizing the (population) contrastive loss in fact does not degrade with the number of negative samples. Along the way, we give a structural characterization of the optimal representation in our framework, for noise contrastive estimation. We also provide empirical support for our theoretical results on CIFAR-10 and CIFAR-100 datasets.
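    For concreteness, the K-negative contrastive loss in question, sketched in PyTorch with unit-normalized embeddings assumed:

        import torch
        import torch.nn.functional as F

        def info_nce(anchor, positive, negatives, tau=0.1):
            # anchor, positive: (B, d); negatives: (B, K, d). Increasing K adds
            # more negative samples per anchor.
            pos = (anchor * positive).sum(-1, keepdim=True) / tau        # (B, 1)
            neg = torch.einsum("bd,bkd->bk", anchor, negatives) / tau    # (B, K)
            logits = torch.cat([pos, neg], dim=1)
            labels = torch.zeros(anchor.size(0), dtype=torch.long)       # positive at index 0
            return F.cross_entropy(logits, labels)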
    Subexponential-Time Algorithms for Sparse PCA. (arXiv:1907.11635v3 [math.ST] UPDATED)
    We study the computational cost of recovering a unit-norm sparse principal component $x \in \mathbb{R}^n$ planted in a random matrix, in either the Wigner or Wishart spiked model (observing either $W + \lambda xx^\top$ with $W$ drawn from the Gaussian orthogonal ensemble, or $N$ independent samples from $\mathcal{N}(0, I_n + \beta xx^\top)$, respectively). Prior work has shown that when the signal-to-noise ratio ($\lambda$ or $\beta\sqrt{N/n}$, respectively) is a small constant and the fraction of nonzero entries in the planted vector is $\|x\|_0 / n = \rho$, it is possible to recover $x$ in polynomial time if $\rho \lesssim 1/\sqrt{n}$. While it is possible to recover $x$ in exponential time under the weaker condition $\rho \ll 1$, it is believed that polynomial-time recovery is impossible unless $\rho \lesssim 1/\sqrt{n}$. We investigate the precise amount of time required for recovery in the "possible but hard" regime $1/\sqrt{n} \ll \rho \ll 1$ by exploring the power of subexponential-time algorithms, i.e., algorithms running in time $\exp(n^\delta)$ for some constant $\delta \in (0,1)$. For any $1/\sqrt{n} \ll \rho \ll 1$, we give a recovery algorithm with runtime roughly $\exp(\rho^2 n)$, demonstrating a smooth tradeoff between sparsity and runtime. Our family of algorithms interpolates smoothly between two existing algorithms: the polynomial-time diagonal thresholding algorithm and the $\exp(\rho n)$-time exhaustive search algorithm. Furthermore, by analyzing the low-degree likelihood ratio, we give rigorous evidence suggesting that the tradeoff achieved by our algorithms is optimal.
    Identify treatment effect patterns for personalised decisions. (arXiv:1906.06080v2 [stat.ME] UPDATED)
    In personalised decision making, evidence is required to determine whether an action (treatment) is suitable for an individual. Such evidence can be obtained by modelling treatment effect heterogeneity in subgroups. The existing interpretable modelling methods take a top-down approach to search for subgroups with heterogeneous treatment effects and they may miss the most specific and relevant context for an individual. In this paper, we design a Treatment Effect Pattern (TEP) to represent treatment effect heterogeneity in data. To achieve an interpretable presentation of TEPs, we use a local causal structure around the outcome to explicitly show how those important variables are used in modelling. We also derive a formula for unbiasedly estimating the Conditional Average Causal Effect (CATE) using the local structure in our problem setting. In the discovery process, we aim at minimising heterogeneity within each subgroup represented by a pattern. We propose a bottom-up search algorithm to discover the most specific patterns that best fit individual circumstances for personalised decision making. Experiments show that the proposed method models treatment effect heterogeneity better than three other existing tree-based methods on synthetic and real-world data sets.
    Approximation Benefits of Policy Gradient Methods with Aggregated States. (arXiv:2007.11684v3 [cs.LG] UPDATED)
    Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $\epsilon$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $\epsilon/(1-\gamma)$, where $\gamma$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
    Matrix-wise $\ell_0$-constrained Sparse Nonnegative Least Squares. (arXiv:2011.11066v4 [cs.LG] UPDATED)
    Nonnegative least squares problems with multiple right-hand sides (MNNLS) arise in models that rely on additive linear combinations. In particular, they are at the core of most nonnegative matrix factorization algorithms and have many applications. The nonnegativity constraint is known to naturally favor sparsity, that is, solutions with few non-zero entries. However, it is often useful to further enhance this sparsity, as it improves the interpretability of the results and helps reduce noise, which leads to the sparse MNNLS problem. In this paper, as opposed to most previous works that enforce sparsity column- or row-wise, we first introduce a novel formulation for sparse MNNLS, with a matrix-wise sparsity constraint. Then, we present a two-step algorithm to tackle this problem. The first step divides sparse MNNLS into subproblems, one per column of the original problem. It then uses different algorithms to produce, either exactly or approximately, a Pareto front for each subproblem, that is, to produce a set of solutions representing different tradeoffs between reconstruction error and sparsity. The second step selects solutions among these Pareto fronts in order to build a sparsity-constrained matrix that minimizes the reconstruction error. We perform experiments on facial and hyperspectral images, and we show that our proposed two-step approach provides more accurate results than state-of-the-art sparse coding heuristics applied both column-wise and globally.
    Chasing Convex Bodies and Functions with Black-Box Advice. (arXiv:2206.11780v1 [cs.LG])
    We consider the problem of convex function chasing with black-box advice, where an online decision-maker aims to minimize the total cost of making and switching between decisions in a normed vector space, aided by black-box advice such as the decisions of a machine-learned algorithm. The decision-maker seeks cost comparable to the advice when it performs well, known as $\textit{consistency}$, while also ensuring worst-case $\textit{robustness}$ even when the advice is adversarial. We first consider the common paradigm of algorithms that switch between the decisions of the advice and a competitive algorithm, showing that no algorithm in this class can improve upon 3-consistency while staying robust. We then propose two novel algorithms that bypass this limitation by exploiting the problem's convexity. The first, INTERP, achieves $(\sqrt{2}+\epsilon)$-consistency and $\mathcal{O}(\frac{C}{\epsilon^2})$-robustness for any $\epsilon > 0$, where $C$ is the competitive ratio of an algorithm for convex function chasing or a subclass thereof. The second, BDINTERP, achieves $(1+\epsilon)$-consistency and $\mathcal{O}(\frac{CD}{\epsilon})$-robustness when the problem has bounded diameter $D$. Further, we show that BDINTERP achieves near-optimal consistency-robustness trade-off for the special case where cost functions are $\alpha$-polyhedral.
    Hermite Polynomial Features for Private Data Generation. (arXiv:2106.05042v4 [cs.LG] UPDATED)
    Kernel mean embedding is a useful tool to represent and compare probability measures. Despite its usefulness, kernel mean embedding considers infinite-dimensional features, which are challenging to handle in the context of differentially private data generation. A recent work proposes to approximate the kernel mean embedding of data distribution using finite-dimensional random features, which yields analytically tractable sensitivity. However, the number of required random features is excessively high, often ten thousand to a hundred thousand, which worsens the privacy-accuracy trade-off. To improve the trade-off, we propose to replace random features with Hermite polynomial features. Unlike the random features, the Hermite polynomial features are ordered, where the features at the low orders contain more information on the distribution than those at the high orders. Hence, a relatively low order of Hermite polynomial features can more accurately approximate the mean embedding of the data distribution compared to a significantly higher number of random features. As demonstrated on several tabular and image datasets, Hermite polynomial features seem better suited for private data generation than random Fourier features.
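    A sketch of a low-order Hermite feature map and the resulting (non-private) mean embedding, using scipy's probabilists' Hermite polynomials; the order and normalization are illustrative choices, and the noise addition for differential privacy is omitted:

        import numpy as np
        from scipy.special import eval_hermitenorm, factorial

        def hermite_features(x, order=10):
            # x: (n,) scalar data -> (n, order + 1) features He_k(x) / sqrt(k!).
            ks = np.arange(order + 1)
            feats = np.stack([eval_hermitenorm(int(k), x) for k in ks], axis=1)
            return feats / np.sqrt(factorial(ks))

        def mean_embedding(x, order=10):
            # Average of per-point features approximates the kernel mean embedding.
            return hermite_features(x, order).mean(axis=0)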
    Factorization of the Partial Covariance in Singly-Connected Path Diagrams. (arXiv:2002.05226v6 [stat.ME] UPDATED)
    We extend path analysis by showing that, for a singly-connected path diagram, the partial covariance of two random variables factorizes over the nodes and edges in the path between the variables. This result allows us to determine the contribution of each node and edge to the partial covariance. It also allows us to show that Simpson's paradox cannot occur in singly-connected path diagrams.
    Diagnosing and Fixing Manifold Overfitting in Deep Generative Models. (arXiv:2204.07172v2 [stat.ML] UPDATED)
    Likelihood-based, or explicit, deep generative models use neural networks to construct flexible high-dimensional densities. This formulation directly contradicts the manifold hypothesis, which states that observed data lies on a low-dimensional manifold embedded in high-dimensional ambient space. In this paper we investigate the pathologies of maximum-likelihood training in the presence of this dimensionality mismatch. We formally prove that degenerate optima are achieved wherein the manifold itself is learned but not the distribution on it, a phenomenon we call manifold overfitting. We propose a class of two-step procedures consisting of a dimensionality reduction step followed by maximum-likelihood density estimation, and prove that they recover the data-generating distribution in the nonparametric regime, thus avoiding manifold overfitting. We also show that these procedures enable density estimation on the manifolds learned by implicit models, such as generative adversarial networks, hence addressing a major shortcoming of these models. Several recently proposed methods are instances of our two-step procedures; we thus unify, extend, and theoretically justify a large class of models.
    $p$-Laplacian Based Graph Neural Networks. (arXiv:2111.07337v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have demonstrated superior performance for semi-supervised node classification on graphs, as a result of their ability to exploit node features and topological information simultaneously. However, most GNNs implicitly assume that the labels of nodes and their neighbors in a graph are the same or consistent, which does not hold in heterophilic graphs, where the labels of linked nodes are likely to differ. Hence, when the topology is non-informative for label prediction, ordinary GNNs may work significantly worse than simply applying multi-layer perceptrons (MLPs) on each node. To tackle the above problem, we propose a new $p$-Laplacian based GNN model, termed as $^p$GNN, whose message passing mechanism is derived from a discrete regularization framework and could be theoretically explained as an approximation of a polynomial graph filter defined on the spectral domain of $p$-Laplacians. The spectral analysis shows that the new message passing mechanism works simultaneously as low-pass and high-pass filters, thus making $^p$GNNs effective on both homophilic and heterophilic graphs. Empirical studies on real-world and synthetic datasets validate our findings and demonstrate that $^p$GNNs significantly outperform several state-of-the-art GNN architectures on heterophilic benchmarks while achieving competitive performance on homophilic benchmarks. Moreover, $^p$GNNs can adaptively learn aggregation weights and are robust to noisy edges.
    Fock State-enhanced Expressivity of Quantum Machine Learning Models. (arXiv:2107.05224v2 [quant-ph] UPDATED)
    The data-embedding process is one of the bottlenecks of quantum machine learning, potentially negating any quantum speedups. In light of this, more effective data-encoding strategies are necessary. We propose a photonic-based bosonic data-encoding scheme that embeds classical data points using fewer encoding layers while circumventing the need for nonlinear optical components by mapping the data points into the high-dimensional Fock space. The expressive power of the circuit can be controlled via the number of input photons. Our work sheds some light on the unique advantages offered by quantum photonics for the expressive power of quantum machine learning models. By leveraging the photon-number dependent expressive power, we propose three different noisy intermediate-scale quantum-compatible binary classification methods with different scaling of required resources suitable for different supervised classification tasks.
    Wasserstein t-SNE. (arXiv:2205.07531v2 [cs.LG] UPDATED)
    Scientific datasets often have hierarchical structure: for example, in surveys, individual participants (samples) might be grouped at a higher level (units) such as their geographical region. In these settings, the interest is often in exploring the structure on the unit level rather than on the sample level. Units can be compared based on the distance between their means; however, this ignores the within-unit distribution of samples. Here we develop an approach for exploratory analysis of hierarchical datasets using the Wasserstein distance metric that takes into account the shapes of within-unit distributions. We use t-SNE to construct 2D embeddings of the units, based on the matrix of pairwise Wasserstein distances between them. The distance matrix can be efficiently computed by approximating each unit with a Gaussian distribution, but we also provide a scalable method to compute exact Wasserstein distances. We use synthetic data to demonstrate the effectiveness of our Wasserstein t-SNE, and apply it to data from the 2017 German parliamentary election, considering polling stations as samples and voting districts as units. The resulting embedding uncovers meaningful structure in the data.
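    Under the Gaussian approximation mentioned above, the 2-Wasserstein distance has a closed form; a brief sketch (scipy and scikit-learn assumed), with the resulting matrix fed to t-SNE:

        import numpy as np
        from scipy.linalg import sqrtm
        from sklearn.manifold import TSNE

        def gaussian_w2(mu1, S1, mu2, S2):
            # Closed-form W2 between N(mu1, S1) and N(mu2, S2).
            rs = sqrtm(S2)
            cross = np.real(sqrtm(rs @ S1 @ rs))
            w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
            return float(np.sqrt(max(w2_sq, 0.0)))

        # With D[i, j] = gaussian_w2(...) over all unit pairs:
        # embedding = TSNE(metric="precomputed", init="random").fit_transform(D)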
    Gradual Domain Adaptation via Normalizing Flows. (arXiv:2206.11492v1 [stat.ML])
Conventional domain adaptation methods do not work well when a large gap exists between the source and the target domain. Gradual domain adaptation is one approach to address the problem by leveraging intermediate domains, which gradually shift from the source to the target domain. Previous work assumed that the number of intermediate domains is large and the distance between adjacent domains is small; hence, the gradual domain adaptation algorithm by self-training with unlabeled datasets was applicable. In practice, however, gradual self-training will fail because the number of intermediate domains is limited and the distance between adjacent domains is large. We propose using normalizing flows to mitigate this problem while maintaining the framework of unsupervised domain adaptation. We generate pseudo intermediate domains from normalizing flows and then use them for gradual domain adaptation. We evaluate our method in experiments with real-world datasets and confirm that it mitigates the problem above and improves classification performance.  ( 2 min )
    Modular Conformal Calibration. (arXiv:2206.11468v1 [cs.LG])
    Uncertainty estimates must be calibrated (i.e., accurate) and sharp (i.e., informative) in order to be useful. This has motivated a variety of methods for recalibration, which use held-out data to turn an uncalibrated model into a calibrated model. However, the applicability of existing methods is limited due to their assumption that the original model is also a probabilistic model. We introduce a versatile class of algorithms for recalibration in regression that we call Modular Conformal Calibration (MCC). This framework allows one to transform any regression model into a calibrated probabilistic model. The modular design of MCC allows us to make simple adjustments to existing algorithms that enable well-behaved distribution predictions. We also provide finite-sample calibration guarantees for MCC algorithms. Our framework recovers isotonic recalibration, conformal calibration, and conformal interval prediction, implying that our theoretical results apply to those methods as well. Finally, we conduct an empirical study of MCC on 17 regression datasets. Our results show that new algorithms designed in our framework achieve near-perfect calibration and improve sharpness relative to existing methods.  ( 2 min )
    Bayesian model calibration for block copolymer self-assembly: Likelihood-free inference and expected information gain computation via measure transport. (arXiv:2206.11343v1 [physics.comp-ph])
    We consider the Bayesian calibration of models describing the phenomenon of block copolymer (BCP) self-assembly using image data produced by microscopy or X-ray scattering techniques. To account for the random long-range disorder in BCP equilibrium structures, we introduce auxiliary variables to represent this aleatory uncertainty. These variables, however, result in an integrated likelihood for high-dimensional image data that is generally intractable to evaluate. We tackle this challenging Bayesian inference problem using a likelihood-free approach based on measure transport together with the construction of summary statistics for the image data. We also show that expected information gains (EIGs) from the observed data about the model parameters can be computed with no significant additional cost. Lastly, we present a numerical case study based on the Ohta--Kawasaki model for diblock copolymer thin film self-assembly and top-down microscopy characterization. For calibration, we introduce several domain-specific energy- and Fourier-based summary statistics, and quantify their informativeness using EIG. We demonstrate the power of the proposed approach to study the effect of data corruptions and experimental designs on the calibration results.  ( 2 min )
    Physics-Informed Statistical Modeling for Wildfire Aerosols Process Using Multi-Source Geostationary Satellite Remote-Sensing Data Streams. (arXiv:2206.11766v1 [stat.AP])
Increasingly frequent wildfires significantly affect solar energy production as the atmospheric aerosols generated by wildfires diminish the incoming solar radiation to the earth. Atmospheric aerosols are measured by Aerosol Optical Depth (AOD), and AOD data streams can be retrieved and monitored by geostationary satellites. However, multi-source remote-sensing data streams often present heterogeneous characteristics, including different missing-data rates, measurement errors, systematic biases, and so on. To accurately estimate and predict the underlying AOD propagation process, there are both practical needs and theoretical interest in a physics-informed statistical approach for modeling wildfire AOD propagation that simultaneously utilizes, or fuses, multi-source heterogeneous satellite remote-sensing data streams. Leveraging a spectral approach, the proposed method integrates multi-source satellite data streams with a fundamental advection-diffusion equation that governs the AOD propagation process. A bias correction process is included in the statistical model to account for the bias of the physics model and the truncation error of the Fourier series. The proposed approach is applied to California wildfire AOD data streams obtained from the National Oceanic and Atmospheric Administration. Comprehensive numerical examples are provided to demonstrate the predictive capabilities and model interpretability of the proposed approach. Computer code has been made available on GitHub.  ( 2 min )
    Regression Trees on Grassmann Manifold for Adapting Reduced-Order Models. (arXiv:2206.11324v1 [stat.AP])
Low dimensional and computationally less expensive Reduced-Order Models (ROMs) have been widely used to capture the dominant behaviors of high-dimensional systems. A ROM can be obtained, using the well-known Proper Orthogonal Decomposition (POD), by projecting the full-order model onto a subspace spanned by modal basis modes which are learned from experimental, simulated or observational data, i.e., training data. However, the optimal basis can change with the parameter settings. When a ROM constructed using the POD basis obtained from training data is applied to new parameter settings, the model often lacks robustness against the change of parameters in design, control, and other real-time operation problems. This paper proposes to use regression trees on the Grassmann manifold to learn the mapping between parameters and the POD bases that span the low-dimensional subspaces onto which full-order models are projected. Motivated by the fact that a subspace spanned by a POD basis can be viewed as a point on the Grassmann manifold, we propose to grow a tree by repeatedly splitting each tree node to maximize the Riemannian distance between the two subspaces spanned by the predicted POD bases on the left and right daughter nodes. Five numerical examples are presented to comprehensively demonstrate the performance of the proposed method and to compare it to the existing interpolation method for POD bases and the use of a global POD basis. The results show that the proposed tree-based method is capable of establishing the mapping between parameters and POD bases, and can thus adapt ROMs to new parameters.  ( 3 min )
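As a hedged illustration of the splitting criterion (not the paper's code): an orthonormal POD basis spans a subspace, i.e., a point on the Grassmann manifold, and the Riemannian distance between two such points is the 2-norm of their principal angles, available via scipy.
```
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_distance(U, V):
    # U, V: (n, k) matrices with orthonormal columns spanning two subspaces.
    return np.linalg.norm(subspace_angles(U, V))

# Toy check: two orthogonal 1-D subspaces of R^2 are pi/2 apart.
U = np.array([[1.0], [0.0]])
V = np.array([[0.0], [1.0]])
print(grassmann_distance(U, V))  # ~1.5708
```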
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v1 [math.ST])
Bi-stochastic normalization of a kernelized graph affinity matrix provides an alternative normalization scheme for graph Laplacian methods in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations in practice. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to the manifold (weighted-)Laplacian, with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling $\epsilon \sim n^{-1/(d/2+3)}$. When the manifold data are corrupted by outlier noise, we theoretically prove graph Laplacian point-wise consistency that matches the rate for clean manifold data, up to an additional error term proportional to the boundedness of mutual inner products of the noise vectors. Our analysis suggests that, under the setting considered in this paper, an approximate bi-stochastic normalization, rather than an exact one, already achieves the same consistency rate. Motivated by this analysis, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination, and apply it to simulated manifold data, both clean and with outlier noise. Numerical experiments support our theoretical results and show the robustness of the bi-stochastically normalized graph Laplacian to outlier noise.  ( 3 min )
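For concreteness, here is a minimal sketch (with illustrative constants, not the paper's code) of bi-stochastic normalization of a symmetric kernel matrix via damped Sinkhorn-Knopp iterations:
```
import numpy as np

def sinkhorn_knopp(K, n_iter=500, tol=1e-8):
    # Find d such that diag(d) @ K @ diag(d) is approximately doubly stochastic.
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))  # damped symmetric SK update
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Usage on a Gaussian kernel of random data:
X = np.random.default_rng(0).normal(size=(100, 3))
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
K = np.exp(-sq / (2 * 0.5 ** 2))
d = sinkhorn_knopp(K)
W = d[:, None] * K * d[None, :]
print(W.sum(axis=1)[:5])  # rows (and columns) sum to ~1
```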
    Utilizing Expert Features for Contrastive Learning of Time-Series Representations. (arXiv:2206.11517v1 [cs.LG])
We present an approach that incorporates expert knowledge for time-series representation learning. Our method employs expert features to replace the commonly used data transformations in previous contrastive learning approaches. We do this since time-series data frequently stems from industrial or medical fields where expert features are often available from domain experts, while suitable transformations for time-series data are generally elusive. We start by proposing two properties that useful time-series representations should fulfill and show that current representation learning approaches do not ensure these properties. We therefore devise ExpCLR, a novel contrastive learning approach built on an objective that utilizes expert features to encourage both properties for the learned representation. Finally, we demonstrate on three real-world time-series datasets that ExpCLR surpasses several state-of-the-art methods for both unsupervised and semi-supervised representation learning.  ( 2 min )
    Minimax Optimal Fair Regression under Linear Model. (arXiv:2206.11546v1 [math.ST])
    We investigate the minimax optimal error of a fair regression problem under a linear model employing the demographic parity as a fairness constraint. As a tractable demographic parity constraint, we introduce $(\alpha,\delta)$-fairness consistency, meaning that the quantified unfairness is decreased at most $n^{-\alpha}$ rate with at least probability $1-\delta$, where $n$ is the sample size. In other words, the consistently fair algorithm eventually outputs a regressor satisfying the demographic parity constraint with high probability as $n$ tends to infinity. As a result of our analyses, we found that the minimax optimal error under the $(\alpha,\delta)$-fairness consistency constraint is $\Theta(\frac{dM}{n})$ provided that $\alpha \le \frac{1}{2}$, where $d$ is the dimensionality, and $M$ is the number of groups induced from the sensitive attributes. This is the first study revealing minimax optimality for the fair regression problem under a linear model.  ( 2 min )
    Neural Implicit Manifold Learning for Topology-Aware Generative Modelling. (arXiv:2206.11267v1 [stat.ML])
    Natural data observed in $\mathbb{R}^n$ is often constrained to an $m$-dimensional manifold $\mathcal{M}$, where $m < n$. Current generative models represent this manifold by mapping an $m$-dimensional latent variable through a neural network $f_\theta: \mathbb{R}^m \to \mathbb{R}^n$. Such procedures, which we call pushforward models, incur a straightforward limitation: manifolds cannot in general be represented with a single parameterization, meaning that attempts to do so will incur either computational instability or the inability to learn probability densities within the manifold. To remedy this problem, we propose to model $\mathcal{M}$ as a neural implicit manifold: the set of zeros of a neural network. To learn the data distribution within $\mathcal{M}$, we introduce constrained energy-based models, which use a constrained variant of Langevin dynamics to train and sample within the learned manifold. The resulting model can be manipulated with an arithmetic of manifolds which allows practitioners to take unions and intersections of model manifolds. In experiments on synthetic and natural data, we show that constrained EBMs can learn manifold-supported distributions with complex topologies more accurately than pushforward models.  ( 2 min )
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v1 [cs.LG])
We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a simulator, we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can even achieve zero constraint violation while still maintaining the same order with respect to $T$.  ( 3 min )
    Functional Nonlinear Learning. (arXiv:2206.11424v1 [stat.ML])
Using representations of functional data can be more convenient and beneficial in subsequent statistical models than direct observations. These representations, in a lower-dimensional space, extract and compress information from individual curves. The existing representation learning approaches in functional data analysis usually use linear mappings in parallel to those from multivariate analysis, e.g., functional principal component analysis (FPCA). However, functions, as infinite-dimensional objects, sometimes have nonlinear structures that cannot be uncovered by linear mappings, and linear methods are further limited when applied to multivariate functional data. To address this, this paper proposes a functional nonlinear learning (FunNoL) method to sufficiently represent multivariate functional data in a lower-dimensional feature space. Furthermore, we incorporate a classification model to enrich the ability of the representations to predict curve labels. Hence, representations from FunNoL can be used for both curve reconstruction and classification. Additionally, the proposed model can address the missing-observation problem and further denoise observations. The resulting representations are robust to observations that are locally disturbed by uncontrollable random noises. We apply the proposed FunNoL method to several real data sets and show that FunNoL can achieve better classification than FPCA, especially in the multivariate functional data setting. Simulation studies show that FunNoL provides satisfactory curve classification and reconstruction regardless of data sparsity.  ( 2 min )
    Projection-free Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v1 [math.OC])
    We study a projection-free conditional gradient-type algorithm for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we establish that the number of calls to the stochastic first-order oracle and the linear minimization oracle to obtain an appropriately defined $\epsilon$-stationary point, are of the order $\mathcal{O}(1/\epsilon^{2.5})$ and $\mathcal{O}(1/\epsilon^{5.5})$ respectively. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.  ( 2 min )
    Improving decision-making via risk-based active learning: Probabilistic discriminative classifiers. (arXiv:2206.11616v1 [cs.LG])
    Gaining the ability to make informed decisions on operation and maintenance of structures provides motivation for the implementation of structural health monitoring (SHM) systems. However, descriptive labels for measured data corresponding to health-states of the monitored system are often unavailable. This issue limits the applicability of fully-supervised machine learning paradigms for the development of statistical classifiers to be used in decision-support in SHM systems. One approach to dealing with this problem is risk-based active learning. In such an approach, data-label querying is guided according to the expected value of perfect information for incipient data points. For risk-based active learning in SHM, the value of information is evaluated with respect to a maintenance decision process, and the data-label querying corresponds to the inspection of a structure to determine its health state. In the context of SHM, risk-based active learning has only been considered for generative classifiers. The current paper demonstrates several advantages of using an alternative type of classifier -- discriminative models. Using the Z24 Bridge dataset as a case study, it is shown that discriminative classifiers have benefits, in the context of SHM decision-support, including improved robustness to sampling bias, and reduced expenditure on structural inspections.  ( 2 min )
    A generalised form for a homogeneous population of structures using an overlapping mixture of Gaussian processes. (arXiv:2206.11683v1 [cs.LG])
    Reductions in natural frequency are often used as a damage indicator for structural health monitoring (SHM) purposes. However, fluctuations in operational and environmental conditions, changes in boundary conditions, and slight differences among nominally-identical structures can also affect stiffness, producing frequency changes that mimic or mask damage. This variability has limited the practical implementation and generalisation of SHM technologies. The aim of this work is to investigate the effects of normal variation, and to identify methods that account for the resulting uncertainty. This work considers vibration data collected from a set of four healthy full-scale composite helicopter blades. The blades were nominally-identical but distinct, and slight differences in material properties and geometry among the blades caused significant variability in the frequency response functions, which presented as four separate trajectories across the input space. In this paper, an overlapping mixture of Gaussian processes (OMGP), was used to generate labels and quantify the uncertainty of normal-condition frequency response data from the helicopter blades. Using a population-based approach, the OMGP model provided a generic representation, called a form, to characterise the normal condition of the blades. Additional simulated data were then compared against the form and evaluated for damage using a marginal-likelihood novelty index.  ( 2 min )
    A Topological characterisation of Weisfeiler-Leman equivalence classes. (arXiv:2206.11876v1 [cs.LG])
    Graph Neural Networks (GNNs) are learning models aimed at processing graphs and signals on graphs. The most popular and successful GNNs are based on message passing schemes. Such schemes inherently have limited expressive power when it comes to distinguishing two non-isomorphic graphs. In this article, we rely on the theory of covering spaces to fully characterize the classes of graphs that GNNs cannot distinguish. We then generate arbitrarily many non-isomorphic graphs that cannot be distinguished by GNNs, leading to the GraphCovers dataset. We also show that the number of indistinguishable graphs in our dataset grows super-exponentially with the number of nodes. Finally, we test the GraphCovers dataset on several GNN architectures, showing that none of them can distinguish any two graphs it contains.  ( 2 min )
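To make the limitation concrete, here is a hedged sketch of the 1-dimensional Weisfeiler-Leman (color refinement) test that bounds the expressive power of message-passing GNNs; two disjoint triangles and a 6-cycle are non-isomorphic yet indistinguishable under it:
```
from collections import Counter

def wl_histogram(adj, n_rounds=3):
    # adj: adjacency list {node: [neighbors]}; returns the refined color multiset.
    colors = {v: 0 for v in adj}  # uniform initial coloring
    for _ in range(n_rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel signatures with small integers (color compression).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return Counter(colors.values())

# Two triangles vs. a 6-cycle: non-isomorphic but WL-equivalent.
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(wl_histogram(two_triangles) == wl_histogram(hexagon))  # True
```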
    Inductive Conformal Prediction: A Straightforward Introduction with Examples in Python. (arXiv:2206.11810v1 [stat.ML])
Inductive Conformal Prediction (ICP) is a set of distribution-free and model-agnostic algorithms devised to predict with a user-defined confidence and a coverage guarantee. Instead of producing point predictions (a real number in the case of regression or a single class in multi-class classification), models calibrated using ICP output an interval or a set of classes, respectively. ICP takes on special importance in high-risk settings where we want the real output to belong to the prediction set with high probability. As an example, a classification model might output that, given a magnetic resonance image, a patient has no latent diseases to report. However, this output is based on the most likely class; the second most likely class might indicate that the patient has a 15% chance of a brain tumor or other severe disease, and therefore further exams should be conducted. Using ICP is therefore far more informative, and we believe it should be the standard way of producing forecasts. This paper is a hands-on introduction, meaning that we provide examples as we introduce the theory.  ( 2 min )
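In the spirit of the paper's hands-on examples, a minimal split-conformal regression sketch (illustrative data and model, not the authors' code): calibrate absolute residuals on held-out data, then emit intervals with approximately 1 - alpha coverage.
```
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=600)

# Proper train / calibration split.
X_tr, y_tr = X[:400], y[:400]
X_cal, y_cal = X[400:], y[400:]

model = LinearRegression().fit(X_tr, y_tr)
scores = np.abs(y_cal - model.predict(X_cal))  # nonconformity scores

alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)  # conformal quantile

X_new = np.array([[0.5]])
pred = model.predict(X_new)
print(pred - q, pred + q)  # ~90% coverage interval
```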
    A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery. (arXiv:2206.11706v1 [eess.AS])
Latent Dirichlet allocation (LDA) is widely used for unsupervised topic modelling on sets of documents. No temporal information is used in the model. However, there is often a relationship between the corresponding topics of consecutive tokens. In this paper, we present an extension to LDA that uses a Markov chain to model temporal information. We use this new model for acoustic unit discovery from speech. As input tokens, the model takes a discretised encoding of speech from a vector quantised (VQ) neural network with 512 codes. The goal is then to map these 512 VQ codes to 50 phone-like units (topics) in order to more closely resemble true phones. In contrast to the base LDA, which only considers how VQ codes co-occur within utterances (documents), the Markov chain LDA additionally captures how consecutive codes follow one another. This extension improves cluster quality and phone segmentation compared to the base LDA. Compared to a recent vector quantised neural network approach that also learns 50 units, the extended LDA model performs better in phone segmentation but worse in mutual information.  ( 2 min )
    Backward baselines: Is your model predicting the past?. (arXiv:2206.11673v1 [cs.LG])
    When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to which extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system's predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.  ( 2 min )
    Invariant Causal Mechanisms through Distribution Matching. (arXiv:2206.11646v1 [cs.LG])
Learning representations that capture the underlying data generating process is a key problem for data efficient and robust use of neural networks. One key property for robustness that the learned representation should capture, and which has recently received a lot of attention, is invariance. In this work we provide a causal perspective and a new algorithm for learning invariant representations. Empirically we show that this algorithm works well on a diverse set of tasks, and in particular we observe state-of-the-art performance on domain generalization, where we are able to significantly boost the score of existing models.  ( 2 min )
    A Geometric Method for Improved Uncertainty Estimation in Real-time. (arXiv:2206.11562v1 [cs.LG])
Machine learning classifiers are probabilistic in nature and thus inevitably involve uncertainty. Predicting the probability that a specific prediction is correct is called uncertainty (or confidence) estimation and is crucial for risk management. Post-hoc model calibration can improve a model's uncertainty estimation without retraining and without changing the model. Our work puts forward a geometric approach to uncertainty estimation. Roughly speaking, we use the geometric distance of the current input from the existing training inputs as a signal for estimating uncertainty, and then calibrate that signal (instead of the model's estimate) using standard post-hoc calibration techniques. We show that our method yields better uncertainty estimates than recently proposed approaches through extensive evaluation on multiple datasets and models. In addition, we demonstrate the possibility of applying our approach in near real-time applications. Our code is available on GitHub: https://github.com/NoSleepDeveloper/Geometric-Calibrator.  ( 2 min )
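A hedged illustration of the idea (the logistic-calibration choice and constants are assumptions, not the paper's exact method): use the mean distance to the k nearest training points as the uncertainty signal and calibrate it post hoc against validation correctness.
```
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

def fit_geometric_calibrator(X_train, X_val, correct_val, k=5):
    # correct_val: 1 if the model's prediction on the i-th validation
    # point was correct, else 0. Returns a P(correct | distance) mapping.
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    d_val = nn.kneighbors(X_val)[0].mean(axis=1).reshape(-1, 1)
    calib = LogisticRegression().fit(d_val, correct_val)

    def confidence(X_test):
        d = nn.kneighbors(X_test)[0].mean(axis=1).reshape(-1, 1)
        return calib.predict_proba(d)[:, 1]

    return confidence
```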

  • Open

    [Project] Semantic Search powerup for Ctrl+F
    Hi Reddit! Scout Search is a project I've been working on as a Find-in-Page replacement. It uses a semantic search engine (rather than character matching) to help you find what you're looking for on websites. Try it out and let me know what you think. https://chrome.google.com/webstore/detail/scout-search/hgljpodblkjjklailoaefokflfdeffdl submitted by /u/scoutsearchteam [link] [comments]  ( 83 min )
    [D] CVPR wants to penalize reviewers for violating the reviewer guideline!
I cannot believe that CVPR put this motion up for a vote: Motion 3: "Any reviewer who has accepted an invitation to review but violates the reviewing guidelines set forth by the conference will be prohibited from submitting any papers to CVPR for up to two years." Reviewing is a community service, and although I have encountered bad and unfair reviews multiple times, I don't think such a wild action is the way to improve the quality of the review process. Let's start by training and choosing qualified ACs and meta-ACs who can properly oversee the review process, pick suitable reviewers, and take action during the rebuttal process. If this goes through, I would never review for CVPR again. https://mobile.twitter.com/KostasPenn/status/1539805992145358850 submitted by /u/aifordummies [link] [comments]  ( 89 min )
    [P] Farewell, CUDA OOM: Automatic Gradient Accumulation
Hey everyone, If you've trained a lot of neural nets, you probably know the pain of getting CUDA OOM errors and iteratively tuning your batch size to avoid them. Which is why I'm excited to announce that we (MosaicML) just released an automatic way to avoid these errors. Namely, we just added automatic gradient accumulation to Composer, our open source library for faster and easier neural net training. If you're not familiar with gradient accumulation, it's like tuning the batch size, but without messing with the optimization (aside from slightly different BatchNorm stats). This lets you avoid tuning learning rate, weight decay, etc. based on how much memory your GPU has or how many GPUs you're training on. What's nice about the *automatic* gradient accumulation in Composer is that you just set the batch size and hparams once and you're done; no need to tune the gradient accumulation manually. More info in our blog post, and special thanks to Mihir Patel for building most of this. Happy to answer questions! submitted by /u/ffast-math [link] [comments]  ( 85 min )
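For readers unfamiliar with the mechanism, a generic manual gradient-accumulation loop in PyTorch looks roughly like this (a sketch of the concept, not Composer's automatic implementation):
```
import torch

def train_epoch(model, loader, optimizer, loss_fn, accum_steps=4):
    # Gradients from accum_steps micro-batches are summed before one
    # optimizer step, so the effective batch is micro_batch * accum_steps.
    model.train()
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y) / accum_steps  # keep gradient scale constant
        loss.backward()                            # accumulates into .grad
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```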
    [P] HyperImpute: sklearn-style library for handling missing data using novel algorithms
There are many data imputation algorithms for machine learning. However, benchmarking them can be complicated, mainly because most implementations stay just as research code to reproduce the experiments in the papers. Moreover, when dealing with tabular data, you need to handle continuous/discrete/categorical data correctly -- not just let some regressor approximate everything. HyperImpute is a library that should make it easy to benchmark new imputation algorithms while offering several state-of-the-art models. For example, imputing using MIWAE can be done as easily as this:
```
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
plugin = Imputers().get("miwae")
out = plugin.fit_transform(X.copy())
out
```
Bonus, it can be easily plugged into sklearn pipelines. Github page: https://github.com/vanderschaarlab/hyperimpute submitted by /u/ManagementBig2995 [link] [comments]  ( 84 min )
    [D] "Wrapping" effects when using diffusion model to generate samples?
I've recently been training a latent diffusion model (it operates on the latent space of a VQ-VAE), and I'm finding that my generated samples have "wrapping" effects, i.e., when I generate a face, it wraps around (the bottom half of the face appears in the top half of the image and vice versa). It's worth noting that these halves don't always seem like they belong together, but they individually look quite realistic. I've checked my training data, and there are absolutely no training samples that exhibit this behaviour, so my model never sees images with this wrapping effect. What could be causing this? submitted by /u/Pedimus [link] [comments]  ( 84 min )
    [R] Learning to Play Minecraft with Video PreTraining (VPT)
    OpenAI Blog: Learning to Play Minecraft with Video PreTraining (VPT) OpenAI gathered a large dataset of human Minecraft demonstrations and trained an Inverse Dynamics Model (IDM) transformer that predicts actions based on past and future frames using a dataset of human demonstrations. They used this model to label 70k hours of video, which is used to train a Video PreTraining (VPT) model, which predicts actions based on past frames alone, using behavioral cloning (i.e. supervised learning). They can then fine-tune the VPT via behavioral cloning on narrower datasets or RL (with a hand-designed reward function that rewards the agent for going deeper into the tech tree or obtaining materials that could lead to a diamond pickaxe) and are able to train an agent that can craft a diamond pickaxe in 2.5% of its 10-minute long episodes. submitted by /u/gambs [link] [comments]  ( 85 min )
    [P] AutoRegistry: A Python library for mapping names to functionality to simplify project configurations.
A common design pattern I see in a lot of ML projects is to have some sort of experiment configuration file, and then a bunch of code that constructs the appropriate objects based on these configurations. Frequently, the resulting code blocks have a bunch of if/elif/else statements, or a manually created lookup dictionary somewhere. This can quickly get messy and inconsistent as you add new models/losses/encoders/optimizers. AutoRegistry is a library that makes all of these lookups more organized and terse. For example, let's say you want to configure a backbone to be either "resnet34" or "resnet50". Your code could look something like this (mimicking torchvision code) using a decorator:
```
from autoregistry import Registry

models = Registry()

@models
def resnet34(*, weights: Optional[ResNet34_Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
    return _resnet(BasicBlock, [3, 4, 6, 3], weights, progress, **kwargs)

@models
def resnet50(*, weights: Optional[ResNet50_Weights] = None, progress: bool = True, **kwargs: Any) -> ResNet:
    return _resnet(Bottleneck, [3, 4, 6, 3], weights, progress, **kwargs)

# Create a model based off of some configuration dictionary.
model_config = copy(config["model"])
model_type = model_config.pop("type")
model = models[model_type](**model_config)
```
or, class-based inheritance (uses metaclasses internally):
```
class BaseModel(nn.Module, Registry):
    pass

class MyNewModel(BaseModel):
    pass

class SomeOtherModel(BaseModel):
    pass

# Stringified keys are automatically derived.
my_new_model = BaseModel["mynewmodel"](**config)
some_other_model = BaseModel["someothermodel"](**config)
```
Github Page: https://github.com/BrianPugh/autoregistry submitted by /u/guyfrom7up [link] [comments]  ( 84 min )
    [P] Reverse Engineering Google Colab
    Hi! I've spent a lot of time working with Google Colab recently, and was disappointed that such a powerful platform was limited to only running Jupyter notebooks. So I took a deep dive into the internals of Colab, discovering tons of interesting hidden features! Take a look at what I found! submitted by /u/vikarjramun [link] [comments]  ( 84 min )
    [R] Can interpretability improve model accuracy?!
Deep learning models are often complex and mostly uninterpretable.
• One strategy is to learn the nonlinear relations of features, but there are so many features to learn from.
• Research shows that a set of important features can improve the learning process.
• So let's focus on the most correlated features.
Paper📜: https://arxiv.org/abs/2203.04383 submitted by /u/AshkanF [link] [comments]  ( 83 min )
    [P] Data search engine for ML in Binder
    Open source data search engine for ML. Binder link: https://mybinder.org/v2/gh/upgini/upgini/main?urlpath=notebooks%2Fnotebooks%2Fkaggle_example.ipynb Colab link: https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb Github: https://github.com/upgini/upgini submitted by /u/AnnualLimp1418 [link] [comments]  ( 83 min )
    [D] How Imagen Actually Works
Hey everyone! I wrote this article explaining how Imagen actually works, with a general overview for the big picture ideas and a Deep Dive to get into the nitty-gritty. I'm happy to answer any questions, let me know what you think! submitted by /u/SleekEagle [link] [comments]  ( 84 min )
    State of the art 2D body pose estimation [Discussion]
Hi. I have a background in neuroscience, and sometimes we use DeepLabCut to track animals during behaviour. This is by far the most widespread application for animal tracking based on artificial neural networks. I was wondering if anyone here is an expert in human 2D body pose estimation and can share their opinion on the best human 2D pose estimation tool currently available. I came across Pose from mediapipe and it seems very good from a few examples I tested so far, but I'm curious if there's something even better that I have not come across. Thanks for the help! submitted by /u/lux123or [link] [comments]  ( 84 min )
    [D] [P] A TensorFlow Re-Implementation of CheXNet - Classification and Localization of Thoracic Diseases
TL;DR: need help making heatmaps! [Repository|Colab Notebook] Hey everyone - I've been working to reproduce CheXNet - a fantastic paper describing research on a model capable of radiologist-grade pathology classification! CheXNet uses Class Activation Mappings (CAMs for short) to generate heatmaps that identify which parts of the image the model uses to base its classification on. In my case, I'm facing a bit of a struggle reproducing them - as shown in the image below, most of our classifications are derived from the diaphragm instead of regions within the lung. Curiously, we are attaining a reasonable AUROC, with .773 on training and .749 on validation data - the paper reports .8062 AUROC. My current model is being trained on a subsample of the main dataset, and I'm basically looking to this as a way to validate the architecture. I'd love to know if anyone has experienced similar issues and solved them, and could have any input here as well. If you have a moment to spare - I'd be super grateful for some help from the r/MachineLearning community in solving the inaccurate localization issue - #58! Fig 1. An incorrect localization, despite a correct classification. submitted by /u/codeinassembly [link] [comments]  ( 85 min )
    [P] Yandex open sources 100b large language model weights (YaLM)
    PR Announcement: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6 Github: https://github.com/yandex/YaLM-100B Network is trained using same principles as Megatron LM, inference alone will require 4 A100s submitted by /u/htrp [link] [comments]  ( 88 min )
    [N] Microsoft released a DirectML Plugin for TensorFlow 2
    The plugin provides a DirectML PluggableDevice backend for TensorFlow 2, so any GPU which supports DirectX 12 should be able to work with TF2. Hopefully this will pave the way for more support for non-NVIDIA GPUs in ML. They provide some more details (installation, code samples, etc') in the Windows AI devblog. submitted by /u/chromeplated [link] [comments]  ( 84 min )
    [D] Do any Text-to-Image approaches work well with long complex prompts (i.e. paragraph or book chapter scale)?
    Seems almost all the examples of text-to-image are based on tiny prompts with very few details ("avocado chair"). Do any such systems do a good job at keeping track of details - like the first 2 paragraphs of The Hobbit and correctly place the "polished chairs", "pegs for hats and coats", and "deep-set round windows looking over his garden, and meadows beyond, sloping down to the river"? Assuming they don't - what approach(es) might make sense to design such systems? I'm speculating that you'd need much larger embedding vectors (to correctly connect concepts from the right adjectives to the right nouns); and it'd be harder to find training data (perhaps frames of movies from novels would be a good source)? Any pointers to anything in that direction? submitted by /u/Appropriate_Ant_4629 [link] [comments]  ( 85 min )
    [Project] h5 model to onnx model in JAVA
    I have a trained model (.h5) saved. I need to do the following in Java. Can I load this model and then convert it to an onnx model and save that onnx model? Any lead is appreciated! submitted by /u/Negative_Internet514 [link] [comments]  ( 83 min )
    [Discussion]
Part of my graduation project is to classify body organs such as the heart and liver. I searched a lot and did not find anything, so I decided to use 3D models of the body organs and start collecting a dataset. I collected about 100,000 images for each of the 4 organs. My question is whether this data has any value, in other words, whether it is worth putting on Kaggle or another website, or has no value? submitted by /u/NourOmran [link] [comments]  ( 84 min )
    [D] Implementing custom functions in pytorch e.g. feature propagation (PointNet++)
Apologies if this isn't the right place to ask, but I'm currently studying point cloud-based networks like PointNet++, and the related 3D object detection networks like PointPillars, VoxelNet, etc. While I (think I) understand algorithms like feature propagation in PointNet++, I'm having trouble understanding how one would implement them. Where could I learn about writing operations in CUDA and making sure they are compatible with backprop? submitted by /u/wowAmaze [link] [comments]  ( 84 min )
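Not the PointNet++ code itself, but a minimal sketch of the PyTorch pattern such ops follow: subclass torch.autograd.Function and provide matching forward/backward. In real point-cloud libraries both bodies are CUDA kernels bound through a C++ extension; the toy op and names below are illustrative.
```
import torch

class ScaledSquare(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return scale * x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d/dx (scale * x^2) = 2 * scale * x; None for the non-tensor arg.
        return grad_out * 2 * ctx.scale * x, None

x = torch.randn(4, requires_grad=True)
y = ScaledSquare.apply(x, 3.0).sum()
y.backward()
print(torch.allclose(x.grad, 6 * x))  # gradient check against 2*scale*x
```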
  • Open

    NVIDIA’s GANCraft AI: Feels Like Magic! 🌴
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    Should sentient Artificial intelligence be legally protected?
    submitted by /u/Tell_Nervous [link] [comments]  ( 83 min )
    Why Google’s LaMDA AI is conscious: Suspended Google engineer Blake Lemoine speaks out in first podcast interview
    submitted by /u/DrJamesCooke [link] [comments]  ( 83 min )
    Have you ever used an AI text-to-image generator? [Short survey]
    submitted by /u/KazRainer [link] [comments]  ( 82 min )
    DALL-E 2 could become OpenAI's first money printing machine
    submitted by /u/much_successes [link] [comments]  ( 83 min )
How does an optical quantum neural network work? Thanks
    submitted by /u/OneFinding1429 [link] [comments]  ( 83 min )
Some hellish art I prompted from dall-e mini in the style of one of my favorite artists.
    submitted by /u/SuperCasualGamerDad [link] [comments]  ( 82 min )
    Dalle2 Prompts
    submitted by /u/KrinoDaGamer [link] [comments]  ( 82 min )
    AI can predict your political ideology using just a brain scan
    submitted by /u/nagual901 [link] [comments]  ( 83 min )
    Hey, guys! I am new to Face AI and computer vision and planning to build a lie detector using Face AI technology. Would it be possible? Is anyone already doing this?
    submitted by /u/adilonreddit1 [link] [comments]  ( 83 min )
    Using Craiyon, I made the first image. I then put that image into Starryai, and made the second image. AI art inception
    submitted by /u/VastlyArtistic [link] [comments]  ( 82 min )
    We have AI generated art now. We have AI generated conversation. But where are the AI generated music compositions?
AI-generated images from text prompts are making the rounds with DALL-E mini and DALL-E 2. These systems are so powerful that people are admitting they cannot tell real from fake images anymore. Google's LaMDA is producing conversational text chats so realistic that they spawned entire subreddits where users claim the software agent has become sentient. So where is the instrumental and orchestral music that is indistinguishable from human composers? In recent months I have heard some song continuations, where an AI was trained on the waveform of popular music and asked to continue it. Those were fine, but ended up sounding like strange incoherent fever dreams. I fiddled with some MIDI-like continuations on a website. The output was janky, repetitive, and obviously computer-…  ( 98 min )
    Latest AI tools in different languages?
    Hi there, There are many amazing tools powered by AI or ML but most of them are available only in English. How hard would it be to adapt them to my own language, which is not English? Google translate doesn't do a very good job translating.... Thanks! submitted by /u/decixl [link] [comments]  ( 83 min )
    NEON PSYHEDELIC TEMPLES | FAST MODE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    deep reinforcment learning for games
Hey everyone, I have some questions that I hope you can help me with: I'm looking for resources on reinforcement and deep reinforcement learning, and I want to know if any of you have implemented RL within a 3D game and can give me some advice on how that works (how to make the agent understand its environment in 3D, and so forth). Thanks! submitted by /u/naffra [link] [comments]  ( 83 min )
    How to correctly pass history into Gym observation space?
I'm new to reinforcement learning, but from research have found that SOTA algorithms, whether value- or policy-based, are not able to gracefully ignore irrelevant information in the observation space - https://arxiv.org/pdf/2011.00756.pdf. I need to keep track of history to handle performance and correctly implement actions; some of these values directly correspond to decision making, but most correspond to environment inspection (e.g. monitoring performance in a more human-friendly fashion). From various sources, I've found that by keeping this copy of history, I'm still able to maintain the Markov assumption, but have found limited practical examples. Specifically, should I maintain history outside of the observation space, just as an environment instance variable, or can/should the histor…  ( 88 min )
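One common pattern (a hedged sketch; the window size and toy environment are illustrative) is to keep decision-relevant history inside an ObservationWrapper, while anything used only for monitoring stays in plain instance variables outside the observation space:
```
import numpy as np
import gym
from collections import deque

class HistoryWrapper(gym.ObservationWrapper):
    # Stacks the last k observations so the policy can see recent history.
    def __init__(self, env, k=4):
        super().__init__(env)
        self.k = k
        self.frames = deque(maxlen=k)
        low = np.tile(env.observation_space.low, k)
        high = np.tile(env.observation_space.high, k)
        self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.frames.extend([obs] * self.k)
        return np.concatenate(self.frames).astype(np.float32)

    def observation(self, obs):
        self.frames.append(obs)
        return np.concatenate(self.frames).astype(np.float32)

env = HistoryWrapper(gym.make("CartPole-v1"), k=4)
obs = env.reset()  # shape (16,): 4 stacked 4-dim CartPole observations
```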
    Effectiveness of Q learning for two player games?
Essentially, what I'm asking is whether a single Q-learning model (table or neural network) trained against itself in an environment can learn to perform optimally in the general case of the environment at hand. For instance, against a random player and a decent player, can Q-learning perform well or optimally after doing the initial training against itself? I've tried implementing it with tic tac toe and it seems to give decent but not amazing results. I want to at least know if my fundamental approach is appropriate, so I can resolve any other bugs due to implementation. I use a single Q-table by switching the markings of the board for each player (X vs O) and then using this as a key to look in the Q-table. Essentially, the table is not directly in X/O form, but is expressed as the agent vs the adversary. The next state for the table is not the board after the agent places a mark, but the board after the adversary responds. I'm assuming this would simply be the general stochastic MDP in the way I've framed the problem? Perhaps I need the learning rate of the Q-table to be decreasing in the fashion needed for value iteration (the sum of rates is infinite, the sum of squared rates is finite)? I've tried various values of epsilon for exploration vs exploitation. Any help would be much appreciated! submitted by /u/Spiritual_Dinner9232 [link] [comments]  ( 83 min )
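For reference, a minimal sketch of the perspective-switching update described above (names and constants are illustrative, not the poster's code):
```
from collections import defaultdict
import random

Q = defaultdict(float)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def canonical(board, player):
    # Re-mark the board as "me" (+1) vs "opponent" (-1) for the side to move.
    return tuple(cell * player for cell in board)

def choose_action(board, player, legal_moves):
    s = canonical(board, player)
    if random.random() < EPS:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda a: Q[(s, a)])

def update(board, player, action, reward, next_board, next_moves, done):
    # next_board is the position after the adversary's reply, as in the post.
    s = canonical(board, player)
    target = reward
    if not done:
        target += GAMMA * max(Q[(canonical(next_board, player), a)]
                              for a in next_moves)
    Q[(s, action)] += ALPHA * (target - Q[(s, action)])
```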
Is PPO still SOTA in 2022?
    Hello guys, I was wondering if PPO was still the most broadly used algorithm for continuous control in 2022? submitted by /u/Jogima-cyber [link] [comments]  ( 83 min )
    DeepMind Researchers Develop ‘BYOL-Explore’: A Curiosity-Driven Exploration Algorithm That Harnesses The Power Of Self-Supervised Learning To Solve Sparse-Reward Partially-Observable Tasks
Reinforcement learning (RL) requires exploration of the environment. Exploration is even more critical when extrinsic incentives are few or difficult to obtain. Due to the massive size of the environment, it is impractical to visit every location in rich settings, given the range of helpful exploration paths. Consequently, the question is: how can an agent decide which areas of the environment are worth exploring? Curiosity-driven exploration is a viable approach to tackle this problem. It entails (i) learning a world model, a predictive model of specific knowledge about the world, and (ii) exploiting disparities between the world model's predictions and experience to create intrinsic rewards. An RL agent that maximizes these intrinsic incentives steers itself toward situations where the world model is unreliable or unsatisfactory, creating new paths for the world model. In other words, the quality of the exploration policy is influenced by the characteristics of the world model, which in turn helps the world model by collecting new data. Therefore, it might be crucial to approach learning the world model and learning the exploratory policy as one cohesive problem to be solved, rather than two separate tasks. With this in mind, DeepMind researchers introduced a curiosity-driven exploration algorithm, BYOL-Explore. Its attraction stems from its conceptual simplicity, generality, and excellent performance. Continue reading | Checkout the paper, blog post submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 84 min )
    Combining dynamic movement primitives to create new ones
This is my very first question and I want to thank everyone for the massive contribution to the community! On to my question now; here is a quick definition of my idea/problem: I have created a "set of knowledge" from dynamic movement primitives, [d1, d2, ..., dn], for the exact same task but with slightly different scenario characteristics every time. Given this "set of knowledge" of DMPs for different scenarios, and the characteristics of a completely new scenario, how can I create a new DMP for the new scenario using the existing "knowledge"? I was thinking of representing the weights of the DMPs as Gaussians, applying weights to each of the Gaussians, and running an evolutionary algorithm to update the weights and keep the most impactful DMPs. Please feel free to propose any other ideas, papers, or techniques that could help me approach this problem. Thank you in advance. submitted by /u/Stelios_ml [link] [comments]  ( 84 min )
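One possible instantiation of the idea (an assumption for illustration, not an established method): synthesize the forcing-term weights for the new scenario as a Gaussian-kernel-weighted combination of the stored DMP weights, keyed by scenario descriptors.
```
import numpy as np

def combine_dmps(scenarios, weight_sets, c_new, bandwidth=1.0):
    # scenarios: (n, p) descriptors of the stored scenarios;
    # weight_sets: (n, m) forcing-term weights, one row per stored DMP;
    # c_new: (p,) descriptor of the new scenario.
    d2 = np.sum((scenarios - c_new) ** 2, axis=1)
    k = np.exp(-d2 / (2 * bandwidth ** 2))  # similarity to each stored DMP
    k /= k.sum()
    return k @ weight_sets                  # convex combination of weights

# Usage: the returned vector parameterizes the forcing term of the new DMP.
```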
    An introduction to ML-Agents with Hugging Face 🤗 (Deep Reinforcement Learning Free Class)
Hey there! I'm happy to announce that we just published a new tutorial on ML-Agents (a library containing environments made with Unity). In fact, at Hugging Face, we created a new ML-Agents version where: - You don't need to install Unity or know how to use the Unity Editor. - You can publish your models to the Hugging Face Hub for free. - You can visualize your agent playing directly on your browser 👀. So in this tutorial, you’ll train an agent that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top. The tutorial 👉 https://medium.com/p/efbac62c8c80 Do you just want to play with some trained agents? We have live demos you can try 🔥: - Worm 🐍: https://huggingface.co/spaces/unity/ML-Agents-Worm - PushBlock 🧊: https://huggingface.co/spaces/unity/ML-Agents-PushBlock - Pyramids 🏆: https://huggingface.co/spaces/unity/ML-Agents-Pyramids - Walker 🚶: https://huggingface.co/spaces/unity/ML-Agents-Walker If you have questions and feedback, I would love to answer them. Keep Learning, Stay awesome 🤗 submitted by /u/cranthir_ [link] [comments]  ( 83 min )
    I have an idea that makes sense but is not working :/
Hello Hello, Heads up, it might sound complicated but it is a simple idea. I have an RL agent trying to solve a certain problem, training with PPO. I also have an expert, i.e. an agent that already knows how to tackle the given problem. I am assuming that in simulations I have access to the expert policy (meaning I can easily generate trajectories using the expert). I am trying to use the expert to speed up the learning of my agent. A pseudo-code of my "act" function is sketched below. So basically, if use_expert is 0, nothing is new; it is the normal act function where the agent gets actions based on its own actor network. If use_expert is 1, the only difference is that the agent no longer samples actions from its own actor, but takes the action suggested by the expert. Since PPO requires logprobs, I still get the logprob from the agent's own distribution, but using the action suggested by the expert. My main aim here is that, if I introduce this for a small portion of the learning, my agent will be exposed to more rewarding experiences, and hopefully learn faster. I have a hyperparameter (expert_rate) that determines how frequently I use expert actions in my learning (how frequently I set use_expert to 1). However, this doesn't seem to be working. As a matter of fact, for fun I set expert_rate to 100% (i.e. the agent always acts in the environment based on the expert suggestions), and I notice no learning whatsoever. I am already familiar with the works that try to incorporate imitation learning with RL, but I'm trying to avoid using imitation learning (issues related to the problem I'm solving). Any idea what could be the problem? submitted by /u/AhmedNizam_ [link] [comments]  ( 86 min )
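(A hedged reconstruction of the act function described above, since the original pseudo-code image is unavailable; the distribution class and network call signatures are assumptions:)
```
from torch.distributions import Categorical

def act(actor, expert, state, use_expert):
    dist = Categorical(logits=actor(state))
    if use_expert:
        action = expert(state)   # expert suggests the action
    else:
        action = dist.sample()   # agent samples from its own policy
    # PPO still needs the log-prob under the agent's current policy,
    # evaluated at whichever action was actually taken.
    logprob = dist.log_prob(action)
    return action, logprob
```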
  • Open

    Family Style: Li Auto L9 Brings Top-Line Luxury and Intelligence to Full-Size SUV With NVIDIA DRIVE Orin
    Finally, there’s a family car any kid would want to be seen in. Beijing-based startup Li Auto this week rolled out its second electric vehicle, the L9. It’s a full-size SUV decked out with the latest intelligent driving technology. With AI features and an extended battery range of more than 800 miles, the L9 promises Read article > The post Family Style: Li Auto L9 Brings Top-Line Luxury and Intelligence to Full-Size SUV With NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 5 min )
    Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs
    Thanks to the GeForce cloud, even Mac users can be PC gamers. This GFN Thursday, fire up your Macbook and get your game on. This week brings eight more games to the GeForce NOW library. Plus, members can play Genshin Impact and claim a reward to start them out on their journeys streaming on GeForce Read article > The post Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Import data from cross-account Amazon Redshift in Amazon SageMaker Data Wrangler for exploratory data analysis and data preparation
    Organizations moving towards a data-driven culture embrace the use of data and machine learning (ML) in decision-making. To make ML-based decisions from data, you need your data available, accessible, clean, and in the right format to train ML models. Organizations with a multi-account architecture want to avoid situations where they must extract data from one […]  ( 7 min )
    Predict types of machine failures with no-code machine learning using Amazon SageMaker Canvas
    Predicting common machine failure types is critical in manufacturing industries. Given a set of characteristics of a product that is tied to a given type of failure, you can develop a model that can predict the failure type when you feed those attributes to a machine learning (ML) model. ML can help with insights, but […]  ( 10 min )
  • Open

    Learning to Play Minecraft with Video PreTraining (VPT)
    We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over  ( 8 min )
  • Open

    Robots play with play dough
    A new system lets robots manipulate soft, deformable material into various shapes from visual inputs, which could one day enable better home assistants.  ( 6 min )
  • Open

    GODEL: Combining goal-oriented dialog with real-world conversations
    They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation […] The post GODEL: Combining goal-oriented dialog with real-world conversations appeared first on Microsoft Research.  ( 11 min )
  • Open

    The quality of an RNG depends on the application
    A random number generator can be good for some purposes and not for others. This isn’t surprising given the fundamentally impossible task such generators are supposed to perform. Technically a random number generator is a pseudo random number generator because it cannot produce random numbers. But random is as random does, and for many purposes […] The quality of an RNG depends on the application first appeared on John D. Cook.  ( 6 min )
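    As a small illustration of the point (this example is mine, not from the post): Python's default generator, a Mersenne Twister, is well suited to simulation and statistics but is predictable and therefore unsuitable for security work, where the standard library's secrets module is the right tool.

    ```python
    import random
    import secrets

    # Fine for simulation / statistics: fast and reproducible, but predictable.
    sim_rng = random.Random(42)
    sample = [sim_rng.random() for _ in range(5)]

    # For tokens, keys, or anything adversarial, use a CSPRNG instead.
    token = secrets.token_hex(16)
    ```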
  • Open

    How AI is Stopping Money Laundering
    Anti-money laundering (AML) and know-your-customer (KYC) compliance might be transformed by artificial intelligence (AI). AI systems can also mine the vast amounts of data gathered through KYC verification for risk-relevant information, making it easier to identify high-risk clients for anti-money-laundering purposes. AI is beneficial when completing repetitive activities since it saves time, effort, and resources that… Read More »How AI is Stopping Money Laundering The post How AI is Stopping Money Laundering appeared first on Data Science Central.  ( 19 min )
    Basic E-Discovery Concepts Every Attorney Should Know
    Electronic Discovery or E-Discovery is a process wherein electronic data is found, secured, and then searched in order to find effective evidence during criminal or civil legal procedures. Electronic discovery can also be carried out without Internet connectivity, from a local computer. Government or court-ordered hacking in order to get critical information as… Read More »Basic E-Discovery Concepts Every Attorney Should Know The post Basic E-Discovery Concepts Every Attorney Should Know appeared first on Data Science Central.  ( 18 min )
    Value of Real-Time Data Visualization and Interpretation
    Representation of data using graphics such as charts, plots, infographics, heat maps, bubble clouds, scatter plots, and Mekko charts is referred to as data visualization. Such visual displays and representations of information help communicate complex data relationships and data-driven insights in a way that makes them easy to understand and base decisions on. The goal of… Read More »Value of Real-Time Data Visualization and Interpretation The post Value of Real-Time Data Visualization and Interpretation appeared first on Data Science Central.  ( 20 min )
    Blueprint for Building a Data Product Business
    Note: I got feedback that my Data Product Blueprint process in Figure 7 was waterfall, not agile.  Totally agree and that’s my bad.  I’ve updated the image and will release a future blog to address the questions that I got about that process.  Thanks for your feedback! A Blueprint is a detailed design plan of… Read More »Blueprint for Building a Data Product Business The post Blueprint for Building a Data Product Business appeared first on Data Science Central.  ( 21 min )
    AI Goes Mainstream
    The initial uptake of AI was within financial services – that still continues, but we are now seeing adoption beyond the traditional industries dominated by AI. The CB Insights AI 100 is an annual list of interesting AI companies. This year, I saw companies applying AI to nontraditional sectors. These areas are relatively hard to acquire data for at… Read More »AI Goes Mainstream The post AI Goes Mainstream appeared first on Data Science Central.  ( 18 min )
  • Open

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code. (arXiv:2206.11249v1 [cs.CL])
    Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.  ( 3 min )
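    If the "single line of code" works the way comparable benchmarks do, loading a GEMv2 dataset might look like the following sketch using the Hugging Face datasets library; the dataset identifier here is an assumption for illustration, not taken from the abstract:

    ```python
    from datasets import load_dataset

    # Hypothetical GEM dataset id; assumes GEMv2 datasets are exposed
    # on the Hugging Face Hub under the GEM namespace.
    data = load_dataset("GEM/web_nlg_en", split="validation")
    print(data[0])
    ```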
    Learning Monotone Dynamics by Neural Networks. (arXiv:2006.06417v2 [cs.LG] UPDATED)
    Feed-forward neural networks (FNNs) work as standard building blocks in applying artificial intelligence (AI) to the physical world. They allow learning the dynamics of unknown physical systems (e.g., biological and chemical) to predict their future behavior. However, they are likely to violate the physical constraints of those systems without proper treatment. This work focuses on imposing two important physical constraints: monotonicity (i.e., a partial order of system states is preserved over time) and stability (i.e., the system states converge over time) when using FNNs to learn physical dynamics. For monotonicity constraints, we propose to use nonnegative neural networks and batch normalization. For both monotonicity and stability constraints, we propose to learn the system dynamics and corresponding Lyapunov function simultaneously. As demonstrated by case studies, our methods can preserve the stability and monotonicity of FNNs and significantly reduce their prediction errors.  ( 2 min )
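    To illustrate the nonnegative-weight idea (a sketch of the general technique, not necessarily the paper's exact architecture): if every effective weight is forced to be nonnegative, e.g. via a softplus reparameterization, and the activations are monotone, the whole network is monotone (non-decreasing) in each input.

    ```python
    import torch
    import torch.nn.functional as F

    class NonnegLinear(torch.nn.Module):
        """Linear layer whose effective weights are forced to be nonnegative."""
        def __init__(self, d_in, d_out):
            super().__init__()
            self.raw_weight = torch.nn.Parameter(torch.randn(d_out, d_in))
            self.bias = torch.nn.Parameter(torch.zeros(d_out))

        def forward(self, x):
            # softplus keeps the effective weights >= 0; combined with
            # monotone activations, the network is monotone in every input.
            return F.linear(x, F.softplus(self.raw_weight), self.bias)

    monotone_fnn = torch.nn.Sequential(
        NonnegLinear(3, 16), torch.nn.Tanh(), NonnegLinear(16, 3))
    ```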
    AlphaMLDigger: A Novel Machine Learning Solution to Explore Excess Return on Investment. (arXiv:2206.11072v1 [q-fin.CP])
    How to quickly and automatically mine useful information to support investment decisions has attracted growing attention from academia and industry, and the global pandemic has raised new challenges. This paper proposes a two-phase AlphaMLDigger that effectively finds excess returns in a highly fluctuating market. In phase 1, a deep sequential NLP model is proposed to map blog posts on Sina Microblog to market sentiment. In phase 2, the predicted market sentiment is combined with social network indicator features and stock market history features to predict stock movements with different machine learning models and optimizers. The results show that our AlphaMLDigger achieves higher accuracy on the test set than previous works and is robust to the negative impact of COVID-19 to some extent.  ( 2 min )
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v2 [math.OC] UPDATED)
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate compared to the best stabilizing linear controller in hindsight. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.  ( 2 min )
    Explainable Artificial Intelligence Methods in Combating Pandemics: A Systematic Review. (arXiv:2112.12705v3 [cs.AI] UPDATED)
    Despite the myriad peer-reviewed papers demonstrating novel Artificial Intelligence (AI)-based solutions to COVID-19 challenges during the pandemic, few have made significant clinical impact. The impact of artificial intelligence during the COVID-19 pandemic was greatly limited by lack of model transparency. This systematic review examines the use of Explainable Artificial Intelligence (XAI) during the pandemic and how its use could overcome barriers to real-world success. We find that successful use of XAI can improve model performance, instill trust in the end-user, and provide the value needed to affect user decision-making. We introduce the reader to common XAI techniques, their utility, and specific examples of their application. Evaluation of XAI results is also discussed as an important step to maximize the value of AI-based clinical decision support systems. We illustrate the classical, modern, and potential future trends of XAI to elucidate the evolution of novel XAI techniques. Finally, we provide a checklist of suggestions during the experimental design process supported by recent publications. Common challenges during the implementation of AI solutions are also addressed with specific examples of potential solutions. We hope this review may serve as a guide to improve the clinical impact of future AI-based solutions.  ( 3 min )
    Variational Causal Dynamics: Discovering Modular World Models from Interventions. (arXiv:2206.11131v1 [cs.LG])
    Latent world models allow agents to reason about complex environments with high-dimensional observations. However, adapting to new environments and effectively leveraging previous knowledge remain significant challenges. We present variational causal dynamics (VCD), a structured world model that exploits the invariance of causal mechanisms across environments to achieve fast and modular adaptation. By causally factorising a transition model, VCD is able to identify reusable components across different environments. This is achieved by combining causal discovery and variational inference to learn a latent representation and transition model jointly in an unsupervised manner. Specifically, we optimise the evidence lower bound jointly over a representation model and a transition model structured as a causal graphical model. In evaluations on simulated environments with state and image observations, we show that VCD is able to successfully identify causal variables, and to discover consistent causal structures across different environments. Moreover, given a small number of observations in a previously unseen, intervened environment, VCD is able to identify the sparse changes in the dynamics and to adapt efficiently. In doing so, VCD significantly extends the capabilities of the current state-of-the-art in latent world models while also comparing favourably in terms of prediction accuracy.  ( 2 min )
    Supervised Graph Contrastive Learning for Few-shot Node Classification. (arXiv:2203.15936v3 [cs.LG] UPDATED)
    Graphs are present in many real-world applications, such as financial fraud detection, commercial recommendation, and social network analysis. But given the high cost of graph annotation or labeling, we face a severe graph label-scarcity problem, i.e., a graph might have only a few labeled nodes. One example of such a problem is the so-called few-shot node classification. A predominant approach to this problem resorts to episodic meta-learning. In this work, we challenge the status quo by asking a fundamental question: whether meta-learning is a must for few-shot node classification tasks. We propose a new and simple framework under the standard few-shot node classification setting as an alternative to meta-learning to learn an effective graph encoder. The framework consists of supervised graph contrastive learning with novel mechanisms for data augmentation, subgraph encoding, and multi-scale contrast on graphs. Extensive experiments on three benchmark datasets (CoraFull, Reddit, Ogbn) show that the new framework significantly outperforms state-of-the-art meta-learning based methods.  ( 2 min )
    Beyond Low-pass Filtering: Graph Convolutional Networks with Automatic Filtering. (arXiv:2107.04755v3 [cs.LG] UPDATED)
    Graph convolutional networks are becoming indispensable for deep learning from graph-structured data. Most of the existing graph convolutional networks share two big shortcomings. First, they are essentially low-pass filters, thus the potentially useful middle and high frequency band of graph signals are ignored. Second, the bandwidth of existing graph convolutional filters is fixed. Parameters of a graph convolutional filter only transform the graph inputs without changing the curvature of a graph convolutional filter function. In reality, we are uncertain about whether we should retain or cut off the frequency at a certain point unless we have expert domain knowledge. In this paper, we propose Automatic Graph Convolutional Networks (AutoGCN) to capture the full spectrum of graph signals and automatically update the bandwidth of graph convolutional filters. While it is based on graph spectral theory, our AutoGCN is also localized in space and has a spatial form. Experimental results show that AutoGCN achieves significant improvement over baseline methods which only work as low-pass filters.  ( 2 min )
    Nonparametric Multi-shape Modeling with Uncertainty Quantification. (arXiv:2206.09127v2 [stat.ML] UPDATED)
    The modeling and uncertainty quantification of closed curves is an important problem in the field of shape analysis, and can have significant ramifications for subsequent statistical tasks. Many of these tasks involve collections of closed curves, which often exhibit structural similarities at multiple levels. Modeling multiple closed curves in a way that efficiently incorporates such between-curve dependence remains a challenging problem. In this work, we propose and investigate a multiple-output (a.k.a. multi-output), multi-dimensional Gaussian process modeling framework. We illustrate the proposed methodological advances, and demonstrate the utility of meaningful uncertainty quantification, on several curve and shape-related tasks. This model-based approach not only addresses the problem of inference on closed curves (and their shapes) with kernel constructions, but also opens doors to nonparametric modeling of multi-level dependence for functional objects in general.  ( 2 min )
    Scalable and Efficient Training of Large Convolutional Neural Networks with Differential Privacy. (arXiv:2205.10683v2 [cs.LG] UPDATED)
    Large convolutional neural networks (CNN) can be difficult to train in the differentially private (DP) regime, since the optimization algorithms require a computationally expensive operation, known as the per-sample gradient clipping. We propose an efficient and scalable implementation of this clipping on convolutional layers, termed as the mixed ghost clipping, that significantly eases the private training in terms of both time and space complexities, without affecting the accuracy. The improvement in efficiency is rigorously studied through the first complexity analysis for the mixed ghost clipping and existing DP training algorithms. Extensive experiments on vision classification tasks, with large ResNet, VGG, and Vision Transformers, demonstrate that DP training with mixed ghost clipping adds $1\sim 10\%$ memory overhead and $<2\times$ slowdown to the standard non-private training. Specifically, when training VGG19 on CIFAR10, the mixed ghost clipping is $3\times$ faster than state-of-the-art Opacus library with $18\times$ larger maximum batch size. To emphasize the significance of efficient DP training on convolutional layers, we achieve 96.7% accuracy on CIFAR10 and 83.0% on CIFAR100 at $\epsilon=1$ using BEiT, while the previous best results are 94.8% and 67.4%, respectively. We open-source a privacy engine (https://github.com/JialinMao/private_CNN) that implements DP training of CNN with a few lines of code.  ( 2 min )
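    For context on the operation being accelerated, a naive per-sample gradient clipping step looks roughly like the following (a plain, unoptimized sketch of the standard DP-SGD clipping, not the paper's mixed ghost clipping):

    ```python
    import torch

    def clipped_grad_sum(model, loss_fn, xs, ys, clip_norm=1.0):
        """Naive per-sample clipping: one backward pass per example."""
        params = [p for p in model.parameters() if p.requires_grad]
        total = [torch.zeros_like(p) for p in params]
        for x, y in zip(xs, ys):
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            # Clip each example's gradient to a fixed L2 norm before summing.
            norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params))
            scale = min(1.0, clip_norm / (norm.item() + 1e-6))
            for acc, p in zip(total, params):
                acc += scale * p.grad
        return total  # add calibrated noise, then average over the batch
    ```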
    The Privacy Onion Effect: Memorization is Relative. (arXiv:2206.10469v2 [cs.LG] UPDATED)
    Machine learning models trained on private datasets have been shown to leak their private data. While recent work has found that the average data point is rarely leaked, the outlier samples are frequently subject to memorization and, consequently, privacy leakage. We demonstrate and analyse an Onion Effect of memorization: removing the "layer" of outlier points that are most vulnerable to a privacy attack exposes a new layer of previously-safe points to the same attack. We perform several experiments to study this effect, and understand why it occurs. The existence of this effect has various consequences. For example, it suggests that proposals to defend against memorization without training with rigorous privacy guarantees are unlikely to be effective. Further, it suggests that privacy-enhancing technologies such as machine unlearning could actually harm the privacy of other users.  ( 2 min )
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v2 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.  ( 2 min )
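    The conformal p-values the authors combine can be computed from a held-out calibration set; a minimal sketch of one such p-value follows (the standard split-conformal construction, with a generic score function left as an assumption — the paper's exact statistics and combination rule are more involved):

    ```python
    import numpy as np

    def conformal_p_value(cal_scores, test_score):
        """Split-conformal p-value: how extreme the test statistic is
        relative to n calibration scores (higher score = more OOD here)."""
        n = len(cal_scores)
        return (1.0 + np.sum(np.asarray(cal_scores) >= test_score)) / (n + 1.0)

    p = conformal_p_value(cal_scores=np.random.rand(100), test_score=0.97)
    ```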
    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses. (arXiv:2205.07704v2 [stat.ML] UPDATED)
    We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).  ( 2 min )
    Learning to Estimate and Refine Fluid Motion with Physical Dynamics. (arXiv:2206.10480v2 [cs.LG] UPDATED)
    Extracting information on fluid motion directly from images is challenging. Fluid flow represents a complex dynamic system governed by the Navier-Stokes equations. General optical flow methods are typically designed for rigid body motion, and thus struggle if applied to fluid motion estimation directly. Further, optical flow methods only focus on two consecutive frames without utilising historical temporal information, while the fluid motion (velocity field) can be considered a continuous trajectory constrained by time-dependent partial differential equations (PDEs). This discrepancy has the potential to induce physically inconsistent estimations. Here we propose an unsupervised learning based prediction-correction scheme for fluid flow estimation. An estimate is first given by a PDE-constrained optical flow predictor, which is then refined by a physical based corrector. The proposed approach outperforms optical flow methods and shows competitive results compared to existing supervised learning based methods on a benchmark dataset. Furthermore, the proposed approach can generalize to complex real-world fluid scenarios where ground truth information is effectively unknowable. Finally, experiments demonstrate that the physical corrector can refine flow estimates by mimicking the operator splitting method commonly utilised in fluid dynamical simulation.  ( 2 min )
    Business Document Information Extraction: Towards Practical Benchmarks. (arXiv:2206.11229v1 [cs.IR])
    Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.  ( 2 min )
    FedorAS: Federated Architecture Search under system heterogeneity. (arXiv:2206.11239v1 [cs.LG])
    Federated learning (FL) has recently gained considerable attention due to its ability to use decentralised data while preserving privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained (hardware-aware) and unconstrained settings. However, even the most recent work lying at the intersection of NAS and FL assumes a homogeneous compute environment with datacenter-grade hardware and does not address the issues of working with constrained, heterogeneous devices. As a result, practical usage of NAS in a federated setting remains an open problem that we address in our work. We design our system, FedorAS, to discover and train promising architectures when dealing with devices of varying capabilities holding non-IID distributed data, and present empirical evidence of its effectiveness across different settings. Specifically, we evaluate FedorAS across datasets spanning three different modalities (vision, speech, text) and show its better performance compared to state-of-the-art federated solutions, while maintaining resource efficiency.  ( 2 min )
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v2 [stat.ML] UPDATED)
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, and psychometrics. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGMs is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method that simultaneously infers a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused-type lasso penalty. Results on real and synthetic data are presented.  ( 2 min )
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v3 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.  ( 2 min )
    Adversarially trained neural representations may already be as robust as corresponding biological neural representations. (arXiv:2206.11228v1 [q-bio.NC])
    Visual systems of primates are the gold standard of robust perception. There is thus a general belief that mimicking the neural representations that underlie those systems will yield artificial visual systems that are adversarially robust. In this work, we develop a method for performing adversarial visual attacks directly on primate brain activity. We then leverage this method to demonstrate that the above-mentioned belief might not be well founded. Specifically, we report that the biological neurons that make up visual systems of primates exhibit susceptibility to adversarial perturbations that is comparable in magnitude to existing (robustly trained) artificial neural networks.  ( 2 min )
    X-Risk Analysis for AI Research. (arXiv:2206.05862v3 [cs.CY] CROSS LISTED)
    Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more intelligent and powerful AI systems could eventually result in systems that are more powerful than us; some say this is like playing with fire and speculate that this could create existential risks (x-risks). To add precision and ground these discussions, we provide a guide for how to analyze AI x-risk, which consists of three parts: First, we review how systems can be made safer today, drawing on time-tested concepts from hazard analysis and systems safety that have been designed to steer large processes in safer directions. Next, we discuss strategies for having long-term impacts on the safety of future systems. Finally, we discuss a crucial concept in making AI systems safer by improving the balance between safety and general capabilities. We hope this document and the presented concepts and tools serve as a useful guide for understanding how to analyze AI x-risk.
    Introduction to Machine Learning for the Sciences. (arXiv:2102.04883v2 [physics.comp-ph] UPDATED)
    This is an introductory machine-learning course specifically developed with STEM students in mind. Our goal is to provide the interested reader with the basics needed to employ machine learning in their own projects and to familiarize themselves with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, and clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations and using the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce basic notions of value functions and policy learning.
    Encoding large information structures in linear algebra and statistical models. (arXiv:2201.08233v3 [cs.LG] UPDATED)
    Large amounts of information in samples and features can be encoded to speed up the learning of statistical models based on linear algebra and to remove unwanted signals. Encoding information can reduce both the sample and feature dimensions to a smaller representational set. Here, two examples are shown on linear mixed models and mixture models, speeding up the run time for parameter estimation by a factor determined by the user's choice of dimension reduction (which can be linear, quadratic, or beyond, depending on the dimension specification).
    Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements. (arXiv:2104.14526v3 [cs.LG] UPDATED)
    Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems -- tensor completion and tensor regression -- as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.
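    The flavor of the preconditioning is easiest to see in the matrix analogue of ScaledGD from the same line of work (the Tucker-format tensor version in the paper is more involved; this matrix sketch is only an illustration): each factor's gradient is preconditioned by the inverse Gram matrix of the other factor.

    ```python
    import numpy as np

    def scaled_gd_step(L, R, Y, eta=0.5):
        """One ScaledGD step for min ||L R^T - Y||_F^2 / 2 (matrix case)."""
        resid = L @ R.T - Y
        # Preconditioning by the inverse Gram matrix of the other factor
        # makes the step insensitive to ill-conditioning of the factors.
        L_new = L - eta * resid @ R @ np.linalg.inv(R.T @ R)
        R_new = R - eta * resid.T @ L @ np.linalg.inv(L.T @ L)
        return L_new, R_new
    ```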
    Model-free Representation Learning and Exploration in Low-rank MDPs. (arXiv:2102.07035v2 [cs.LG] UPDATED)
    The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v2 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    MMD Aggregated Two-Sample Test. (arXiv:2110.15073v2 [stat.ML] UPDATED)
    We propose a novel nonparametric two-sample test based on the Maximum Mean Discrepancy (MMD), which is constructed by aggregating tests with different kernel bandwidths. This aggregation procedure, called MMDAgg, ensures that test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We work in the non-asymptotic framework, and prove that our aggregated test is minimax adaptive over Sobolev balls. Our guarantees are not restricted to a specific kernel, but hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Moreover, our results apply for popular numerical procedures to determine the test threshold, namely permutations and the wild bootstrap. Through numerical experiments on both synthetic and real-world datasets, we demonstrate that MMDAgg outperforms alternative state-of-the-art approaches to MMD kernel adaptation for two-sample testing.
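    For context, the quantity being aggregated is the kernel MMD estimated at several bandwidths. A small sketch of the biased MMD^2 estimate with a Gaussian kernel over a collection of bandwidths follows (an illustration of the ingredients only, not the authors' full aggregated test with its calibration step):

    ```python
    import numpy as np

    def gaussian_gram(A, B, bw):
        # Pairwise squared distances, then the Gaussian kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2))

    def mmd2_biased(X, Y, bw):
        """Biased (V-statistic) estimate of MMD^2 between samples X and Y."""
        return (gaussian_gram(X, X, bw).mean()
                + gaussian_gram(Y, Y, bw).mean()
                - 2 * gaussian_gram(X, Y, bw).mean())

    X, Y = np.random.randn(50, 2), np.random.randn(50, 2) + 0.5
    stats = {bw: mmd2_biased(X, Y, bw) for bw in [0.5, 1.0, 2.0, 4.0]}
    ```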
    Coin Flipping Neural Networks. (arXiv:2206.09182v2 [cs.LG] UPDATED)
    We show that neural networks with access to randomness can outperform deterministic networks by using amplification. We call such networks Coin-Flipping Neural Networks, or CFNNs. We show that a CFNN can approximate the indicator of a $d$-dimensional ball to arbitrary accuracy with only 2 layers and $\mathcal{O}(1)$ neurons, where a 2-layer deterministic network was shown to require $\Omega(e^d)$ neurons, an exponential improvement (arXiv:1610.09887). We prove a highly non-trivial result, that for almost any classification problem, there exists a trivially simple network that solves it given a sufficiently powerful generator for the network's weights. Combining these results, we conjecture that for most classification problems, there is a CFNN which solves them with higher accuracy or fewer neurons than any deterministic network. Finally, we verify our proofs experimentally using novel CFNN architectures on CIFAR10 and CIFAR100, reaching an improvement of 9.25% over the baseline.
    Automatic Autism Spectrum Disorder Detection Using Artificial Intelligence Methods with MRI Neuroimaging: A Review. (arXiv:2206.11233v1 [q-bio.NC])
    Autism spectrum disorder (ASD) is a brain condition characterized by diverse signs and symptoms that appear in early childhood. ASD is also associated with communication deficits and repetitive behavior in affected individuals. Various ASD detection methods have been developed, including neuroimaging modalities and psychological tests. Among these methods, magnetic resonance imaging (MRI) modalities are of paramount importance to physicians. Clinicians rely on MRI modalities to diagnose ASD accurately. The MRI modalities are non-invasive methods that include functional (fMRI) and structural (sMRI) neuroimaging methods. However, the process of diagnosing ASD with fMRI and sMRI is often laborious and time-consuming for specialists; therefore, several computer-aided diagnosis systems (CADS) based on artificial intelligence (AI) have been developed to assist specialist physicians. Conventional machine learning (ML) and deep learning (DL) are the most popular schemes of AI used for diagnosing ASD. This study aims to review the automated detection of ASD using AI. We review several CADS that have been developed using ML techniques for the automated diagnosis of ASD using MRI modalities. There has been very limited work on the use of DL techniques to develop automated diagnostic models for ASD. A summary of the studies developed using DL is provided in the appendix. Then, the challenges encountered during the automated diagnosis of ASD using MRI and AI techniques are described in detail. Additionally, a graphical comparison of studies using ML and DL to diagnose ASD automatically is discussed. We conclude by suggesting future approaches to detecting ASDs using AI techniques and MRI neuroimaging.
    Dual-Stream Transformer with Cross-Attention on Whole-Slide Image Pyramids for Cancer Prognosis. (arXiv:2206.05782v2 [eess.IV] UPDATED)
    The cancer prognosis on gigapixel Whole-Slide Images (WSIs) has always been a challenging task. Most existing approaches focus solely on single-resolution images. The multi-resolution schemes, utilizing image pyramids to enhance WSI visual representations, have not yet been paid enough attention to. In order to explore a multi-resolution solution for improving cancer prognosis accuracy, this paper proposes a dual-stream architecture to model WSIs by an image pyramid strategy. This architecture consists of two sub-streams: one for low-resolution WSIs, and the other especially for high-resolution ones. Compared to other approaches, our scheme has three highlights: (i) there exists a one-to-one relation between stream and resolution; (ii) a square pooling layer is added to align the patches from two resolution streams, largely reducing computation cost and enabling a natural stream feature fusion; (iii) a cross-attention-based method is proposed to pool high-resolution patches spatially under the guidance of low-resolution ones. We validate our scheme on three publicly-available datasets with a total number of 3,101 WSIs from 1,911 patients. Experimental results verify that (i) hierarchical dual-stream representation is more effective than single-stream ones for cancer prognosis, gaining an average C-Index rise of 5.0% and 1.8% on a single low-resolution and high-resolution stream, respectively; (ii) our dual-stream scheme could outperform current state-of-the-art ones, by an average C-Index improvement of 5.1%; (iii) the cancer diseases with observable survival differences could have different preferences for model complexity. Our scheme could serve as an alternative tool for further facilitating WSI prognosis research.
    Adaptive Adversarial Training to Improve Adversarial Robustness of DNNs for Medical Image Segmentation and Detection. (arXiv:2206.01736v2 [eess.IV] UPDATED)
    It is known that Deep Neural networks (DNNs) are vulnerable to adversarial attacks, and the adversarial robustness of DNNs could be improved by adding adversarial noises to training data (e.g., the standard adversarial training (SAT)). However, inappropriate noises added to training data may reduce a model's performance, which is termed the trade-off between accuracy and robustness. This problem has been sufficiently studied for the classification of whole images but has rarely been explored for image analysis tasks in the medical application domain, including image segmentation, landmark detection, and object detection tasks. In this study, we show that, for those medical image analysis tasks, the SAT method has a severe issue that limits its practical use: it generates a fixed and unified level of noise for all training samples for robust DNN training. A high noise level may lead to a large reduction in model performance and a low noise level may not be effective in improving robustness. To resolve this issue, we design an adaptive-margin adversarial training (AMAT) method that generates sample-wise adaptive adversarial noises for robust DNN training. In contrast to the existing, classification-oriented adversarial training methods, our AMAT method uses a loss-defined-margin strategy so that it can be applied to different tasks as long as the loss functions are well-defined. We successfully apply our AMAT method to state-of-the-art DNNs, using five publicly available datasets. The experimental results demonstrate that: (1) our AMAT method can be applied to the three seemingly different tasks in the medical image application domain; (2) AMAT outperforms the SAT method in adversarial robustness; (3) AMAT has a minimal reduction in prediction accuracy on clean data, compared with the SAT method; and (4) AMAT has almost the same training time cost as SAT.
    Traffic-Twitter Transformer: A Nature Language Processing-joined Framework For Network-wide Traffic Forecasting. (arXiv:2206.11078v1 [cs.LG])
    With accurate and timely traffic forecasting, the impacted traffic conditions can be predicted in advance to guide agencies and residents to respond to changes in traffic patterns appropriately. However, existing works on traffic forecasting mainly relied on historical traffic patterns, confining them to short-term prediction (under 1 hour, for instance). To better manage future roadway capacity and accommodate social and human impacts, it is crucial to propose a flexible and comprehensive framework to predict physical-aware long-term traffic conditions for public users and transportation agencies. In this paper, the gap of robust long-term traffic forecasting was bridged by taking social media features into consideration. A correlation study and a linear regression model were first implemented to evaluate the significance of the correlation between two time series, traffic intensity and Twitter data intensity. The two time series were then fed into our proposed social-aware framework, Traffic-Twitter Transformer, which integrates Natural Language representations into time-series records for long-term traffic prediction. Experimental results in the Greater Seattle Area showed that our proposed model outperformed baseline models on all evaluation metrics. This NLP-joined social-aware framework can become a valuable instrument for network-wide traffic prediction and management for traffic agencies.
    Neural Moving Horizon Estimation for Robust Flight Control. (arXiv:2206.10397v2 [cs.RO] UPDATED)
    Estimating and reacting to external disturbances is crucial for robust flight control of quadrotors. Existing estimators typically require significant tuning for a specific flight scenario or training with extensive real-world data to achieve satisfactory performance. In this paper, we propose a neural moving horizon estimator (NeuroMHE) that can automatically tune the MHE parameters modeled by a neural network and adapt to different flight scenarios. We achieve this by deriving the analytical gradient of the MHE estimates with respect to the tunable parameters, enabling a seamless embedding of MHE as a layer into the neural network for highly effective learning. Most interestingly, we show that the gradient can be solved efficiently from a Kalman filter in a recursive form. Moreover, we develop a model-based policy gradient algorithm to train NeuroMHE directly from the trajectory tracking error without the need for the ground-truth disturbance. The effectiveness of NeuroMHE is verified extensively via both simulations and physical experiments on a quadrotor in various challenging flights. Notably, NeuroMHE outperforms the state-of-the-art estimator with force estimation error reductions of up to 49.4% by using only a 2.5% amount of parameters. The proposed method is general and can be applied to robust adaptive control for other robotic systems.
    Supervised Learning for Coverage-Directed Test Selection in Simulation-Based Verification. (arXiv:2205.08524v2 [cs.AR] UPDATED)
    Constrained random test generation is one of the most widely adopted methods for generating stimuli for simulation-based verification. Randomness leads to test diversity, but tests tend to repeatedly exercise the same design logic. Constraints are written (typically manually) to bias random tests towards interesting, hard-to-reach, and yet-untested logic. However, as verification progresses, most constrained random tests yield little to no effect on functional coverage. If stimuli generation consumes significantly less resources than simulation, then a better approach involves randomly generating a large number of tests, selecting the most effective subset, and only simulating that subset. In this paper, we introduce a novel method for automatic constraint extraction and test selection. This method, which we call coverage-directed test selection, is based on supervised learning from coverage feedback. Our method biases selection towards tests that have a high probability of increasing functional coverage, and prioritises them for simulation. We show how coverage-directed test selection can reduce manual constraint writing, prioritise effective tests, reduce verification resource consumption, and accelerate coverage closure on a large, real-life industrial hardware design.
    Near-optimal control of dynamical systems with neural ordinary differential equations. (arXiv:2206.11120v1 [cs.LG])
    Optimal control problems naturally arise in many scientific applications where one wishes to steer a dynamical system from a certain initial state $\mathbf{x}_0$ to a desired target state $\mathbf{x}^*$ in finite time $T$. Recent advances in deep learning and neural network-based optimization have contributed to the development of methods that can help solve control problems involving high-dimensional dynamical systems. In particular, the framework of neural ordinary differential equations (neural ODEs) provides an efficient means to iteratively approximate continuous time control functions associated with analytically intractable and computationally demanding control tasks. Although neural ODE controllers have shown great potential in solving complex control problems, the understanding of the effects of hyperparameters such as network structure and optimizers on learning performance is still very limited. Our work aims at addressing some of these knowledge gaps to conduct efficient hyperparameter optimization. To this end, we first analyze how truncated and non-truncated backpropagation through time affect runtime performance and the ability of neural networks to learn optimal control functions. Using analytical and numerical methods, we then study the role of parameter initializations, optimizers, and neural-network architecture. Finally, we connect our results to the ability of neural ODE controllers to implicitly regularize control energy.  ( 2 min )
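    A minimal sketch of the kind of setup being studied, assuming the torchdiffeq package for the differentiable ODE solver (the toy plant, controller size, and loss are illustrative placeholders, not the paper's experiments):

    ```python
    import torch
    from torchdiffeq import odeint  # assumed available: pip install torchdiffeq

    class ControlledSystem(torch.nn.Module):
        """dx/dt = f(x) + u(x) with a small neural controller u."""
        def __init__(self, dim=2):
            super().__init__()
            self.u = torch.nn.Sequential(
                torch.nn.Linear(dim, 32), torch.nn.Tanh(),
                torch.nn.Linear(32, dim))

        def forward(self, t, x):
            return -x + self.u(x)  # toy linear plant plus learned control

    sys = ControlledSystem()
    opt = torch.optim.Adam(sys.parameters(), lr=1e-2)
    x0 = torch.tensor([[2.0, -1.0]])
    t = torch.linspace(0.0, 1.0, 25)
    x_target = torch.tensor([[0.5, 0.5]])

    for _ in range(200):
        opt.zero_grad()
        traj = odeint(sys, x0, t)                # differentiable ODE solve
        loss = ((traj[-1] - x_target) ** 2).sum()  # reach target at time T
        loss.backward()                           # backprop through the solver
        opt.step()
    ```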
    Behavior Transformers: Cloning $k$ modes with one stone. (arXiv:2206.11251v1 [cs.LG])
    While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance, multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet  ( 2 min )
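    The discretization-plus-offset idea can be sketched independently of the full transformer architecture: cluster the continuous actions with k-means, predict a cluster id, and add a learned residual (scikit-learn is used here as an assumption; this shows the general idea, not the released BeT code):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    # Cluster continuous actions into k discrete "modes".
    actions = np.random.randn(10_000, 7)           # placeholder demo actions
    k = 64
    km = KMeans(n_clusters=k, n_init=10).fit(actions)

    bins = km.predict(actions)                     # discrete classification targets
    offsets = actions - km.cluster_centers_[bins]  # continuous regression targets

    # A BeT-style head would classify `bins` and regress `offsets`;
    # at inference: action = center[predicted_bin] + predicted_offset.
    ```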
    Discussion of `Multiscale Fisher's Independence Test for Multivariate Dependence'. (arXiv:2206.11142v1 [stat.ME])
    We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as it is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.  ( 2 min )
    MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation. (arXiv:2202.01479v2 [cs.LG] UPDATED)
    We introduce a framework that enables efficient sampling from learned probability distributions for MRI reconstruction. Different from conventional deep learning-based MRI reconstruction techniques, samples are drawn from the posterior distribution given the measured k-space using the Markov chain Monte Carlo (MCMC) method. In addition to the maximum a posteriori (MAP) estimate for the image, which can be obtained with conventional methods, the minimum mean square error (MMSE) estimate and uncertainty maps can also be computed. The data-driven Markov chains are constructed from the generative model learned from a given image database and are independent of the forward operator that is used to model the k-space measurement. This provides flexibility because the method can be applied to k-space acquired with different sampling schemes or receive coils using the same pre-trained models. Furthermore, we use a framework based on a reverse diffusion process to be able to utilize advanced generative models. The performance of the method is evaluated on an open dataset using 10-fold undersampling in k-space.  ( 2 min )
    $k$-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers. (arXiv:2102.04763v2 [cs.LG] UPDATED)
    The protection of private information is a crucial issue in data-driven research and business contexts. Typically, techniques like anonymisation or (selective) deletion are introduced in order to allow data sharing, e.g. in the case of collaborative research endeavours. For use with anonymisation techniques, the $k$-anonymity criterion is one of the most popular, with numerous scientific publications on different algorithms and metrics. Anonymisation techniques often require changing the data and thus necessarily affect the results of machine learning models trained on the underlying data. In this work, we conduct a systematic comparison and detailed investigation into the effects of different $k$-anonymisation algorithms on the results of machine learning models. We investigate a set of popular $k$-anonymisation algorithms with different classifiers and evaluate them on different real-world datasets. Our systematic evaluation shows that with an increasingly strong $k$-anonymity constraint, the classification performance generally degrades, but to varying degrees and strongly depending on the dataset and anonymisation method. Furthermore, Mondrian can be considered as the method with the most appealing properties for subsequent classification.  ( 2 min )
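    As a concrete reference point for the criterion itself: a table is $k$-anonymous with respect to a set of quasi-identifiers if every combination of their values occurs at least $k$ times. A minimal pandas check (the column names are illustrative, not from the paper):

    ```python
    import pandas as pd

    def is_k_anonymous(df, quasi_identifiers, k):
        """True iff every quasi-identifier combination occurs >= k times."""
        return df.groupby(quasi_identifiers).size().min() >= k

    df = pd.DataFrame({
        "age_range": ["20-30", "20-30", "30-40", "30-40"],
        "zip3":      ["123**", "123**", "456**", "456**"],
        "diagnosis": ["A", "B", "A", "C"],
    })
    print(is_k_anonymous(df, ["age_range", "zip3"], k=2))  # True
    ```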
    Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data. (arXiv:2106.11541v2 [cs.LG] UPDATED)
    Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Given that sequences in practice are often very long, this algorithm is not a practical approach. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they have no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization to smoothly approximate the combinatorial constraints. Combining it with the objective of balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), where a gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. By using the stochastic gradient descent algorithm, which has much lower time and space complexities, for optimization, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences, the performances of our models are evaluated and compared with those of existing methods. The experimental results validate the advantages of the proposed models. Our Matlab source code is available on GitHub.  ( 3 min )
    Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting. (arXiv:2206.09112v2 [cs.LG] UPDATED)
    We all depend on mobility, and vehicular transportation affects the daily lives of most of us. Thus, the ability to forecast the state of traffic in a road network is an important functionality and a challenging task. Traffic data is often obtained from sensors deployed in a road network. Recent proposals on spatial-temporal graph neural networks have achieved great progress at modeling complex spatial-temporal correlations in traffic data, by modeling traffic data as a diffusion process. However, intuitively, traffic data encompasses two different kinds of hidden time series signals, namely the diffusion signals and inherent signals. Unfortunately, nearly all previous works coarsely consider traffic signals entirely as the outcome of the diffusion, while neglecting the inherent signals, which impacts model performance negatively. To improve modeling performance, we propose a novel Decoupled Spatial-Temporal Framework (DSTF) that separates the diffusion and inherent traffic information in a data-driven manner, which encompasses a unique estimation gate and a residual decomposition mechanism. The separated signals can be handled subsequently by the diffusion and inherent modules separately. Further, we propose an instantiation of DSTF, Decoupled Dynamic Spatial-Temporal Graph Neural Network (D2STGNN), that captures spatial-temporal correlations and also features a dynamic graph learning module that targets the learning of the dynamic characteristics of traffic networks. Extensive experiments with four real-world traffic datasets demonstrate that the framework is capable of advancing the state-of-the-art.  ( 3 min )
    MedFilter: Improving Extraction of Task-relevant Utterances from Doctor-Patient Conversations through Integration of Discourse Structure and Ontological Knowledge. (arXiv:2010.02246v3 [cs.CL] UPDATED)
    Information extraction from conversational data is particularly challenging because the task-centric nature of conversation allows for effective communication of implicit information by humans, but is challenging for machines. The challenges may differ between utterances depending on the role of the speaker within the conversation, especially when relevant expertise is distributed asymmetrically across roles. Further, the challenges may also increase over the conversation as more shared context is built up through information communicated implicitly earlier in the dialogue. In this paper, we propose the novel modeling approach MedFilter, which addresses these insights in order to increase performance at identifying and categorizing task-relevant utterances, and in so doing, positively impacts performance at a downstream information extraction task. We evaluate this approach on a corpus of nearly 7,000 doctor-patient conversations where MedFilter is used to identify medically relevant contributions to the discussion (achieving a 10% improvement over SOTA baselines in terms of area under the PR curve). Identifying task-relevant utterances benefits downstream medical processing, achieving improvements of 15%, 105%, and 23% respectively for the extraction of symptoms, medications, and complaints.  ( 2 min )
    Beyond RMSE: Do machine-learned models of road user interaction produce human-like behavior?. (arXiv:2206.11110v1 [cs.LG])
    Autonomous vehicles use a variety of sensors and machine-learned models to predict the behavior of surrounding road users. Most of the machine-learned models in the literature focus on quantitative error metrics like the root mean square error (RMSE) to learn and report their models' capabilities. This focus on quantitative error metrics tends to ignore the more important behavioral aspect of the models, raising the question of whether these models really predict human-like behavior. Thus, we propose to analyze the output of machine-learned models much like we would analyze human data in conventional behavioral research. We introduce quantitative metrics to demonstrate the presence of three different behavioral phenomena in a naturalistic highway driving dataset: 1) the kinematics-dependence of who passes a merging point first; 2) lane changes by an on-highway vehicle to accommodate an on-ramp vehicle; and 3) lane changes by vehicles on the highway to avoid lead-vehicle conflicts. Then, we analyze the behavior of three machine-learned models using the same metrics. Even though the models' RMSE values differed, all the models captured the kinematics-dependent merging behavior but struggled, to varying degrees, to capture the more nuanced courtesy lane-change and highway lane-change behaviors. Additionally, the collision-aversion analysis during lane changes showed that the models struggled to capture a physical aspect of human driving: leaving an adequate gap between vehicles. Thus, our analysis highlights the inadequacy of simple quantitative metrics and the need to take a broader behavioral perspective when analyzing machine-learned models of human driving predictions.
    Minimizing Control for Credit Assignment with Strong Feedback. (arXiv:2204.07249v2 [cs.NE] UPDATED)
    The success of deep learning ignited interest in whether the brain learns hierarchical representations using gradient-based learning. However, current biologically plausible methods for gradient-based credit assignment in deep neural networks need infinitesimally small feedback signals, which is problematic in biologically realistic noisy environments and at odds with experimental evidence in neuroscience showing that top-down feedback can significantly influence neural activity. Building upon deep feedback control (DFC), a recently proposed credit assignment method, we combine strong feedback influences on neural activity with gradient-based learning and show that this naturally leads to a novel view on neural network optimization. Instead of gradually changing the network weights towards configurations with low output loss, weight updates gradually minimize the amount of feedback required from a controller that drives the network to the supervised output label. Moreover, we show that the use of strong feedback in DFC allows learning forward and feedback connections simultaneously, using learning rules fully local in space and time. We complement our theoretical results with experiments on standard computer-vision benchmarks, showing competitive performance to backpropagation as well as robustness to noise. Overall, our work presents a fundamentally novel view of learning as control minimization, while sidestepping biologically unrealistic assumptions.
    A Novel Three-Dimensional Navigation Method for the Visually Impaired. (arXiv:2206.11136v1 [cs.HC])
    According to the World Health Organization, visual impairment is estimated to affect approximately 2.2 billion people worldwide. The visually impaired must currently rely on navigational aids to replace their sense of sight, like a white cane or GPS (Global Positioning System) based navigation, both of which fail to work well indoors. The white cane cannot be used to determine a user's position within a room, while GPS can often lose connection indoors and does not provide orientation information, making both approaches unsuitable for indoor use. Therefore, this research seeks to develop a 3D-imaging solution that enables contactless navigation through a complex indoor environment. The device can pinpoint a user's position and orientation with 31% less error compared to previous approaches while requiring only 53.1% of the memory and processing 125% faster. The device can also detect obstacles with 60.2% more accuracy than the previous state-of-the-art models while requiring only 41% of the memory and processing 260% faster. In testing with human participants, the device allowed a 94.5% reduction in collisions with obstacles in the environment and a 48.3% increase in walking speed, showing that it enables safer and more rapid navigation for the visually impaired. All in all, this research demonstrates a 3D-based navigation system for the visually impaired. The approach can be used by a wide variety of mobile low-power devices, like cell phones, ensuring this research remains accessible to all.  ( 2 min )
    Optimal transport meets noisy label robust loss and MixUp regularization for domain adaptation. (arXiv:2206.11180v1 [cs.CV])
    It is common in computer vision to be confronted with domain shift: images which have the same class but different acquisition conditions. In domain adaptation (DA), one wants to classify unlabeled target images using labeled source images. Unfortunately, deep neural networks trained on a source training set perform poorly on target images which do not belong to the training domain. One strategy to improve this performance is to align the source and target image distributions in an embedded space using optimal transport (OT). However, OT can cause negative transfer, i.e., aligning samples with different labels, which leads to overfitting, especially in the presence of label shift between domains. In this work, we mitigate negative alignment by explaining it as a noisy label assignment to target images. We then mitigate its effect through appropriate regularization. We propose to couple the MixUp regularization \citep{zhang2018mixup} with a loss that is robust to noisy labels in order to improve domain adaptation performance. We show in an extensive ablation study that a combination of the two techniques is critical to achieve improved performance. Finally, we evaluate our method, called \textsc{mixunbot}, on several benchmarks and real-world DA problems.  ( 2 min )
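    As an illustration of the two ingredients, the sketch below combines MixUp with a noise-robust classification loss. We use the generalized cross-entropy (GCE) loss as a stand-in robust loss under our own assumptions; the paper couples MixUp with its own choice of noisy-label robust loss.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, target, q=0.7):
    """Generalized cross-entropy: robust to noisy labels (q -> 0 gives CE)."""
    p = F.softmax(logits, dim=1).gather(1, target[:, None]).squeeze(1)
    return ((1.0 - p.clamp_min(1e-7) ** q) / q).mean()

def mixup_robust_step(model, x, y, alpha=0.2):
    """One training step with MixUp inputs and a noise-robust loss."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    logits = model(x_mix)
    return lam * gce_loss(logits, y) + (1.0 - lam) * gce_loss(logits, y[perm])

model = torch.nn.Linear(10, 3)
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
loss = mixup_robust_step(model, x, y)
loss.backward()
```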
    VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives. (arXiv:2206.11212v1 [cs.CV])
    Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model's internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model's in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning. All supporting code is available at https://github.com/zfying/visfis  ( 3 min )
    On the Role of Spatial, Spectral, and Temporal Processing for DNN-based Non-linear Multi-channel Speech Enhancement. (arXiv:2206.11181v1 [eess.AS])
    Employing deep neural networks (DNNs) to directly learn filters for multi-channel speech enhancement has two potential key advantages over a traditional approach that combines a linear spatial filter with an independent tempo-spectral post-filter: 1) non-linear spatial filtering can overcome restrictions originating from a linear processing model, and 2) joint processing of spatial and tempo-spectral information can exploit interdependencies between different sources of information. A variety of DNN-based non-linear filters have been proposed recently, with good reported enhancement performance. However, little is known about their internal mechanisms, which turns network architecture design into a game of chance. Therefore, in this paper, we perform experiments to better understand the internal processing of spatial, spectral, and temporal information by DNN-based non-linear filters. On the one hand, our experiments in a difficult speech extraction scenario confirm the importance of non-linear spatial filtering, which outperforms an oracle linear spatial filter by 0.24 POLQA score. On the other hand, we demonstrate that joint processing results in a large performance gap of 0.4 POLQA score between network architectures exploiting spectral versus temporal information besides spatial information.  ( 2 min )
    Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. (arXiv:2203.05482v2 [cs.LG] UPDATED)
    The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low-error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model attains 90.94% top-1 accuracy on ImageNet, a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to the flatness of the loss and the confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.  ( 3 min )
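    The recipe itself is a one-liner per parameter: average the weights of the fine-tuned models. Below is a minimal sketch of a "uniform soup" in PyTorch, assuming all checkpoints share the same architecture; the paper also studies a "greedy soup" that adds each model only if held-out accuracy improves.

```python
import torch

def uniform_soup(model, checkpoint_paths):
    """Average the weights of several fine-tuned checkpoints in place."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    soup = {k: v / len(checkpoint_paths) for k, v in soup.items()}
    model.load_state_dict(soup)
    return model
```

    A greedy variant would sort checkpoints by validation accuracy and keep a running average, accepting a new checkpoint only when adding it does not hurt held-out accuracy.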
    Then and Now: Quantifying the Longitudinal Validity of Self-Disclosed Depression Diagnoses. (arXiv:2206.11155v1 [cs.LG])
    Self-disclosed mental health diagnoses, which serve as ground truth annotations of mental health status in the absence of clinical measures, underpin the conclusions behind most computational studies of mental health language from the last decade. However, psychiatric conditions are dynamic; a prior depression diagnosis may no longer be indicative of an individual's mental health, either due to treatment or other mitigating factors. We ask: to what extent are self-disclosures of mental health diagnoses actually relevant over time? We analyze recent activity from individuals who disclosed a depression diagnosis on social media over five years ago and, in turn, acquire a new understanding of how presentations of mental health status on social media manifest longitudinally. We also provide expanded evidence for the presence of personality-related biases in datasets curated using self-disclosed diagnoses. Our findings motivate three practical recommendations for improving mental health datasets curated using self-disclosed diagnoses: 1) Annotate diagnosis dates and psychiatric comorbidities; 2) Sample control groups using propensity score matching; 3) Identify and remove spurious correlations introduced by selection bias.  ( 2 min )
    Contextual Semantic Embeddings for Ontology Subsumption Prediction. (arXiv:2112.10006v4 [cs.LG] UPDATED)
    Automating ontology construction and curation is an important but challenging task in knowledge engineering and artificial intelligence. Prediction by machine learning techniques such as contextual semantic embedding is a promising direction, but the relevant research is still preliminary especially for expressive ontologies in Web Ontology Language (OWL). In this paper, we present a new subsumption prediction method named BERTSubs for classes of OWL ontology. It exploits the pre-trained language model BERT to compute contextual embeddings of a class, where customized templates are proposed to incorporate the class context (e.g., neighbouring classes) and the logical existential restriction. BERTSubs is quite general, being able to predict multiple kinds of subsumers including named classes and existential restrictions from the same ontology or another ontology. Extensive evaluation on five real-world ontologies for three different subsumption tasks has shown the effectiveness of the templates and that BERTSubs can dramatically outperform the baselines that use (literal-aware) knowledge graph embeddings, non-contextual word embeddings and the state-of-the-art OWL ontology embeddings.
    Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks. (arXiv:2206.11057v1 [cs.LG])
    The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-based representations only, and they yield better performance on superfamily membership across various metrics.  ( 2 min )
    Explanation-based Counterfactual Retraining (XCR): A Calibration Method for Black-box Models. (arXiv:2206.11126v1 [cs.LG])
    With the rapid development of eXplainable Artificial Intelligence (XAI), a long line of past work has raised concerns about the Out-of-Distribution (OOD) problem in perturbation-based post-hoc XAI models and about explanations being socially misaligned. We explore the limitations of post-hoc explanation methods that use approximators to mimic the behavior of black-box models. We then propose eXplanation-based Counterfactual Retraining (XCR), which extracts feature importance quickly. XCR applies the explanations generated by the XAI model as counterfactual input to retrain the black-box model, addressing the OOD and social misalignment problems. Evaluation on popular image datasets shows that XCR can improve model performance while retaining only 12.5% of the most crucial features, without changing the black-box model structure. Furthermore, evaluation on benchmark corruption datasets shows that XCR is very helpful for improving model robustness and positively impacts calibration under OOD conditions. Even without being calibrated on a validation set, as some OOD calibration methods are, XCR outperforms existing methods on the corrupted-data metric. Our method also beats current OOD calibration methods on the OOD calibration metric when calibration on the validation set is applied.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v3 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Noisy $\ell^{0}$-Sparse Subspace Clustering on Dimensionality Reduced Data. (arXiv:2206.11079v1 [stat.ML])
    Sparse subspace clustering methods with sparsity induced by the $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC)~\citep{YangFJYH16-L0SSC-ijcv}, are demonstrated to be more effective than their $\ell^{1}$ counterparts such as Sparse Subspace Clustering (SSC)~\citep{ElhamifarV13}. However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and may lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves the subspace detection property (SDP), a key element with which data from different subspaces are separated, under both the deterministic and the semi-random model. Our results provide a theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data for the first time, revealing the advantage of noisy $\ell^{0}$-SSC in terms of a much less restrictive condition on subspace affinity. To improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC, which provably recovers the subspaces on dimensionality-reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower-dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the projected data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.  ( 2 min )
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v3 [stat.ML] UPDATED)
    Researchers may perform regressions using a sketch of data of size $m$ instead of the full sample of size $n$ for a variety of reasons. This paper considers the case when the regression errors do not have constant variance and heteroskedasticity-robust standard errors would normally be needed for test statistics to provide accurate inference. We show that estimates using data sketched by random projections will behave "as if" the errors were homoskedastic. Estimation by random sampling would not have this property. The result arises because the sketched estimates in the case of random projections can be expressed as degenerate $U$-statistics, and under certain conditions, these statistics are asymptotically normal with homoskedastic variance. We verify that the conditions hold not only in the case of least squares regression when the covariates are exogenous, but also in instrumental variables estimation when the covariates are endogenous. The result implies that inference, including first-stage F tests for instrument relevance, can be simpler than in the full sample case if the sketching scheme is appropriately chosen.  ( 2 min )
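    A minimal numerical illustration of sketching by random projection, under our own assumptions (a Gaussian sketch matrix and plain OLS); the paper's results cover broader projection schemes and the instrumental-variables case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 3, 500

X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])
# Heteroskedastic errors: the variance depends on the first covariate.
e = rng.normal(size=n) * np.sqrt(0.5 + X[:, 0] ** 2)
y = X @ beta + e

# Gaussian random projection sketch of size m << n.
S = rng.normal(size=(m, n)) / np.sqrt(m)
Xs, ys = S @ X, S @ y

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_sketch = np.linalg.lstsq(Xs, ys, rcond=None)[0]
print(beta_full)    # close to (1, -2, 0.5)
print(beta_sketch)  # also close; the sketched errors behave "as if" homoskedastic
```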
    Algorithms that get old: the case of generative deep neural networks. (arXiv:2202.03008v2 [stat.ML] UPDATED)
    Generative deep neural networks used in machine learning, like Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), produce a new object each time they are asked to, under the constraint that the new objects remain similar to a list of examples given as input. However, this behavior is unlike that of human artists, who change their style as time goes by and seldom return to their initial creations. We investigate a situation where VAEs are used to sample from a probability measure described by some empirical dataset. Based on recent works on Radon-Sobolev statistical distances, we propose a numerical paradigm, to be used in conjunction with a generative algorithm, that satisfies the following two requirements: the objects created do not repeat, and they evolve to fill the entire target probability measure.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v3 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the total number of items. To improve effectiveness for deep learning, we further propose practical strategies using an initial warm-up and a stop-gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms have been proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
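    To make the ranking metric amenable to gradient methods, a common device is to replace the hard rank with a smooth, sigmoid-based approximation. The sketch below shows such a generic differentiable NDCG surrogate in PyTorch under our own assumptions; the paper optimizes its particular surrogate with compositional stochastic algorithms, which this snippet does not reproduce.

```python
import torch

def smooth_ndcg(scores, rels, tau=1.0):
    """Differentiable NDCG surrogate using sigmoid-approximated ranks."""
    # Approximate rank_i = 1 + sum_{j != i} 1[s_j > s_i] with sigmoids;
    # subtract sigmoid(0) to remove the j == i term.
    diff = (scores[None, :] - scores[:, None]) / tau
    ranks = 1.0 + torch.sigmoid(diff).sum(dim=1) - torch.sigmoid(torch.zeros(1))
    gains = (2.0 ** rels - 1.0) / torch.log2(1.0 + ranks)
    # Ideal DCG from the true relevance ordering (a constant w.r.t. scores).
    ideal, _ = torch.sort(rels, descending=True)
    idcg = ((2.0 ** ideal - 1.0) / torch.log2(
        2.0 + torch.arange(len(rels), dtype=scores.dtype))).sum()
    return gains.sum() / idcg

scores = torch.randn(10, requires_grad=True)
rels = torch.randint(0, 3, (10,)).float()
loss = -smooth_ndcg(scores, rels)   # maximize NDCG by minimizing its negative
loss.backward()
```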
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v1 [cs.LG])
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler-Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.  ( 2 min )
    SMT-DTA: Improving Drug-Target Affinity Prediction with Semi-supervised Multi-task Training. (arXiv:2206.09818v2 [q-bio.BM] UPDATED)
    Drug-Target Affinity (DTA) prediction is an essential task for drug discovery and pharmaceutical research. Accurate predictions of DTA can greatly benefit the design of new drugs. As wet experiments are costly and time-consuming, the supervised data for DTA prediction is extremely limited. This seriously hinders the application of deep learning based methods, which require large-scale supervised data. To address this challenge and improve DTA prediction accuracy, we propose a framework with several simple yet effective strategies: (1) a multi-task training strategy, which jointly trains on the DTA prediction task and a masked language modeling (MLM) task over the paired drug-target dataset; (2) a semi-supervised training method that empowers drug and target representation learning by leveraging large-scale unpaired molecules and proteins in training, which differs from previous pre-training and fine-tuning methods that only utilize molecules or proteins in pre-training; and (3) a cross-attention module to enhance the interaction between drug and target representations. Extensive experiments are conducted on three real-world benchmark datasets: BindingDB, DAVIS and KIBA. The results show that our framework significantly outperforms existing methods and achieves state-of-the-art performance, e.g., $0.712$ RMSE on the BindingDB IC$_{50}$ measurement, more than a $5\%$ improvement over the previous best work. In addition, case studies on specific drug-target binding activities, drug feature visualizations, and real-world applications demonstrate the great potential of our work. The code and data are released at https://github.com/QizhiPei/SMT-DTA  ( 3 min )
    Fast Aquatic Swimmer Optimization with Differentiable Projective Dynamics and Neural Network Hydrodynamic Models. (arXiv:2204.12584v2 [cs.RO] UPDATED)
    Aquatic locomotion is a classic fluid-structure interaction (FSI) problem of interest to biologists and engineers. Solving the fully coupled FSI equations for incompressible Navier-Stokes and finite elasticity is computationally expensive. Optimizing robotic swimmer design within such a system generally involves cumbersome, gradient-free procedures on top of the already costly simulation. To address this challenge, we present a novel, fully differentiable hybrid approach to FSI that combines a 2D direct numerical simulation for the deformable solid structure of the swimmer and a physics-constrained neural network surrogate to capture hydrodynamic effects of the fluid. For the deformable solid simulation of the swimmer's body, we use state-of-the-art techniques from the field of computer graphics to speed up the finite-element method (FEM). For the fluid simulation, we use a U-Net architecture trained with a physics-based loss function to predict the flow field at each time step. The pressure and velocity field outputs from the neural network are sampled around the boundary of our swimmer using an immersed boundary method (IBM) to compute its swimming motion accurately and efficiently. We demonstrate the computational efficiency and differentiability of our hybrid simulator on a 2D carangiform swimmer. Due to differentiability, the simulator can be used for computational design of controls for soft bodies immersed in fluids via direct gradient-based optimization.  ( 3 min )
    Federated Adaptation of Reservoirs via Intrinsic Plasticity. (arXiv:2206.11087v1 [cs.NE])
    We propose a novel algorithm for performing federated learning with Echo State Networks (ESNs) in a client-server scenario. In particular, our proposal focuses on the adaptation of reservoirs by combining Intrinsic Plasticity with Federated Averaging. The former is a gradient-based method for adapting the reservoir's non-linearity in a local and unsupervised manner, while the latter provides the framework for learning in the federated scenario. We evaluate our approach on real-world datasets from human monitoring, in comparison with the previous approach for federated ESNs existing in the literature. Results show that adapting the reservoir with our algorithm provides a significant improvement in the performance of the global model.  ( 2 min )
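    A bare-bones sketch of the federated half of the proposal: each client locally adapts its reservoir parameters (here a hypothetical local_intrinsic_plasticity step standing in for the IP update), and the server averages them weighted by client data size. This is our own illustration of Federated Averaging, not the authors' code.

```python
import numpy as np

def local_intrinsic_plasticity(params, data):
    """Hypothetical client-side step standing in for the IP update.

    The real Intrinsic Plasticity rule adapts the gain and bias of the
    reservoir's tanh neurons, locally and unsupervised; we leave it as a
    placeholder and focus on the server-side averaging.
    """
    gain, bias = params
    return gain.copy(), bias.copy()

def fedavg(client_params, client_sizes):
    """Server step: data-size-weighted average of client parameters."""
    total = float(sum(client_sizes))
    gain = sum(n * g for (g, _), n in zip(client_params, client_sizes)) / total
    bias = sum(n * b for (_, b), n in zip(client_params, client_sizes)) / total
    return gain, bias

n_units = 100
global_params = (np.ones(n_units), np.zeros(n_units))   # reservoir gain, bias
clients = [(np.random.randn(50, 8), 50), (np.random.randn(200, 8), 200)]

for round_ in range(10):  # federated rounds
    updates = [local_intrinsic_plasticity(global_params, data)
               for data, _ in clients]
    global_params = fedavg(updates, [n for _, n in clients])
```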
    RetrievalGuard: Provably Robust 1-Nearest Neighbor Image Retrieval. (arXiv:2206.11225v1 [cs.IR])
    Recent research works have shown that image retrieval models are vulnerable to adversarial attacks, where slightly modified test inputs could lead to problematic retrieval results. In this paper, we aim to design a provably robust image retrieval model which keeps the most important evaluation metric Recall@1 invariant to adversarial perturbation. We propose the first 1-nearest neighbor (NN) image retrieval algorithm, RetrievalGuard, which is provably robust against adversarial perturbations within an $\ell_2$ ball of calculable radius. The challenge is to design a provably robust algorithm that takes into consideration the 1-NN search and the high-dimensional nature of the embedding space. Algorithmically, given a base retrieval model and a query sample, we build a smoothed retrieval model by carefully analyzing the 1-NN search procedure in the high-dimensional embedding space. We show that the smoothed retrieval model has bounded Lipschitz constant and thus the retrieval score is invariant to $\ell_2$ adversarial perturbations. Experiments on image retrieval tasks validate the robustness of our RetrievalGuard method.  ( 2 min )
    Cold Posteriors through PAC-Bayes. (arXiv:2206.11173v1 [cs.LG])
    We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For both regression and classification tasks, in the case of isotropic Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures the cold posterior effect.  ( 2 min )
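    For concreteness, one common objective of this flavor (an Alquier-style PAC-Bayes bound; our paraphrase, not necessarily the exact bound used in the paper) makes the role of the temperature explicit: with probability at least $1-\delta$, simultaneously for all posteriors $\rho$ over hypotheses,

```latex
\mathbb{E}_{h \sim \rho}[L(h)]
  \le \mathbb{E}_{h \sim \rho}[\hat{L}_n(h)]
  + \frac{1}{\lambda}\Big[\mathrm{KL}(\rho \,\|\, \pi) + \ln\tfrac{1}{\delta} + \Psi(\lambda, n)\Big],
```

    where $\pi$ is a prior and $\Psi$ is a moment term. Minimizing the right-hand side over $\rho$ is structurally an ELBO with a free temperature $\lambda$, whose Gibbs minimizer is $\rho(h) \propto \pi(h)\exp(-\lambda \hat{L}_n(h))$; nothing in the bound forces $\lambda$ to the value implied by standard Bayesian inference, which is why tempered posteriors arise naturally in this view.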
    Multi-Modality Image Super-Resolution using Generative Adversarial Networks. (arXiv:2206.09193v2 [eess.IV] UPDATED)
    Over the past few years, deep learning-based techniques such as Generative Adversarial Networks (GANs) have significantly improved solutions to image super-resolution and image-to-image translation problems. In this paper, we propose a solution to the joint problem of image super-resolution and multi-modality image-to-image translation. The problem can be stated as the recovery of a high-resolution image in one modality, given a low-resolution observation of the same image in an alternative modality. We offer two models to address this problem and evaluate them on the recovery of high-resolution day images given low-resolution night images of the same scene. Promising qualitative and quantitative results are presented for each model.  ( 2 min )
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v1 [cs.LG])
    We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC) and to give a probabilistic upper bound on the classification error of the EC. We also determine the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with ReLU activation function.  ( 2 min )
    tntorch: Tensor Network Learning with PyTorch. (arXiv:2206.11128v1 [cs.LG])
    We present tntorch, a tensor learning framework that supports multiple decompositions (including Candecomp/Parafac, Tucker, and Tensor Train) under a unified interface. With our library, the user can learn and handle low-rank tensors with automatic differentiation, seamless GPU support, and the convenience of PyTorch's API. Besides decomposition algorithms, tntorch implements differentiable tensor algebra, rank truncation, cross-approximation, batch processing, comprehensive tensor arithmetic, and more.  ( 2 min )
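    To give a flavor of the kind of differentiable tensor learning the library automates, here is a from-scratch PyTorch sketch of fitting a tensor-train (TT) decomposition by gradient descent. This deliberately avoids guessing tntorch's API and just illustrates the underlying idea.

```python
import torch

def tt_full(cores):
    """Contract TT cores G_k of shape (r_{k-1}, n_k, r_k) into a full tensor."""
    out = cores[0]                                   # (1, n_1, r_1)
    for core in cores[1:]:
        out = torch.einsum("...a,abc->...bc", out, core)
    return out.squeeze(0).squeeze(-1)

shape, rank = (8, 8, 8), 4
target = tt_full([torch.randn(1, 8, rank), torch.randn(rank, 8, rank),
                  torch.randn(rank, 8, 1)]).detach()

# Fit random TT cores to the target with autograd, the same mechanism a
# library like tntorch exposes through PyTorch's optimizers.
cores = [torch.nn.Parameter(0.1 * torch.randn(1, 8, rank)),
         torch.nn.Parameter(0.1 * torch.randn(rank, 8, rank)),
         torch.nn.Parameter(0.1 * torch.randn(rank, 8, 1))]
opt = torch.optim.Adam(cores, lr=0.05)
for step in range(500):
    loss = (tt_full(cores) - target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```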
    Neural Inverse Transform Sampler. (arXiv:2206.11172v1 [cs.LG])
    Any explicit functional representation $f$ of a density is hampered by two main obstacles when we wish to use it as a generative model: designing $f$ so that sampling is fast, and estimating $Z = \int f$ so that $Z^{-1}f$ integrates to 1. This becomes increasingly complicated as $f$ itself becomes complicated. In this paper, we show that when modeling one-dimensional conditional densities with a neural network, $Z$ can be exactly and efficiently computed by letting the network represent the cumulative distribution function of a target density, and applying a generalized fundamental theorem of calculus. We also derive a fast algorithm for sampling from the resulting representation by the inverse transform method. By extending these principles to higher dimensions, we introduce the \textbf{Neural Inverse Transform Sampler (NITS)}, a novel deep learning framework for modeling and sampling from general, multidimensional, compactly-supported probability densities. NITS is a highly expressive density estimator that boasts end-to-end differentiability, fast sampling, and exact and cheap likelihood evaluation. We demonstrate the applicability of NITS by applying it to realistic, high-dimensional density estimation tasks: likelihood-based generative modeling on the CIFAR-10 dataset, and density estimation on the UCI suite of benchmark datasets, where NITS produces compelling results rivaling or surpassing the state of the art.  ( 2 min )
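    The one-dimensional core of the idea can be sketched directly: represent the CDF with a monotone network, obtain the density by autograd (the generalized fundamental-theorem-of-calculus step), and sample by numerically inverting the CDF. This is our own minimal illustration under simplifying assumptions, not the NITS reference implementation.

```python
import torch
import torch.nn as nn

class MonotoneCDF(nn.Module):
    """A monotone network squashed to (0, 1): a valid CDF surrogate."""
    def __init__(self, hidden=32):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(hidden, 1))
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(1, hidden))

    def forward(self, x):
        # Positive (exponentiated) weights + increasing activations => monotone.
        h = torch.tanh(x @ self.w1.exp().T + self.b1)
        return torch.sigmoid(h @ self.w2.exp().T)

    def density(self, x):
        x = x.requires_grad_(True)
        F = self.forward(x).sum()
        return torch.autograd.grad(F, x, create_graph=True)[0]  # f = dF/dx

def inverse_transform_sample(cdf, n, lo=-10.0, hi=10.0, iters=40):
    """Sample by bisecting F(x) = u for u drawn in the attainable CDF range."""
    lo = torch.full((n, 1), lo); hi = torch.full((n, 1), hi)
    with torch.no_grad():
        f_lo, f_hi = cdf(lo), cdf(hi)
        u = f_lo + torch.rand(n, 1) * (f_hi - f_lo)
        for _ in range(iters):
            mid = (lo + hi) / 2
            too_low = cdf(mid) < u
            lo = torch.where(too_low, mid, lo)
            hi = torch.where(too_low, hi, mid)
    return (lo + hi) / 2

cdf = MonotoneCDF()
samples = inverse_transform_sample(cdf, n=1000)
# Maximum-likelihood training would use cdf.density(x) as the exact,
# cheaply evaluated likelihood.
```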
    reStructured Pre-training. (arXiv:2206.11147v1 [cs.CL])
    In this work, we try to decipher the internal connections of NLP technology development over the past decades, searching for their essence, which rewards us with a (potential) new learning paradigm for NLP tasks, dubbed reStructured Pre-training (RST). In this paradigm, the role of data is re-emphasized, and model pre-training and fine-tuning of downstream tasks are viewed as a process of data storing and accessing. Based on that, we operationalize the simple principle that a good storage mechanism should not only cache a large amount of data but also allow for easy access. We achieve this by pre-training models over restructured data consisting of a variety of valuable information instead of raw data, after overcoming several engineering challenges. Experimentally, RST models not only surpass strong competitors (e.g., T0) on 52 of 55 popular datasets from a variety of NLP tasks, but also achieve superior performance on the National College Entrance Examination - English (Gaokao-English), the most authoritative examination in China. Specifically, the proposed system, Qin, scores 40 points higher than the average student and 15 points higher than GPT3 with 1/16 of the parameters. In particular, Qin gets a high score of 138.5 (the full mark is 150) on the 2018 English exam (national paper III). We have released the Gaokao Benchmark with an online submission platform. In addition, we tested our model on the 2022 College Entrance Examination English that took place a few days ago (2022.06.08), and it gets a total score of 134 (vs. GPT3's 108).  ( 2 min )
    StaDRe and StaDRo: Reliability and Robustness Estimation of ML-based Forecasting using Statistical Distance Measures. (arXiv:2206.11116v1 [cs.LG])
    Reliability estimation of Machine Learning (ML) models is becoming a crucial subject. This is particularly the case when such models are deployed in safety-critical applications, as the decisions based on model predictions can result in hazardous situations. In this regard, recent research has proposed methods to achieve safe, dependable, and reliable ML systems. One such method consists of detecting and analyzing distributional shift, and then measuring how such systems respond to these shifts. This was proposed in earlier work in SafeML. This work focuses on the use of SafeML for time series data, and on reliability and robustness estimation of ML-forecasting methods using statistical distance measures. To this end, distance measures based on the Empirical Cumulative Distribution Function (ECDF) proposed in SafeML are explored to measure Statistical-Distance Dissimilarity (SDD) across time series. We then propose SDD-based Reliability Estimate (StaDRe) and SDD-based Robustness (StaDRo) measures. With the help of a clustering technique, the similarity between the statistical properties of data seen during training and the forecasts is identified. The proposed method is capable of providing a link between dataset SDD and Key Performance Indicators (KPIs) of the ML models.  ( 2 min )
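    The ECDF-based distances at the heart of the method are easy to illustrate; the snippet below computes Kolmogorov-Smirnov and Wasserstein-style ECDF distances between a training window and a forecast window, as a hedged stand-in for the paper's SDD measures.

```python
import numpy as np

def ecdf_distances(train_window, forecast_window):
    """Empirical-CDF dissimilarity between two 1-D samples."""
    grid = np.sort(np.concatenate([train_window, forecast_window]))
    F = np.searchsorted(np.sort(train_window), grid, side="right") / len(train_window)
    G = np.searchsorted(np.sort(forecast_window), grid, side="right") / len(forecast_window)
    ks = np.max(np.abs(F - G))             # Kolmogorov-Smirnov distance
    w1 = np.trapz(np.abs(F - G), grid)     # L1 / Wasserstein-1 style distance
    return ks, w1

rng = np.random.default_rng(0)
seen = rng.normal(0.0, 1.0, 500)           # statistics seen during training
drifted = rng.normal(0.8, 1.3, 500)        # a forecast window under drift
print(ecdf_distances(seen, seen[:250]))    # small distances: reliable regime
print(ecdf_distances(seen, drifted))       # large distances flag unreliability
```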
    KeyCLD: Learning Constrained Lagrangian Dynamics in Keypoint Coordinates from Images. (arXiv:2206.11030v1 [cs.LG])
    We present KeyCLD, a framework to learn Lagrangian dynamics from images. Learned keypoints represent semantic landmarks in images and can directly represent state dynamics. Interpreting this state as Cartesian coordinates coupled with explicit holonomic constraints, allows expressing the dynamics with a constrained Lagrangian. Our method explicitly models kinetic and potential energy, thus allowing energy based control. We are the first to demonstrate learning of Lagrangian dynamics from images on the dm_control pendulum, cartpole and acrobot environments. This is a step forward towards learning Lagrangian dynamics from real-world images, since previous work in literature was only applied to minimalistic images with monochromatic shapes on empty backgrounds. Please refer to our project page for code and additional results: https://rdaems.github.io/keycld/  ( 2 min )
    A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta. (arXiv:2206.11124v1 [cs.LG])
    Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper, we develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and batch sizes. Our key idea is to describe the loss value sequence in terms of its generating function, which can be written in a compact form assuming a diagonal approximation for the second moments of model weights. By analyzing this generating function, we deduce various conclusions on the convergence conditions, phase structure of the model, and optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase, at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate, though its value must be limited for any finite batch size to avoid divergence; 3) the optimal convergence rate can be achieved at a negative momentum. We verify our theoretical predictions with extensive experiments on MNIST and synthetic problems, and find good quantitative agreement.  ( 2 min )
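    Since common optimizer implementations typically validate momentum >= 0 (e.g., torch.optim.SGD rejects negative values), a hand-rolled heavy-ball update makes the negative-momentum finding easy to try. A minimal sketch on a toy linear least-squares problem, under our own settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def run_sgd(momentum, lr=0.05, batch=32, steps=2000):
    w, v = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        idx = rng.integers(0, n, batch)
        g = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # mini-batch gradient
        v = momentum * v - lr * g                     # heavy-ball velocity
        w = w + v
    return 0.5 * np.mean((X @ w - y) ** 2)

for beta in (-0.3, 0.0, 0.9):   # negative momentum is allowed here
    print(beta, run_sgd(beta))
```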
    Understanding and Extending Subgraph GNNs by Rethinking Their Symmetries. (arXiv:2206.11140v1 [cs.LG])
    Subgraph GNNs are a recent class of expressive Graph Neural Networks (GNNs) which model graphs as collections of subgraphs. So far, the design space of possible Subgraph GNN architectures as well as their basic theoretical properties are still largely unexplored. In this paper, we study the most prominent form of subgraph methods, which employs node-based subgraph selection policies such as ego-networks or node marking and deletion. We address two central questions: (1) What is the upper bound of the expressive power of these methods? and (2) What is the family of equivariant message passing layers on these sets of subgraphs? Our first step in answering these questions is a novel symmetry analysis which shows that modelling the symmetries of node-based subgraph collections requires a significantly smaller symmetry group than the one adopted in previous works. This analysis is then used to establish a link between Subgraph GNNs and Invariant Graph Networks (IGNs). We answer the questions above by first bounding the expressive power of subgraph methods by 3-WL, and then proposing a general family of message-passing layers for subgraph methods that generalises all previous node-based Subgraph GNNs. Finally, we design a novel Subgraph GNN dubbed SUN, which theoretically unifies previous architectures while providing better empirical performance on multiple benchmarks.  ( 2 min )
    3D Instance Segmentation of MVS Buildings. (arXiv:2112.09902v2 [cs.CV] UPDATED)
    We present a novel 3D instance segmentation framework for Multi-View Stereo (MVS) buildings in urban scenes. Unlike existing works focusing on semantic segmentation of urban scenes, the emphasis of this work lies in detecting and segmenting 3D building instances even if they are attached and embedded in a large and imprecise 3D surface model. Multi-view RGB images are first enhanced to RGBH images by adding a heightmap and are segmented to obtain all roof instances using a fine-tuned 2D instance segmentation neural network. Instance masks from different multi-view images are then clustered into global masks. Our mask clustering accounts for spatial occlusion and overlapping, which can eliminate segmentation ambiguities among multi-view images. Based on these global masks, 3D roof instances are segmented out by mask back-projections and extended to the entire building instances through a Markov random field optimization. A new dataset that contains instance-level annotation for both 3D urban scenes (roofs and buildings) and drone images (roofs) is provided. To the best of our knowledge, it is the first outdoor dataset dedicated to 3D instance segmentation with much more annotations of attached 3D buildings than existing datasets. Quantitative evaluations and ablation studies have shown the effectiveness of all major steps and the advantages of our multi-view framework over the orthophoto-based method.  ( 3 min )
    Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. (arXiv:2108.02717v2 [cs.LG] UPDATED)
    The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.  ( 2 min )
    Constant-Factor Approximation Algorithms for Socially Fair $k$-Clustering. (arXiv:2206.11210v1 [cs.DS])
    We study approximation algorithms for the socially fair $(\ell_p, k)$-clustering problem with $m$ groups, whose special cases include the socially fair $k$-median ($p=1$) and socially fair $k$-means ($p=2$) problems. We present (1) a polynomial-time $(5+2\sqrt{6})^p$-approximation with at most $k+m$ centers; (2) a $(5+2\sqrt{6}+\epsilon)^p$-approximation with $k$ centers in time $n^{2^{O(p)}\cdot m^2}$; and (3) a $(15+6\sqrt{6})^p$-approximation with $k$ centers in time $k^{m}\cdot\text{poly}(n)$. The first result is obtained via a refinement of the iterative rounding method using a sequence of linear programs. The latter two results are obtained by converting a solution with up to $k+m$ centers to one with $k$ centers, using sparsification methods for (2) and an exhaustive search for (3). We also compare the performance of our algorithms with existing bicriteria algorithms as well as algorithms that use exactly $k$ centers on benchmark datasets, and find that our algorithms also outperform existing methods in practice.  ( 2 min )
    Generic E-Variables for Exact Sequential k-Sample Tests that allow for Optional Stopping. (arXiv:2106.02693v3 [stat.ME] UPDATED)
    We develop E-variables for testing whether two or more data streams come from the same source or not, and more generally, whether the difference between the sources is larger than some minimal effect size. These E-variables lead to exact, nonasymptotic tests that remain safe, i.e., keep their type-I error guarantees, under flexible sampling scenarios such as optional stopping and continuation. In special cases our E-variables also have an optimal 'growth' property under the alternative. While the construction is generic, we illustrate it through the special case of $k \times 2$ contingency tables, where we also allow for the incorporation of different restrictions on a composite alternative. Comparison to p-value analysis in simulations and a real-world example shows that E-variables, through their flexibility, often allow for early stopping of data collection, thereby retaining similar power as classical methods, while also retaining the option of extending or combining data afterwards.  ( 2 min )
    Data-Augmented Contact Model for Rigid Body Simulation. (arXiv:1803.04019v4 [cs.RO] UPDATED)
    Accurately modeling contact behaviors for real-world, near-rigid materials remains a grand challenge for existing rigid-body physics simulators. This paper introduces a data-augmented contact model that incorporates analytical solutions with observed data to predict the 3D contact impulse, which can result in rigid bodies bouncing, sliding or spinning in all directions. Our method enhances the expressiveness of the standard Coulomb contact model by learning the contact behaviors from the observed data, while preserving the fundamental contact constraints whenever possible. For example, a classifier is trained to approximate the transitions between static and dynamic friction, while the non-penetration constraint during collision is enforced analytically. Our method computes the aggregated effect of contact for the entire rigid body, instead of predicting the contact force for each contact point individually, maintaining the same simulation speed as the number of contact points increases for detailed geometries. Supplemental video: https://shorturl.at/eilwX Keywords: Physics Simulation Algorithms, Dynamics Learning, Contact Learning  ( 2 min )
    Langevin Monte Carlo for Contextual Bandits. (arXiv:2206.11254v1 [cs.LG])
    We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrates that directly sampling from the posterior is both computationally efficient and competitive in performance.  ( 2 min )
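    The sampling step itself is just noisy gradient descent on the negative log-posterior. Below is a minimal sketch for a Bayesian linear-bandit posterior with our own toy potential and step sizes; LMC-TS wraps such updates inside the bandit loop, and the paper analyzes the resulting regret.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_neg_log_posterior(theta, X, y, prior_var=10.0):
    """Gaussian likelihood + Gaussian prior for a linear reward model."""
    return X.T @ (X @ theta - y) + theta / prior_var

def langevin_sample(theta, X, y, step=1e-3, n_steps=50):
    """Unadjusted Langevin: noisy gradient steps that target the posterior."""
    for _ in range(n_steps):
        g = grad_neg_log_posterior(theta, X, y)
        theta = theta - step * g + np.sqrt(2 * step) * rng.normal(size=theta.shape)
    return theta

# Thompson sampling loop: play the arm that looks best under the sample.
d, arms = 5, [rng.normal(size=5) for _ in range(10)]
theta_true = rng.normal(size=d)
X_hist, y_hist, theta = np.zeros((0, d)), np.zeros(0), np.zeros(d)
for t in range(200):
    theta = langevin_sample(theta, X_hist, y_hist)   # posterior sample
    a = max(arms, key=lambda x: x @ theta)           # greedy w.r.t. the sample
    r = a @ theta_true + 0.1 * rng.normal()          # observe noisy reward
    X_hist = np.vstack([X_hist, a]); y_hist = np.append(y_hist, r)
```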
    On Uniform Boundedness Properties of SGD and its Momentum Variants. (arXiv:2201.10245v2 [cs.LG] UPDATED)
    A theoretical, and potentially also practical, problem with stochastic gradient descent is that trajectories may escape to infinity. In this note, we investigate uniform boundedness properties of iterates and function values along the trajectories of the stochastic gradient descent algorithm and its important momentum variant. Under smoothness and $R$-dissipativity of the loss function, we show that broad families of step-sizes, including the widely used step-decay and cosine with (or without) restart step-sizes, result in uniformly bounded iterates and function values. Several important applications that satisfy these assumptions, including phase retrieval problems, Gaussian mixture models, and some neural network classifiers, are discussed in detail. We further extend the uniform boundedness of SGD and its momentum variant under the generalized dissipativity for the functions whose tails grow slower than quadratic functions. This includes some interesting applications, for example, Bayesian logistic regression and logistic regression with $\ell_1$ regularization.  ( 2 min )
    Human Pose Estimation from Sparse Inertial Measurements through Recurrent Graph Convolution. (arXiv:2107.11214v3 [cs.CV] UPDATED)
    Conventional methods for human pose estimation either require a high degree of instrumentation, by relying on many inertial measurement units (IMUs), or constrain the recording space, by relying on extrinsic cameras. These deficits are tackled through the approach of human pose estimation from sparse IMU data. We define adjacency adaptive graph convolutional long-short term memory networks (AAGC-LSTM) to tackle human pose estimation based on six IMUs, while incorporating the human body graph structure directly into the network. The AAGC-LSTM combines both spatial and temporal dependency in a single network operation, and does so more memory-efficiently than previous approaches. This is made possible by equipping graph convolutions with adjacency adaptivity, which eliminates the problem of information loss in deep or recurrent graph networks, while also allowing unknown dependencies between the human body joints to be learned. To further boost accuracy, we propose longitudinal loss weighting to consider natural movement patterns. With our presented approach, we are able to utilize the inherent graph nature of the human body, and thus can outperform the state of the art (SOTA) for human pose estimation from sparse IMU data.  ( 2 min )
    OpenXAI: Towards a Transparent Evaluation of Model Explanations. (arXiv:2206.11104v1 [cs.LG])
    While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open-source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating the faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first-ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards, are publicly available at https://open-xai.github.io/.
    Large-scale multi-objective influence maximisation with network downscaling. (arXiv:2204.06250v3 [cs.SI] UPDATED)
    Finding the most influential nodes in a network is a computationally hard problem with several possible applications in various kinds of network-based problems. While several methods have been proposed for tackling the influence maximisation (IM) problem, their runtime typically scales poorly when the network size increases. Here, we propose an original method, based on network downscaling, that allows a multi-objective evolutionary algorithm (MOEA) to solve the IM problem on a reduced scale network, while preserving the relevant properties of the original network. The downscaled solution is then upscaled to the original network, using a mechanism based on centrality metrics such as PageRank. Our results on eight large networks (including two with $\sim$50k nodes) demonstrate the effectiveness of the proposed method with a more than 10-fold runtime gain compared to the time needed on the original network, and an up to $82\%$ time reduction compared to CELF.  ( 2 min )
    Active Learning with Safety Constraints. (arXiv:2206.11183v1 [cs.LG])
    Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe vs suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real world datasets.
    Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime. (arXiv:2206.09901v2 [math.OC] UPDATED)
    The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than the usual worst-case results. In exchange, this analysis requires a more precise hypothesis about the data-generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. A priori information about this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of worst-case convergence analysis and the restrictiveness of the previous average-case analysis. We also introduce the Generalized Chebyshev method, which is asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v1 [cs.LG])
Missing values are unavoidable in many applications of machine learning and present a challenge both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, independent models do not make efficient use of all available data. Conversely, fitting a shared model to the full data set typically relies on imputation, which may be suboptimal when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which makes predictions that a) are robust to missing values at test time, b) maintain or improve the predictive power of pattern submodels, and c) have a short description, enabling improved interpretability. We identify cases where sharing is provably optimal, even when missingness itself is predictive and when the prediction target depends on unobserved variables. Classification and regression experiments on synthetic data and two healthcare data sets demonstrate that our models achieve a favorable trade-off between pattern specialization and information sharing.
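For contrast with the sharing approach proposed here, a minimal sketch of the baseline it builds on, fully independent submodels fitted per missingness pattern (toy data; the paper's sharing mechanism is not implemented):

```python
# One logistic submodel per missingness pattern, on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X[rng.random(X.shape) < 0.3] = np.nan        # inject recurring missingness

patterns = {}
for mask in {tuple(row) for row in np.isnan(X)}:
    rows = (np.isnan(X) == np.array(mask)).all(axis=1)
    cols = ~np.array(mask)                   # observed columns for this pattern
    if cols.any() and len(set(y[rows])) > 1:
        patterns[mask] = LogisticRegression().fit(X[rows][:, cols], y[rows])

def predict(x):
    mask = tuple(np.isnan(x))
    model = patterns.get(mask)
    if model is None:                        # unseen pattern: majority fallback
        return int(round(y.mean()))
    return int(model.predict(x[~np.array(mask)].reshape(1, -1))[0])

print(predict(np.array([0.2, np.nan, 1.0])))
```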
    S2RL: Do We Really Need to Perceive All States in Deep Multi-Agent Reinforcement Learning?. (arXiv:2206.11054v1 [cs.LG])
Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications, where each agent makes a decision based on its own observation. Most mainstream methods treat each local observation as an entirety when modeling the decentralized local utility functions. However, they ignore the fact that local observation information can be further divided into several entities, and that only part of these entities is helpful for model inference. Moreover, the importance of different entities may change over time. To improve the performance of decentralized policies, the attention mechanism is used to capture features of local information. Nevertheless, existing attention models rely on dense fully connected graphs and cannot single out the important states. To this end, we propose a sparse-state-based MARL (S2RL) framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations. The local utility functions are estimated through the self-attention and sparse attention mechanisms separately, then combined into a standard joint value function and an auxiliary joint value function in the central critic. We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods. Extensive experiments on StarCraft II show that S2RL can significantly improve the performance of many state-of-the-art methods.
    Answer Fast: Accelerating BERT on the Tensor Streaming Processor. (arXiv:2206.11062v1 [cs.LG])
Transformers have become a predominant machine learning workload: they are not only the de facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems such as machine translation and web search. These real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several non-linear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the tensor streaming processor. By carefully fusing all the non-linear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 $\mu$s for a batch-1 inference through BERT-base, which is 6X faster than the current state of the art.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v2 [cs.GT] UPDATED)
One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring \emph{public} contextual information of bidders and items into the auction learning framework. We propose $\mathtt{CITransNet}$, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that $\mathtt{CITransNet}$ can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Data-Free Quantization with Accurate Activation Clipping and Adaptive Batch Normalization. (arXiv:2204.04215v2 [cs.LG] UPDATED)
Data-free quantization is a task that compresses the neural network to low bit-width without access to the original training data. Most existing data-free quantization methods cause severe performance degradation due to inaccurate activation clipping ranges and quantization error, especially for low bit-widths. In this paper, we present a simple yet effective data-free quantization method with accurate activation clipping and adaptive batch normalization. Accurate activation clipping (AAC) improves model accuracy by exploiting accurate activation information from the full-precision model. Adaptive batch normalization addresses the quantization error arising from distribution changes by updating the batch normalization layer adaptively. Extensive experiments demonstrate that the proposed data-free quantization method can yield surprisingly strong performance, achieving 64.33% top-1 accuracy with ResNet18 on the ImageNet dataset, a 3.7% absolute improvement over the existing state-of-the-art methods.
    LiT: Zero-Shot Transfer with Locked-image text Tuning. (arXiv:2111.07991v3 [cs.CV] UPDATED)
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.
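The core recipe is easy to sketch: freeze the image tower and train only the text tower with a CLIP-style symmetric contrastive loss. The towers below are toy stand-ins and the loss details are an assumption, not the paper's exact setup:

```python
# Locked-image tuning sketch: frozen image tower, trainable text tower.
import torch
import torch.nn.functional as F

image_tower = torch.nn.Linear(2048, 512)   # stand-in for a pre-trained image model
text_tower = torch.nn.Linear(768, 512)     # stand-in for a text encoder

for p in image_tower.parameters():         # "lock" the image tower
    p.requires_grad = False

opt = torch.optim.Adam(text_tower.parameters(), lr=1e-4)

def contrastive_step(image_feats, text_feats, temperature=0.07):
    img = F.normalize(image_tower(image_feats), dim=-1)
    txt = F.normalize(text_tower(text_feats), dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(logits.size(0))  # matched pairs lie on the diagonal
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(contrastive_step(torch.randn(8, 2048), torch.randn(8, 768)))
```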
    $C^*$-algebra Net: A New Approach Generalizing Neural Network Parameters to $C^*$-algebra. (arXiv:2206.09513v2 [stat.ML] UPDATED)
We propose a new framework that generalizes the parameters of neural network models to $C^*$-algebra-valued ones. A $C^*$-algebra is a generalization of the space of complex numbers; a typical example is the space of continuous functions on a compact space. This generalization enables us to combine multiple models continuously and to use tools for functions such as regression and integration. Consequently, we can learn features of data efficiently and adapt the models to problems continuously. We apply our framework to practical problems such as density estimation and few-shot learning and show that it enables us to learn features of data even with a limited number of samples. Our new framework highlights the potential of applying the theory of $C^*$-algebra to general neural network models.
    Restless and Uncertain: Robust Policies for Restless Bandits via Deep Multi-Agent Reinforcement Learning. (arXiv:2107.01689v2 [cs.LG] UPDATED)
We introduce robustness in \textit{restless multi-armed bandits} (RMABs), a popular model for constrained resource allocation among independent stochastic processes (arms). Nearly all RMAB techniques assume stochastic dynamics are precisely known. However, in many real-world settings, dynamics are estimated with significant \emph{uncertainty}, e.g., via historical data, which can lead to bad outcomes if ignored. To address this, we develop an algorithm to compute minimax-regret-robust policies for RMABs. Our approach uses a double oracle framework (oracles for \textit{agent} and \textit{nature}), which is often used for single-process robust planning but requires significant new techniques to accommodate the combinatorial nature of RMABs. Specifically, we design a deep reinforcement learning (RL) algorithm, DDLPO, which tackles the combinatorial challenge by learning an auxiliary "$\lambda$-network" in tandem with policy networks per arm, greatly reducing sample complexity, with guarantees on convergence. DDLPO, of general interest, implements our reward-maximizing agent oracle. We then tackle the challenging regret-maximizing nature oracle, a non-stationary RL challenge, by formulating it as a multi-agent RL problem between a policy optimizer and adversarial nature. This formulation is of general interest -- we solve it for RMABs by creating a multi-agent extension of DDLPO with a shared critic. We show our approaches work well in three experimental domains.
    sqSGD: Locally Private and Communication Efficient Federated Learning. (arXiv:2206.10565v2 [cs.LG] UPDATED)
Federated learning (FL) is a technique that trains machine learning models from decentralized data sources. We study FL under local notions of privacy constraints, which provide strong protection against sensitive data disclosures by obfuscating the data before it leaves the client. We identify two major concerns in designing practical privacy-preserving FL algorithms: communication efficiency and high-dimensional compatibility. We then develop a gradient-based learning algorithm called \emph{sqSGD} (selective quantized stochastic gradient descent) that addresses both concerns. The proposed algorithm is based on a novel privacy-preserving quantization scheme that uses a constant number of bits per dimension per client. We then improve the base algorithm in three ways: first, we apply a gradient subsampling strategy that simultaneously offers better training performance and smaller communication costs under a fixed privacy budget. Second, we utilize randomized rotation as a preprocessing step to reduce quantization error. Third, an adaptive gradient norm upper bound shrinkage strategy is adopted to improve accuracy and stabilize training. Finally, the practicality of the proposed framework is demonstrated on benchmark datasets. Experimental results show that sqSGD successfully learns large models like LeNet and ResNet with local privacy constraints. In addition, with a fixed privacy and communication level, the performance of sqSGD significantly dominates that of various baseline algorithms.
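A sketch of the kind of fixed-budget per-dimension quantizer such a scheme builds on: stochastic rounding to 2^b levels inside a clipping range. The privacy noise, randomized rotation, and subsampling steps from the paper are omitted:

```python
# b-bit stochastic quantizer with a constant number of bits per dimension.
import torch

def quantize(g, bits=2, clip=1.0):
    levels = 2 ** bits - 1
    g = g.clamp(-clip, clip)
    scaled = (g + clip) / (2 * clip) * levels   # map to [0, levels]
    low = scaled.floor()
    q = low + torch.bernoulli(scaled - low)     # stochastic rounding (unbiased)
    return q / levels * 2 * clip - clip         # map back to [-clip, clip]

g = torch.randn(10)
print(g)
print(quantize(g))
```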
    Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution. (arXiv:2206.09114v2 [cs.CV] UPDATED)
    Visual grounding is a task that aims to locate a target object according to a natural language expression. As a multi-modal task, feature interaction between textual and visual inputs is vital. However, previous solutions mainly handle each modality independently before fusing them together, which does not take full advantage of relevant textual information while extracting visual features. To better leverage the textual-visual relationship in visual grounding, we propose a Query-conditioned Convolution Module (QCM) that extracts query-aware visual features by incorporating query information into the generation of convolutional kernels. With our proposed QCM, the downstream fusion module receives visual features that are more discriminative and focused on the desired object described in the expression, leading to more accurate predictions. Extensive experiments on three popular visual grounding datasets demonstrate that our method achieves state-of-the-art performance. In addition, the query-aware visual features are informative enough to achieve comparable performance to the latest methods when directly used for prediction without further multi-modal fusion.
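A hypothetical sketch of a query-conditioned convolution: the query embedding is mapped to per-sample convolution kernels, applied with a grouped convolution so the visual features become query-aware. The dimensions and kernel generator are our assumptions:

```python
# Query-conditioned convolution: kernels generated from the query embedding.
import torch
import torch.nn.functional as F

class QueryConditionedConv(torch.nn.Module):
    def __init__(self, q_dim, channels, k=3):
        super().__init__()
        self.channels, self.k = channels, k
        self.to_kernel = torch.nn.Linear(q_dim, channels * channels * k * k)

    def forward(self, feats, query):       # feats: (B,C,H,W), query: (B,q_dim)
        B, C, H, W = feats.shape
        kernels = self.to_kernel(query).view(B * C, C, self.k, self.k)
        # Grouped conv applies each sample's own kernels to its own features.
        out = F.conv2d(feats.view(1, B * C, H, W), kernels,
                       padding=self.k // 2, groups=B)
        return out.view(B, C, H, W)

qcm = QueryConditionedConv(q_dim=256, channels=64)
print(qcm(torch.randn(2, 64, 16, 16), torch.randn(2, 256)).shape)
```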
    Multi-Modality Image Inpainting using Generative Adversarial Networks. (arXiv:2206.09210v2 [eess.IV] UPDATED)
Deep learning techniques, especially Generative Adversarial Networks (GANs), have significantly improved image inpainting and image-to-image translation tasks over the past few years. To the best of our knowledge, the problem of combining the image inpainting task with multi-modality image-to-image translation remains unaddressed. In this paper, we propose a model to address this problem. The model is evaluated on combined night-to-day image translation and inpainting, with promising qualitative and quantitative results.
    Few-Max: Few-Shot Domain Adaptation for Unsupervised Contrastive Representation Learning. (arXiv:2206.10137v2 [cs.CV] UPDATED)
Contrastive self-supervised learning methods learn to map data points such as images into a non-parametric representation space without requiring labels. While highly successful, current methods require a large amount of data in the training phase. In situations where the target training set is limited in size, generalization is known to be poor. Pretraining on a large source data set and fine-tuning on the target samples is prone to overfitting in the few-shot regime, where only a small number of target samples are available. Motivated by this, we propose a domain adaptation method for self-supervised contrastive learning, termed Few-Max, to address the issue of adaptation to a target distribution under few-shot learning. To quantify the representation quality, we evaluate Few-Max on a range of source and target datasets, including ImageNet, VisDA, and fastMRI, on which Few-Max consistently outperforms other approaches.
    Exploring Longitudinal Cough, Breath, and Voice Data for COVID-19 Progression Prediction via Sequential Deep Learning: Model Development and Validation. (arXiv:2201.01232v2 [cs.SD] UPDATED)
Recent work has shown the potential of using audio data (e.g., cough, breathing, and voice) in the screening for COVID-19. However, these approaches only focus on one-off detection and detect the infection given the current audio sample, but do not monitor disease progression in COVID-19. Limited exploration has been put forward to continuously monitor COVID-19 progression, especially recovery, through longitudinal audio data. Tracking disease progression characteristics could lead to more timely treatment. The primary objective of this study is to explore the potential of longitudinal audio samples over time for COVID-19 progression prediction and, especially, recovery trend prediction using sequential deep learning techniques. Crowdsourced respiratory audio data, including breathing, cough, and voice samples, from 212 individuals over 5-385 days were analyzed. We developed a deep learning-enabled tracking tool using gated recurrent units (GRUs) to detect COVID-19 progression by exploring the audio dynamics of the individuals' historical audio biomarkers. The investigation comprised 2 parts: (1) COVID-19 detection in terms of positive and negative (healthy) tests, and (2) longitudinal disease progression prediction over time in terms of probability of positive tests. The strong performance for COVID-19 detection, yielding an AUROC of 0.79, a sensitivity of 0.75, and a specificity of 0.71, supported the effectiveness of the approach compared to methods that do not leverage longitudinal dynamics. We further examined the predicted disease progression trajectory, which displayed high consistency with test results, with a correlation of 0.75 in the test cohort and 0.86 in a subset of the test cohort who reported recovery. Our findings suggest that monitoring COVID-19 evolution via longitudinal audio data has potential in the tracking of individuals' disease progression and recovery.
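A minimal sketch of a GRU-based tracker over longitudinal audio features of the kind described; the audio feature extractor and the data are placeholders:

```python
# GRU over a sequence of daily audio biomarkers -> per-day P(positive test).
import torch

class ProgressionTracker(torch.nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.gru = torch.nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, days, feat_dim)
        out, _ = self.gru(x)
        return torch.sigmoid(self.head(out)).squeeze(-1)

model = ProgressionTracker()
audio_feats = torch.randn(4, 30, 128)       # 4 users, 30 days of features
print(model(audio_feats).shape)             # (4, 30) probabilities over time
```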
    Beyond the Quadratic Approximation: the Multiscale Structure of Neural Network Loss Landscapes. (arXiv:2204.11326v3 [cs.LG] UPDATED)
A quadratic approximation of neural network loss landscapes has been extensively used to study the optimization process of these networks. Though it usually holds in a very small neighborhood of the minimum, it cannot explain many phenomena observed during the optimization process. In this work, we study the structure of neural network loss functions and its implication on optimization in a region beyond the reach of a good quadratic approximation. Numerically, we observe that neural network loss functions possess a multiscale structure, manifested in two ways: (1) in a neighborhood of minima, the loss mixes a continuum of scales and grows subquadratically, and (2) in a larger region, the loss shows several separate scales clearly. Using the subquadratic growth, we are able to explain the Edge of Stability phenomenon [5] observed for the gradient descent (GD) method. Using the separate scales, we explain the working mechanism of learning rate decay with simple examples. Finally, we study the origin of the multiscale structure and propose that the non-convexity of the models and the non-uniformity of training data are among the causes. By constructing a two-layer neural network problem, we show that training data with different magnitudes give rise to different scales of the loss function, producing subquadratic growth and multiple separate scales.
    Adversarial Masking for Self-Supervised Learning. (arXiv:2201.13100v2 [cs.CV] UPDATED)
We propose ADIOS, a masked image model (MIM) framework for self-supervised learning, which simultaneously learns a masking function and an image encoder using an adversarial objective. The image encoder is trained to minimise the distance between the representation of the original image and that of a masked image. The masking function, conversely, aims at maximising this distance. ADIOS consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets -- including classification on ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and iNaturalist, as well as robustness evaluated on the backgrounds challenge (Xiao et al., 2021) -- while generating semantically meaningful masks. Unlike modern MIM models such as MAE, BEiT and iBOT, ADIOS does not rely on the image-patch tokenisation construction of Vision Transformers, and can be implemented with convolutional backbones. We further demonstrate that the masks learned by ADIOS are more effective in improving representation learning of SSL methods than masking schemes used in popular MIM models.
    Saute RL: Almost Surely Safe Reinforcement Learning Using State Augmentation. (arXiv:2202.06558v3 [cs.LG] UPDATED)
Satisfying safety constraints almost surely (or with probability one) can be critical for the deployment of Reinforcement Learning (RL) in real-life applications. For example, plane landing and take-off should ideally occur with probability one. We address the problem by introducing Safety Augmented (Saute) Markov Decision Processes (MDPs), where the safety constraints are eliminated by augmenting them into the state-space and reshaping the objective. We show that Saute MDP satisfies the Bellman equation and moves us closer to solving Safe RL with constraints satisfied almost surely. We argue that Saute MDP allows viewing the Safe RL problem from a different perspective enabling new features. For instance, our approach has a plug-and-play nature, i.e., any RL algorithm can be "Sauteed". Additionally, state augmentation allows for policy generalization across safety constraints. We finally show that Saute RL algorithms can outperform their state-of-the-art counterparts when constraint satisfaction is of high importance.
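A minimal sketch of the state-augmentation idea: carry the remaining safety budget in the state and reshape the objective when it is exhausted. The wrapped env is assumed to expose a reset()/step() interface that reports a per-step cost in info; the exact reshaping in the paper may differ:

```python
# Safety-budget state augmentation ("sauteing") for an arbitrary env.
class SauteWrapper:
    def __init__(self, env, safety_budget):
        self.env, self.budget0 = env, safety_budget

    def reset(self):
        self.budget = self.budget0
        return (self.env.reset(), self.budget / self.budget0)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.budget -= info.get("cost", 0.0)    # assumed per-step safety cost
        if self.budget <= 0:                    # budget exhausted: reshape
            reward, done = -1.0, True
        return (obs, self.budget / self.budget0), reward, done, info
```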
    Universum-inspired Supervised Contrastive Learning. (arXiv:2204.10695v2 [cs.LG] UPDATED)
Mixup is an efficient data augmentation method which generates additional samples through convex combinations of original data points and labels. Although theoretically dependent on data properties, Mixup is reported to perform well as a regularizer and calibrator, contributing reliable robustness and generalization to neural network training. In this paper, inspired by Universum Learning, which uses out-of-class samples to assist the target tasks, we investigate Mixup from a largely under-explored perspective - the potential to generate in-domain samples that belong to none of the target classes, that is, universum. We find that in the framework of supervised contrastive learning, universum-style Mixup produces surprisingly high-quality hard negatives, greatly relieving the need for a large batch size in contrastive learning. With these findings, we propose Universum-inspired Contrastive learning (UniCon), which incorporates the Mixup strategy to generate universum data as hard negatives and pushes them apart from anchor samples of the target classes. Our approach not only improves Mixup with hard labels, but also introduces a novel measure to generate universum data. With a linear classifier on the learned representations, our method achieves 81.68% top-1 accuracy on CIFAR-100, surpassing the state of the art by a significant margin of 5% with a much smaller batch size, typically 256 in UniCon vs. 1024 in SupCon using ResNet-50.
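A sketch of universum-style Mixup as described: mixing samples across different classes yields in-domain points belonging to none of the target classes, usable as hard negatives for every anchor in a contrastive batch (a simplified reading, not the authors' exact recipe):

```python
# Cross-class Mixup produces "universum" samples with no target-class label.
import torch

def universum_mixup(x, y, lam=0.5):
    perm = torch.randperm(x.size(0))
    keep = y != y[perm]                     # only mix across different classes
    return lam * x[keep] + (1 - lam) * x[perm][keep]

x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))
negatives = universum_mixup(x, y)           # hard negatives for a contrastive loss
print(negatives.shape)
```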
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v2 [cs.AR] UPDATED)
Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several millions of tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.
    Deep reinforcement learning for fMRI prediction of Autism Spectrum Disorder. (arXiv:2206.11224v1 [q-bio.NC])
Purpose : Because functional MRI (fMRI) data sets are in general small, we sought a data efficient approach to resting state fMRI classification of autism spectrum disorder (ASD) versus neurotypical (NT) controls. We hypothesized that a Deep Reinforcement Learning (DRL) classifier could learn effectively on a small fMRI training set. Methods : We trained a Deep Reinforcement Learning (DRL) classifier on 100 graph-label pairs from the Autism Brain Imaging Data Exchange (ABIDE) database. For comparison, we trained a Supervised Deep Learning (SDL) classifier on the same training set. Results : DRL significantly outperformed SDL, with a p-value of 2.4 x 10^(-7). DRL achieved superior results for a variety of classifier performance metrics, including an F1 score of 76, versus 67 for SDL. Whereas SDL quickly overfit the training data, DRL learned in a progressive manner that generalised to the separate testing set. Conclusion : DRL can learn to classify ASD versus NT in a data efficient manner, doing so for a small training set. Future work will involve optimizing the neural network for data efficiency and applying the approach to other fMRI data sets, namely for brain cancer patients.
    exploRNN: Understanding Recurrent Neural Networks through Visual Exploration. (arXiv:2012.06326v3 [cs.LG] UPDATED)
    Due to the success of deep learning (DL) and its growing job market, students and researchers from many areas are interested in learning about DL technologies. Visualization has proven to be of great help during this learning process. While most current educational visualizations are targeted towards one specific architecture or use case, recurrent neural networks (RNNs), which are capable of processing sequential data, are not covered yet. This is despite the fact that tasks on sequential data, such as text and function analysis, are at the forefront of DL research. Therefore, we propose exploRNN, the first interactively explorable educational visualization for RNNs. On the basis of making learning easier and more fun, we define educational objectives targeted towards understanding RNNs. We use these objectives to form guidelines for the visual design process. By means of exploRNN, which is accessible online, we provide an overview of the training process of RNNs at a coarse level, while also allowing a detailed inspection of the data flow within LSTM cells. In an empirical study, we assessed 37 subjects in a between-subjects design to investigate the learning outcomes and cognitive load of exploRNN compared to a classic text-based learning environment. While learners in the text group are ahead in superficial knowledge acquisition, exploRNN is particularly helpful for deeper understanding of the learning content. In addition, the complex content in exploRNN is perceived as significantly easier and causes less extraneous load than in the text group. The study shows that for difficult learning material such as recurrent networks, where deep understanding is important, interactive visualizations such as exploRNN can be helpful.
    Multi-hop RIS-Empowered Terahertz Communications: A DRL-based Hybrid Beamforming Design. (arXiv:2101.09137v2 [eess.SP] UPDATED)
Wireless communication in the TeraHertz band (0.1--10 THz) is envisioned as one of the key enabling technologies for future sixth generation (6G) wireless communication systems, scaled up beyond massive multiple input multiple output (Massive-MIMO) technology. However, very high propagation attenuation and molecular absorption at THz frequencies often limit the signal transmission distance and coverage range. Benefiting from the recent breakthrough in reconfigurable intelligent surfaces (RIS) for realizing smart radio propagation environments, we propose a novel hybrid beamforming scheme for multi-hop RIS-assisted communication networks to improve the coverage range at THz-band frequencies. In particular, multiple passive and controllable RISs are deployed to assist the transmissions between the base station (BS) and multiple single-antenna users. We investigate the joint design of the digital beamforming matrix at the BS and the analog beamforming matrices at the RISs, leveraging recent advances in deep reinforcement learning (DRL) to combat the propagation loss. To improve the convergence of the proposed DRL-based algorithm, two algorithms are designed to initialize the digital and analog beamforming matrices using the alternating optimization technique. Simulation results show that our proposed scheme improves the coverage range of THz communications by 50\% compared with the benchmarks. Furthermore, our proposed DRL-based method is shown to be a state-of-the-art approach to solving the NP-hard beamforming problem, especially when the signals in RIS-assisted THz communication networks experience multiple hops.
    Learning Optimal Treatment Strategies for Sepsis Using Offline Reinforcement Learning in Continuous Space. (arXiv:2206.11190v1 [cs.LG])
    Sepsis is a leading cause of death in the ICU. It is a disease requiring complex interventions in a short period of time, but its optimal treatment strategy remains uncertain. Evidence suggests that the practices of currently used treatment strategies are problematic and may cause harm to patients. To address this decision problem, we propose a new medical decision model based on historical data to help clinicians recommend the best reference option for real-time treatment. Our model combines offline reinforcement learning with deep reinforcement learning to address the problem that traditional reinforcement learning in healthcare cannot interact with the environment, enabling our model to make decisions in a continuous state-action space. We demonstrate that, on average, the treatments recommended by the model are more valuable and reliable than those recommended by clinicians. In a large validation dataset, we found that patients whose actual doses from clinicians matched the AI's decisions had the lowest mortality rates. Our model provides personalized, clinically interpretable treatment decisions for sepsis that can improve patient care.
    General Univariate Estimation-of-Distribution Algorithms. (arXiv:2206.11198v1 [cs.NE])
We propose a general formulation of a univariate estimation-of-distribution algorithm (EDA). It naturally incorporates the three classic univariate EDAs \emph{compact genetic algorithm}, \emph{univariate marginal distribution algorithm} and \emph{population-based incremental learning}, as well as the \emph{max-min ant system} with iteration-best update. Our unified description of the existing algorithms allows a unified analysis of them; we demonstrate this by providing an analysis of genetic drift that immediately yields the existing results proven separately for the four algorithms named above. Our general model also includes EDAs that are more efficient than the existing ones, and such EDAs may not be difficult to find, as we demonstrate for the OneMax and LeadingOnes benchmarks.
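A minimal sketch of a univariate EDA in this general form, here instantiated with a PBIL-style frequency update on OneMax, plus the usual margins that guard against genetic drift:

```python
# Univariate EDA: sample from a product distribution, shift it toward the best.
import numpy as np

rng = np.random.default_rng(0)
n, pop, rho = 50, 20, 0.1
p = np.full(n, 0.5)                         # frequency vector over n bits

for _ in range(300):
    X = rng.random((pop, n)) < p            # sample a population of bit strings
    best = X[X.sum(axis=1).argmax()]        # OneMax fitness = number of ones
    p = (1 - rho) * p + rho * best          # PBIL-style update toward the best
    p = p.clip(1 / n, 1 - 1 / n)            # margins against genetic drift

print(p.round(2))
```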
    An Embedded Feature Selection Framework for Control. (arXiv:2206.11064v1 [cs.LG])
Reducing sensor requirements while keeping optimal control performance is crucial for many industrial control applications to achieve robust, low-cost, and computation-efficient controllers. However, existing feature selection solutions for the typical machine learning domain can hardly be applied in the domain of control with changing dynamics. In this paper, a novel framework, namely the Dual-world embedded Attentive Feature Selection (D-AFS), can efficiently select the most relevant sensors for the system under dynamic control. Rather than the one world used in most Deep Reinforcement Learning (DRL) algorithms, D-AFS has both the real world and its virtual peer with twisted features. By analyzing the DRL's response in the two worlds, D-AFS can quantitatively identify the respective features' importance for control. A well-known active flow control problem, cylinder drag reduction, is used for evaluation. Results show that D-AFS successfully finds an optimized five-probe layout achieving 18.7\% better drag reduction than the state-of-the-art solution with 151 probes, and 49.2\% better than a five-probe layout designed by human experts. We also apply this solution to four OpenAI classical control cases. In all cases, D-AFS achieves the same or better sensor configurations than the originally provided solutions. These results highlight, we argue, a new way to achieve efficient and optimal sensor designs for experimental or industrial systems. Our source code is made publicly available at https://github.com/G-AILab/DAFSFluid.
    Surfer100: Generating Surveys From Web Resources, Wikipedia-style. (arXiv:2112.06377v4 [cs.CL] UPDATED)
Fast-developing fields such as Artificial Intelligence (AI) often outpace the efforts of encyclopedic sources such as Wikipedia, which either do not completely cover recently-introduced topics or lack such content entirely. As a result, methods for automatically producing content are valuable tools to address this information overload. We show that recent advances in pretrained language modeling can be combined in a two-stage extractive and abstractive approach for Wikipedia lead paragraph generation. We extend this approach to generate longer Wikipedia-style summaries with sections and examine how such methods struggle in this application through detailed studies with 100 reference human-collected surveys. To the best of our knowledge, this is the first study on utilizing web resources for long Wikipedia-style summaries.
    Beyond Greedy Search: Tracking by Multi-Agent Reinforcement Learning-based Beam Search. (arXiv:2205.09676v2 [cs.CV] UPDATED)
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame; that is, the candidate region with the maximum response score is selected as the tracking result of each frame. However, we found that this may not be an optimal choice, especially when encountering challenging tracking scenarios such as heavy occlusion and fast motion. To address this issue, we propose to maintain multiple tracking trajectories and apply a beam search strategy for visual tracking, so that the trajectory with fewer accumulated errors can be identified. Accordingly, this paper introduces a novel multi-agent reinforcement learning based beam search tracking strategy, termed BeamTracking. It is mainly inspired by the image captioning task, which takes an image as input and generates diverse descriptions using a beam search algorithm. Specifically, we formulate tracking as a sample selection problem fulfilled by multiple parallel decision-making processes, each of which aims at picking out one sample as its tracking result in each frame. Each maintained trajectory is associated with an agent that performs the decision-making and determines what actions should be taken to update related information. When all the frames are processed, we select the trajectory with the maximum accumulated score as the tracking result. Extensive experiments on seven popular tracking benchmark datasets validate the effectiveness of the proposed algorithm.
    Robust fine-tuning of zero-shot models. (arXiv:2109.01903v3 [cs.CV] UPDATED)
    Large pre-trained models such as CLIP or ALIGN offer consistent accuracy across a range of data distributions when performing zero-shot inference (i.e., without fine-tuning on a specific dataset). Although existing fine-tuning methods substantially improve accuracy on a given target distribution, they often reduce robustness to distribution shifts. We address this tension by introducing a simple and effective method for improving robustness while fine-tuning: ensembling the weights of the zero-shot and fine-tuned models (WiSE-FT). Compared to standard fine-tuning, WiSE-FT provides large accuracy improvements under distribution shift, while preserving high accuracy on the target distribution. On ImageNet and five derived distribution shifts, WiSE-FT improves accuracy under distribution shift by 4 to 6 percentage points (pp) over prior work while increasing ImageNet accuracy by 1.6 pp. WiSE-FT achieves similarly large robustness gains (2 to 23 pp) on a diverse set of six further distribution shifts, and accuracy gains of 0.8 to 3.3 pp compared to standard fine-tuning on seven commonly used transfer learning datasets. These improvements come at no additional computational cost during fine-tuning or inference.
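The core operation, ensembling the weights of the zero-shot and fine-tuned models, reduces to an interpolation of state dicts; a minimal sketch on toy checkpoints:

```python
# WiSE-FT-style weight-space ensembling of two checkpoints.
import torch

def wise_ft(zero_shot_state, fine_tuned_state, alpha=0.5):
    """Interpolate two state dicts with matching keys and shapes."""
    return {k: (1 - alpha) * zero_shot_state[k] + alpha * fine_tuned_state[k]
            for k in zero_shot_state}

model = torch.nn.Linear(4, 2)
zs = {k: v.clone() for k, v in model.state_dict().items()}
ft = {k: v + 0.1 for k, v in zs.items()}    # stand-in for fine-tuned weights
model.load_state_dict(wise_ft(zs, ft, alpha=0.5))
```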
    Graph Ordering Attention Networks. (arXiv:2204.05351v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been successfully used in many problems involving graph-structured data, achieving state-of-the-art performance. GNNs typically employ a message-passing scheme, in which every node aggregates information from its neighbors using a permutation-invariant aggregation function. Standard well-examined choices such as the mean or sum aggregation functions have limited capabilities, as they are not able to capture interactions among neighbors. In this work, we formalize these interactions using an information-theoretic framework that notably includes synergistic information. Driven by this definition, we introduce the Graph Ordering Attention (GOAT) layer, a novel GNN component that captures interactions between nodes in a neighborhood. This is achieved by learning local node orderings via an attention mechanism and processing the ordered representations using a recurrent neural network aggregator. This design allows us to make use of a permutation-sensitive aggregator while maintaining the permutation-equivariance of the proposed GOAT layer. The GOAT model demonstrates its increased performance in modeling graph metrics that capture complex information, such as the betweenness centrality and the effective size of a node. In practical use-cases, its superior modeling capability is confirmed through its success in several real-world node classification benchmarks.
    EXACT: How to Train Your Accuracy. (arXiv:2205.09615v2 [cs.LG] UPDATED)
Classification tasks are usually evaluated in terms of accuracy. However, accuracy is discontinuous and cannot be directly optimized using gradient ascent. Popular methods minimize cross-entropy, hinge loss, or other surrogate losses, which can lead to suboptimal results. In this paper, we propose a new optimization framework that introduces stochasticity to a model's output and optimizes expected accuracy, i.e., the accuracy of the stochastic model. Extensive experiments on image classification show that the proposed optimization method is a powerful alternative to widely used classification losses.
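A sketch of the principle: perturb the logits with Gaussian noise and maximize a Monte-Carlo estimate of expected accuracy, here using a low-temperature softmax as a differentiable stand-in for the argmax indicator. This illustrates the idea, not the paper's exact estimator:

```python
# Monte-Carlo estimate of expected accuracy under stochastic outputs.
import torch

def expected_accuracy(logits, y, sigma=0.5, samples=32, tau=0.1):
    noise = sigma * torch.randn(samples, *logits.shape)
    probs = torch.softmax((logits + noise) / tau, dim=-1)  # ~ one-hot argmax
    return probs[:, torch.arange(len(y)), y].mean()

logits = torch.randn(8, 10, requires_grad=True)
y = torch.randint(0, 10, (8,))
loss = -expected_accuracy(logits, y)        # maximize expected accuracy
loss.backward()
print(logits.grad.shape)
```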
    Discriminative Bayesian filtering lends momentum to the stochastic Newton method for minimizing log-convex functions. (arXiv:2104.12949v2 [stat.ML] UPDATED)
    To minimize the average of a set of log-convex functions, the stochastic Newton method iteratively updates its estimate using subsampled versions of the full objective's gradient and Hessian. We contextualize this optimization problem as sequential Bayesian inference on a latent state-space model with a discriminatively-specified observation process. Applying Bayesian filtering then yields a novel optimization algorithm that considers the entire history of gradients and Hessians when forming an update. We establish matrix-based conditions under which the effect of older observations diminishes over time, in a manner analogous to Polyak's heavy ball momentum. We illustrate various aspects of our approach with an example and review other relevant innovations for the stochastic Newton method.
    Predicting Human Performance in Vertical Hierarchical Menu Selection in Immersive AR Using Hand-gesture and Head-gaze. (arXiv:2206.09480v1 [cs.HC] CROSS LISTED)
    There are currently limited guidelines on designing user interfaces (UI) for immersive augmented reality (AR) applications. Designers must reflect on their experience designing UI for desktop and mobile applications and conjecture how a UI will influence AR users' performance. In this work, we introduce a predictive model for determining users' performance for a target UI without the subsequent involvement of participants in user studies. The model is trained on participants' responses to objective performance measures such as consumed endurance (CE) and pointing time (PT) using hierarchical drop-down menus. Large variability in the depth and context of the menus is ensured by randomly and dynamically creating the hierarchical drop-down menus and associated user tasks from words contained in the lexical database WordNet. Subjective performance bias is reduced by incorporating the users' non-verbal standard performance WAIS-IV during the model training. The semantic information of the menu is encoded using the Universal Sentence Encoder. We present the results of a user study that demonstrates that the proposed predictive model achieves high accuracy in predicting the CE on hierarchical menus of users with various cognitive abilities. To the best of our knowledge, this is the first work on predicting CE in designing UI for immersive AR applications.
    Intuitive Shape Editing in Latent Space. (arXiv:2111.12488v3 [cs.CV] UPDATED)
    The use of autoencoders for shape editing or generation through latent space manipulation suffers from unpredictable changes in the output shape. Our autoencoder-based method enables intuitive shape editing in latent space by disentangling latent sub-spaces into style variables and control points on the surface that can be manipulated independently. The key idea is adding a Lipschitz-type constraint to the loss function, i.e. bounding the change of the output shape proportionally to the change in latent space, leading to interpretable latent space representations. The control points on the surface that are part of the latent code of an object can then be freely moved, allowing for intuitive shape editing directly in latent space. We evaluate our method by comparing to state-of-the-art data-driven shape editing methods. We further demonstrate the expressiveness of our learned latent space by leveraging it for unsupervised part segmentation.
    Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks. (arXiv:2206.11081v1 [cs.LG])
Heterogeneous graph neural networks (GNNs) achieve strong performance on node classification tasks in a semi-supervised learning setting. However, as in the simpler homogeneous GNN case, message-passing-based heterogeneous GNNs may struggle to balance between resisting the oversmoothing occurring in deep models and capturing long-range dependencies in graph-structured data. Moreover, the complexity of this trade-off is compounded in the heterogeneous graph case due to the disparate heterophily relationships between nodes of different types. To address these issues, we propose a novel heterogeneous GNN architecture in which layers are derived from optimization steps that descend a novel relation-aware energy function. The corresponding minimizer is fully differentiable with respect to the energy function parameters, such that bilevel optimization can be applied to effectively learn a functional form whose minimum provides optimal node representations for subsequent classification tasks. In particular, this methodology allows us to model diverse heterophily relationships between different node types while avoiding oversmoothing effects. Experimental results on 8 heterogeneous graph benchmarks demonstrate that our proposed method can achieve competitive node classification accuracy.
    Correct and Certify: A New Approach to Self-Supervised 3D-Object Perception. (arXiv:2206.11215v1 [cs.CV])
    We consider an object pose estimation and model fitting problem, where - given a partial point cloud of an object - the goal is to estimate the object pose by fitting a CAD model to the sensor data. We solve this problem by combining (i) a semantic keypoint-based pose estimation model, (ii) a novel self-supervised training approach, and (iii) a certification procedure, that not only verifies whether the output produced by the model is correct or not, but also flags uniqueness of the produced solution. The semantic keypoint detector model is initially trained in simulation and does not perform well on real-data due to the domain gap. Our self-supervised training procedure uses a corrector and a certification module to improve the detector. The corrector module corrects the detected keypoints to compensate for the domain gap, and is implemented as a declarative layer, for which we develop a simple differentiation rule. The certification module declares whether the corrected output produced by the model is certifiable (i.e. correct) or not. At each iteration, the approach optimizes over the loss induced only by the certifiable input-output pairs. As training progresses, we see that the fraction of outputs that are certifiable increases, eventually reaching near $100\%$ in many cases. We also introduce the notion of strong certifiability wherein the model can determine if the predicted object model fit is unique or not. The detected semantic keypoints help us implement this in the forward pass. We conduct extensive experiments to evaluate the performance of the corrector, the certification, and the proposed self-supervised training using the ShapeNet and YCB datasets, and show the proposed approach achieves performance comparable to fully supervised baselines while not requiring pose or keypoint supervision on real data.
    Margin Calibration for Long-Tailed Visual Recognition. (arXiv:2112.07225v4 [cs.CV] UPDATED)
    The long-tailed class distribution in visual recognition tasks poses great challenges for neural networks on how to handle the biased predictions between head and tail classes, i.e., the model tends to classify tail classes as head classes. While existing research focused on data resampling and loss function engineering, in this paper, we take a different perspective: the classification margins. We study the relationship between the margins and logits (classification scores) and empirically observe the biased margins and the biased logits are positively correlated. We propose MARC, a simple yet effective MARgin Calibration function to dynamically calibrate the biased margins for unbiased logits. We validate MARC through extensive experiments on common long-tailed benchmarks including CIFAR-LT, ImageNet-LT, Places-LT, and iNaturalist-LT. Experimental results demonstrate that our MARC achieves favorable results on these benchmarks. In addition, MARC is extremely easy to implement with just three lines of code. We hope this simple method will motivate people to rethink the biased margins and biased logits in long-tailed visual recognition.
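The abstract does not spell the calibration function out, so the following is only one plausible reading of a post hoc margin calibration: a learnable per-class scale and shift applied to the frozen model's logits:

```python
# Hypothetical per-class calibration of logits (not necessarily MARC itself).
import torch

class MarginCalibration(torch.nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(num_classes))
        self.shift = torch.nn.Parameter(torch.zeros(num_classes))

    def forward(self, logits):              # the "few lines" of the idea
        return self.scale * logits + self.shift

calib = MarginCalibration(num_classes=100)
print(calib(torch.randn(4, 100)).shape)
```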
    Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles. (arXiv:2206.11184v1 [cs.CL])
    Linking neural representations to linguistic factors is crucial in order to build and analyze NLP models interpretable by humans. Among these factors, syntactic roles (e.g. subjects, direct objects,$\dots$) and their realizations are essential markers since they can be understood as a decomposition of predicative structures and thus the meaning of sentences. Starting from a deep probabilistic generative model with attention, we measure the interaction between latent variables and realizations of syntactic roles and show that it is possible to obtain, without supervision, representations of sentences where different syntactic roles correspond to clearly identified different latent variables. The probabilistic model we propose is an Attention-Driven Variational Autoencoder (ADVAE). Drawing inspiration from Transformer-based machine translation models, ADVAEs enable the analysis of the interactions between latent variables and input tokens through attention. We also develop an evaluation protocol to measure disentanglement with regard to the realizations of syntactic roles. This protocol is based on attention maxima for the encoder and on latent variable perturbations for the decoder. Our experiments on raw English text from the SNLI dataset show that $\textit{i)}$ disentanglement of syntactic roles can be induced without supervision, $\textit{ii)}$ ADVAE separates syntactic roles better than classical sequence VAEs and Transformer VAEs, $\textit{iii)}$ realizations of syntactic roles can be separately modified in sentences by mere intervention on the associated latent variables. Our work constitutes a first step towards unsupervised controllable content generation. The code for our work is publicly available.
    Model-Based Deep Learning: On the Intersection of Deep Learning and Optimization. (arXiv:2205.02640v2 [eess.SP] UPDATED)
    Decision making algorithms are used in a multitude of different applications. Conventional approaches for designing decision algorithms employ principled and simplified modelling, based on which one can determine decisions via tractable optimization. More recently, deep learning approaches that use highly parametric architectures tuned from data without relying on mathematical models, are becoming increasingly popular. Model-based optimization and data-centric deep learning are often considered to be distinct disciplines. Here, we characterize them as edges of a continuous spectrum varying in specificity and parameterization, and provide a tutorial-style presentation to the methodologies lying in the middle ground of this spectrum, referred to as model-based deep learning. We accompany our presentation with running examples in super-resolution and stochastic control, and show how they are expressed using the provided characterization and specialized in each of the detailed methodologies. The gains of combining model-based optimization and deep learning are demonstrated using experimental results in various applications, ranging from biomedical imaging to digital communications.
    Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer. (arXiv:2206.11053v1 [cs.CV])
Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and are often overloaded with clinical and academic workloads. This overload often limits their time for answering questionnaires from patients, medical students or junior residents related to surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures have been made available for them to observe and improve their skills, they still hugely rely on medical experts to answer their questions. Having a Surgical-VQA system as a reliable 'second opinion' could act as a backup and ease the load on the medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms has limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene. Extending the MICCAI endoscopic vision challenge 2018 dataset and workflow recognition dataset further, we introduce two Surgical-VQA datasets with classification and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance in classification-based answering. Furthermore, we study the influence of the number of input image patches and temporal visual features on the model performance in both classification and sentence-based answering.
    Information Geometry of Dropout Training. (arXiv:2206.10936v1 [stat.ML])
Dropout is one of the most popular regularization techniques in neural network training. Because of its power and the simplicity of its idea, dropout has been analyzed extensively and many variants have been proposed. In this paper, several properties of dropout are discussed in a unified manner from the viewpoint of information geometry. We show that dropout flattens the model manifold and that its regularization performance depends on the amount of curvature. We then show that dropout essentially corresponds to a regularization that depends on the Fisher information, and we support this result with numerical experiments. Such a theoretical analysis of the technique from a different perspective is expected to greatly assist in the understanding of neural networks, which are still in their infancy.
    World of Bugs: A Platform for Automated Bug Detection in 3D Video Games. (arXiv:2206.11037v1 [cs.SE])
    We present World of Bugs (WOB), an open platform that aims to support Automated Bug Detection (ABD) research in video games. We discuss some open problems in ABD and how they relate to the platform's design, arguing that learning-based solutions are required if further progress is to be made. The platform's key feature is a growing collection of common video game bugs that may be used for training and evaluating ABD approaches.
    Multi-task twin support vector machine with Universum data. (arXiv:2206.10978v1 [cs.LG])
Multi-task learning (MTL) has emerged as a promising topic of machine learning in recent years, aiming to enhance the performance of numerous related learning tasks by exploiting beneficial information. During the training phase, most of the existing multi-task learning models concentrate entirely on the target task data and ignore the non-target task data contained in the target tasks. To address this issue, Universum data, which do not correspond to any class of a classification problem, may be used as prior knowledge in the training model. This study addresses the challenge of multi-task learning with Universum data, employing non-target task data to achieve better performance. It proposes a multi-task twin support vector machine with Universum data (UMTSVM) and provides two approaches to its solution. The first approach takes into account the dual formulation of UMTSVM and solves a quadratic programming problem. The second approach formulates a least-squares version of UMTSVM, referred to as LS-UMTSVM, to further increase the generalization performance. The solution of the two primal problems in LS-UMTSVM reduces to solving just two systems of linear equations, resulting in a remarkably simple and fast approach. Numerical experiments on several popular multi-task data sets and medical data sets demonstrate the efficiency of the proposed methods.
    Auto-Encoding Adversarial Imitation Learning. (arXiv:2206.11004v1 [cs.LG])
Reinforcement learning (RL) provides a powerful framework for decision-making, but its application in practice often requires a carefully designed reward function. Adversarial Imitation Learning (AIL) sheds light on automatic policy acquisition without access to the reward signal from the environment. In this work, we propose Auto-Encoding Adversarial Imitation Learning (AEAIL), a robust and scalable AIL framework. To induce expert policies from demonstrations, AEAIL utilizes the reconstruction error of an auto-encoder as a reward signal, which provides more information for optimizing policies than the prior discriminator-based ones. Subsequently, we use the derived objective functions to train the auto-encoder and the agent policy. Experiments show that AEAIL performs better than state-of-the-art methods in the MuJoCo environments. More importantly, AEAIL shows much better robustness when the expert demonstrations are noisy. Specifically, our method achieves $16.4\%$ and $47.2\%$ relative improvement overall compared to the best baselines FAIRL and PWIL on clean and noisy expert data, respectively. Video results, open-source code and datasets are available at https://sites.google.com/view/auto-encoding-imitation.
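A minimal sketch of the reward signal described: an auto-encoder over state-action pairs whose reconstruction error defines the reward handed to the policy. The sign convention and architecture are our assumptions, and the adversarial training loop is omitted:

```python
# Auto-encoder reconstruction error as an imitation reward.
import torch

class AEReward(torch.nn.Module):
    def __init__(self, sa_dim, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(sa_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, sa_dim))

    def forward(self, sa):                  # per-sample reconstruction error
        return ((self.net(sa) - sa) ** 2).mean(dim=-1)

    def reward(self, sa):                   # expert-like pairs -> low error -> high reward
        return -self(sa)

ae = AEReward(sa_dim=20)
print(ae.reward(torch.randn(5, 20)))
```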
    Fighting Fire with Fire: Avoiding DNN Shortcuts through Priming. (arXiv:2206.10816v1 [cs.LG])
    Across applications spanning supervised classification and sequential control, deep learning has been reported to find "shortcut" solutions that fail catastrophically under minor changes in the data distribution. In this paper, we show empirically that DNNs can be coaxed to avoid poor shortcuts by providing an additional "priming" feature computed from key input features, usually a coarse output estimate. Priming relies on approximate domain knowledge of these task-relevant key input features, which is often easy to obtain in practical settings. For example, one might prioritize recent frames over past frames in a video input for visual imitation learning, or salient foreground over background pixels for image classification. On NICO image classification, MuJoCo continuous control, and CARLA autonomous driving, our priming strategy works significantly better than several popular state-of-the-art approaches for feature selection and data augmentation. We connect these empirical findings to recent theoretical results on DNN optimization, and argue theoretically that priming distracts the optimizer away from poor shortcuts by creating better, simpler shortcuts.
    AI-based software for lung nodule detection in chest X-rays -- Time for a second reader approach?. (arXiv:2206.10912v1 [eess.IV])
    Objectives: To compare artificial intelligence (AI) as a second reader in detecting lung nodules on chest X-rays (CXR) versus radiologists of two binational institutions, and to evaluate AI performance when using two different modes: automated versus assisted (additional remote radiologist review). Methods: The CXR public database (n = 247) of the Japanese Society of Radiological Technology with various types and sizes of lung nodules was analyzed. Eight radiologists evaluated the CXR images with regard to the presence of lung nodules and nodule conspicuity. After radiologist review, the AI software processed and flagged the CXR with the highest probability of missed nodules. The calculated accuracy metrics were the area under the curve (AUC), sensitivity, specificity, F1 score, false negative case number (FN), and the effect of different AI modes (automated/assisted) on the accuracy of nodule detection. Results: For radiologists, the average AUC value was 0.77 $\pm$ 0.07, while the average FN was 52.63 $\pm$ 17.53 (all studies) and 32 $\pm$ 11.59 (studies containing a nodule of malignant etiology = 32% rate of missed malignant nodules). Both AI modes -- automated and assisted -- produced an average increase in sensitivity (by 14% and 12%) and of F1-score (5% and 6%) and a decrease in specificity (by 10% and 3%, respectively). Conclusions: Both AI modes flagged the pulmonary nodules missed by radiologists in a significant number of cases. AI as a second reader has a high potential to improve diagnostic accuracy and radiology workflow. AI might detect certain pulmonary nodules earlier than radiologists, with a potentially significant impact on patient outcomes.
    A Study on the Evaluation of Generative Models. (arXiv:2206.10935v1 [cs.LG])
    Implicit generative models, which do not return likelihood values, such as generative adversarial networks and diffusion models, have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance to push research forward and to identify meaningful gains over random noise. Currently, heuristic metrics such as the Inception score (IS) and Frechet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear. Additionally, there are questions regarding how meaningful their scores actually are. In this work, we study the evaluation metrics of generative models by generating a high-quality synthetic dataset on which we can estimate classical metrics for comparison. Our study shows that while FID and IS do correlate with several f-divergences, their ranking of close models can vary considerably, making them problematic for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics. Lastly, we look into the base features used for metrics such as FID.
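    For reference, FID compares Gaussians fitted to feature embeddings of real and generated images via the closed-form Frechet distance $\|\mu_1-\mu_2\|^2 + \mathrm{Tr}(C_1 + C_2 - 2(C_1 C_2)^{1/2})$. A minimal sketch, assuming features have already been extracted by some embedding network:

        import numpy as np
        from scipy import linalg

        def frechet_distance(feats_a, feats_b):
            mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
            cov_a = np.cov(feats_a, rowvar=False)
            cov_b = np.cov(feats_b, rowvar=False)
            covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
            if np.iscomplexobj(covmean):  # drop tiny imaginary numerical noise
                covmean = covmean.real
            diff = mu_a - mu_b
            return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

        rng = np.random.default_rng(0)
        real = rng.normal(size=(500, 64))           # stand-ins for embedded images
        fake = rng.normal(0.1, 1.0, size=(500, 64))
        print(frechet_distance(real, fake))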
    FairGrad: Fairness Aware Gradient Descent. (arXiv:2206.10923v1 [cs.LG])
    We tackle the problem of group fairness in classification, where the objective is to learn models that do not unjustly discriminate against subgroups of the population. Most existing approaches are limited to simple binary tasks or involve difficult-to-implement training mechanisms, which reduces their practical applicability. In this paper, we propose FairGrad, a method to enforce fairness based on a reweighting scheme that iteratively learns group-specific weights based on whether groups are advantaged or not. FairGrad is easy to implement and can accommodate various standard fairness definitions. Furthermore, we show that it is comparable to standard baselines over various datasets, including ones used in natural language processing and computer vision.
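    A hedged sketch of what such a reweighting loop could look like: per-group weights are nudged up when a group's loss exceeds the average (i.e., the group is disadvantaged) and renormalized. The update rule, step size, and clamping below are our own illustrative choices, not necessarily FairGrad's exact scheme.

        import torch

        def reweighted_step(model, opt, x, y, groups, weights, eta=0.1):
            logits = model(x).squeeze(-1)
            losses = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, y.float(), reduction="none")
            loss = (weights[groups] * losses).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                for g in range(weights.numel()):
                    mask = groups == g
                    if mask.any():  # disadvantaged groups (higher loss) gain weight
                        weights[g] = torch.clamp(
                            weights[g] + eta * (losses[mask].mean() - losses.mean()),
                            min=0.0)
                weights *= weights.numel() / weights.sum()  # keep mean weight at one
            return loss.item()

        model = torch.nn.Linear(5, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.05)
        x, y = torch.randn(64, 5), torch.randint(0, 2, (64,))
        groups = torch.randint(0, 2, (64,))
        weights = torch.ones(2)
        print(reweighted_step(model, opt, x, y, groups, weights))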
    Automated GI tract segmentation using deep learning. (arXiv:2206.11048v1 [eess.IV])
    Radiation oncologists deliver X-ray beams pointed toward the tumor while avoiding the stomach and intestines. With MR-Linacs (magnetic resonance imaging and linear accelerator systems), oncologists can visualize the position of the tumor and deliver a precise dose according to tumor cell presence, which can vary from day to day. Currently, outlining the position of the stomach and intestines so that the X-ray beam direction can be adjusted to deliver the dose to the tumor while avoiding these organs is done manually. This is a time-consuming and labor-intensive process that can easily prolong treatments from 15 minutes to an hour a day unless deep learning methods automate the segmentation. This paper discusses an automated segmentation process using deep learning to make this process faster and allow more patients to get effective treatment.
    Defect Prediction Using Stylistic Metrics. (arXiv:2206.10959v1 [cs.SE])
    Defect prediction is one of the most popular research topics due to its potential to minimize software quality assurance efforts. Existing approaches have examined defect prediction from various perspectives, such as complexity and developer metrics. However, none of these consider programming style for defect prediction. This paper analyzes the impact of stylistic metrics on both within-project and cross-project defect prediction. For prediction, four widely used machine learning algorithms, namely Naive Bayes, Support Vector Machine, Decision Tree and Logistic Regression, are used. The experiment is conducted on 14 releases of 5 popular, open-source projects. F1, Precision and Recall are inspected to evaluate the results. Results reveal that stylistic metrics are a good predictor of defects.
    Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices. (arXiv:2206.10844v1 [cs.LG])
    Federated Learning (FL) is a machine learning paradigm to distributively learn machine learning models from decentralized data that remains on-device. Despite the success of standard federated optimization methods in FL, such as Federated Averaging (FedAvg), the energy demands and hardware-induced constraints of on-device learning have not been considered sufficiently in the literature. Specifically, an essential demand for on-device learning is to enable trained models to be quantized to various bit-widths based on the energy needs and heterogeneous hardware designs across the federation. In this work, we introduce multiple variants of the federated averaging algorithm that train neural networks robust to quantization. Such networks can be quantized to various bit-widths with only a limited reduction in full-precision model accuracy. We perform extensive experiments on standard FL benchmarks to evaluate our proposed FedAvg variants for quantization robustness and provide a convergence analysis for our Quantization-Aware variants in FL. Our results demonstrate that integrating quantization robustness results in FL models that are significantly more robust to different bit-widths during quantized on-device inference.
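    For intuition, the sketch below shows the uniform symmetric weight quantization that such bit-width experiments typically rely on; training for robustness would apply a step like this (e.g., with a straight-through estimator) during local client updates. The exact scheme here is an assumption rather than the paper's.

        import torch

        def quantize_symmetric(w, bits):
            qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
            scale = w.abs().max() / qmax
            return torch.round(w / scale).clamp(-qmax, qmax) * scale

        w = torch.randn(4, 4)
        for b in (8, 4, 2):
            err = (w - quantize_symmetric(w, b)).abs().max().item()
            print(f"{b}-bit quantization, max error {err:.4f}")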
    How to Combine Variational Bayesian Networks in Federated Learning. (arXiv:2206.10897v1 [cs.LG])
    Federated Learning enables multiple data centers to train a central model collaboratively without exposing any confidential data. Even though deterministic models can achieve high prediction accuracy, their lack of calibration and inability to quantify uncertainty is problematic for safety-critical applications. In contrast to deterministic models, probabilistic models such as Bayesian neural networks are relatively well-calibrated and able to quantify uncertainty alongside competitive prediction accuracy. Both approaches appear in the federated learning framework; however, the aggregation scheme of deterministic models cannot be directly applied to probabilistic models, since weights correspond to distributions instead of point estimates. In this work, we study the effects of various aggregation schemes for variational Bayesian neural networks. With empirical results on three image classification datasets, we observe that the degree of spread of an aggregated distribution is a significant factor in the learning process. Hence, we investigate how to combine variational Bayesian networks in federated learning, and provide benchmarks for different aggregation settings.
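    To see why the aggregation scheme controls the spread of the result, the sketch below contrasts two common baselines for combining clients' Gaussian weight posteriors; these particular schemes are standard choices and not necessarily the ones evaluated in the paper.

        import numpy as np

        def average_of_parameters(mus, sigmas):
            # naive scheme: average means and variances independently
            return np.mean(mus, axis=0), np.sqrt(np.mean(sigmas ** 2, axis=0))

        def product_of_gaussians(mus, sigmas):
            # multiply the posteriors: precision-weighted mean, much tighter spread
            prec = 1.0 / sigmas ** 2
            var = 1.0 / prec.sum(axis=0)
            return var * (prec * mus).sum(axis=0), np.sqrt(var)

        mus = np.array([[0.0], [1.0], [2.0]])     # three clients, one weight each
        sigmas = np.array([[1.0], [1.0], [1.0]])
        print(average_of_parameters(mus, sigmas))  # mean 1.0, sigma 1.0
        print(product_of_gaussians(mus, sigmas))   # mean 1.0, sigma ~0.58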
    List-Decodable Covariance Estimation. (arXiv:2206.10942v1 [cs.DS])
    We give the first polynomial time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n\geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha)= (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu},\hat{\Sigma})$ such that the total variation distance $TV(\mathcal{N}(\mu_*,\Sigma_*),\mathcal{N}(\hat{\mu},\hat{\Sigma}))<1-O_{\alpha}(1)$. This is the statistically strongest notion of distance and implies multiplicative spectral and relative Frobenius distance approximation for parameters with dimension-independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree-2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery, due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020) and Bakshi and Kothari (2020). These results need superpolynomial time to obtain any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery that, in particular, obtains $2^{-\mathsf{poly}(d)}$ error in polynomial time. Our result also implies an improved algorithm for clustering non-spherical mixtures.
    Neural Networks as Paths through the Space of Representations. (arXiv:2206.10999v1 [cs.LG])
    Deep neural networks implement a sequence of layer-by-layer operations that are each relatively easy to understand, but the resulting overall computation is generally difficult to understand. We develop a simple idea for interpreting the layer-by-layer construction of useful representations: the role of each layer is to reformat information to reduce the "distance" to the target outputs. We formalize this intuitive idea of "distance" by leveraging recent work on metric representational similarity, and show how it leads to a rich space of geometric concepts. With this framework, the layer-wise computation implemented by a deep neural network can be viewed as a path in a high-dimensional representation space. We develop tools to characterize the geometry of these paths in terms of distances, angles, and geodesics. We then ask three sets of questions of residual networks trained on CIFAR-10: (1) how straight are the paths, and how does each layer contribute towards the target? (2) how do these properties emerge over training? and (3) how similar are the paths taken by wider versus deeper networks? We conclude by sketching additional ways that this kind of representational geometry can be used to understand and interpret network training, or to prescriptively improve network architectures to suit a task.
    ROSE: A RObust and SEcure DNN Watermarking. (arXiv:2206.11024v1 [cs.CR])
    Protecting the Intellectual Property rights of DNN models is of primary importance prior to their deployment. So far, the proposed methods either necessitate changes to internal model parameters or the machine learning pipeline, or fail to meet both the security and robustness requirements. This paper proposes a lightweight, robust, and secure black-box DNN watermarking protocol that takes advantage of cryptographic one-way functions as well as the injection of in-task key image-label pairs during the training process. These pairs are later used to prove DNN model ownership during testing. The main feature is that the value of the proof and its security are measurable. Extensive experiments watermarking image classification models on various datasets, as well as exposing them to a variety of attacks, show that the protocol provides protection while maintaining an adequate level of security and robustness.
    Guided Diffusion Model for Adversarial Purification from Random Noise. (arXiv:2206.10875v1 [cs.LG])
    In this paper, we propose a novel guided diffusion purification approach to provide a strong defense against adversarial attacks. Our model achieves 89.62% robust accuracy under PGD-$L_\infty$ attack ($\epsilon = 8/255$) on the CIFAR-10 dataset. We first explore the essential correlations between unguided diffusion models and randomized smoothing, enabling us to apply the models to certified robustness. The empirical results show that our models outperform randomized smoothing by 5% when the certified $L_2$ radius $r$ is larger than 0.5.
    S2TNet: Spatio-Temporal Transformer Networks for Trajectory Prediction in Autonomous Driving. (arXiv:2206.10902v1 [cs.CV])
    To safely and rationally participate in dense and heterogeneous traffic, autonomous vehicles need to sufficiently analyze the motion patterns of surrounding traffic-agents and accurately predict their future trajectories. This is challenging because the trajectories of traffic-agents are influenced not only by the traffic-agents themselves but also by their spatial interactions with each other. Previous methods usually rely on the sequential step-by-step processing of Long Short-Term Memory networks (LSTMs) and merely extract the interactions between spatial neighbors for single-type traffic-agents. We propose the Spatio-Temporal Transformer Networks (S2TNet), which model the spatio-temporal interactions with a spatio-temporal Transformer and handle the temporal sequences with a temporal Transformer. We input additional category, shape and heading information into our networks to handle the heterogeneity of traffic-agents. The proposed method outperforms state-of-the-art methods on the ApolloScape Trajectory dataset by more than 7\% on both the weighted sum of Average and Final Displacement Error. Our code is available at https://github.com/chenghuang66/s2tnet.
    Optical Flow Regularization of Implicit Neural Representations for Video Frame Interpolation. (arXiv:2206.10886v1 [cs.CV])
    Recent works have shown the ability of Implicit Neural Representations (INR) to carry meaningful representations of signal derivatives. In this work, we leverage this property to perform Video Frame Interpolation (VFI) by explicitly constraining the derivatives of the INR to satisfy the optical flow constraint equation. We achieve state-of-the-art VFI on limited motion ranges using only a target video and its optical flow, without learning the interpolation operator from additional training data. We further show that constraining the INR derivatives not only allows better interpolation of intermediate frames but also improves the ability of narrow networks to fit the observed frames, which suggests potential applications to video compression and INR optimization.
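    A minimal sketch of the core constraint: with an INR $f(x, y, t)$, autograd gives the spatial and temporal derivatives, and the optical flow constraint $I_t + u I_x + v I_y = 0$ becomes a penalty. The tiny MLP and the constant stand-in flow below are purely illustrative.

        import torch

        inr = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(),
                                  torch.nn.Linear(64, 1))  # f(x, y, t) -> intensity

        def flow_constraint_loss(coords, flow_uv):
            coords = coords.requires_grad_(True)  # columns: (x, y, t)
            intensity = inr(coords)
            grads, = torch.autograd.grad(intensity.sum(), coords, create_graph=True)
            ix, iy, it = grads[:, 0], grads[:, 1], grads[:, 2]
            u, v = flow_uv[:, 0], flow_uv[:, 1]
            return ((it + u * ix + v * iy) ** 2).mean()  # optical flow residual

        coords = torch.rand(256, 3)
        flow = 0.1 * torch.ones(256, 2)  # stand-in for a precomputed optical flow
        print(flow_constraint_loss(coords, flow))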
    Traffic Congestion Prediction Using Machine Learning Techniques. (arXiv:2206.10983v1 [cs.LG])
    The prediction of traffic congestion can serve a crucial role in making future decisions. Although many studies have been conducted regarding congestion, most could not cover all the important factors (e.g., weather conditions). We propose a prediction model for traffic congestion that can predict congestion based on day, time and several weather data (e.g., temperature, humidity). To evaluate our model, it has been tested against the traffic data of New Delhi. With this model, congestion of a road can be predicted one week ahead with an average RMSE of 1.12. Therefore, this model can be used to take preventive measures beforehand.
    Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models. (arXiv:2206.02246v2 [cs.SD] UPDATED)
    We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training. The method requires a short (~3 seconds) sample from the target person, and generation is steered at inference time, without any training steps. At the heart of the method lies a sampling process that combines the estimation of the denoising model with a low-pass version of the new speaker's sample. The objective and subjective evaluations show that our sampling method can generate a voice similar to that of the target speaker in terms of frequency, with an accuracy comparable to state-of-the-art methods, and without training.
    On the Impossibility of Learning to Cooperate with Adaptive Partner Strategies in Repeated Games. (arXiv:2206.10614v1 [cs.GT])
    Learning to cooperate with other agents is challenging when those agents also possess the ability to adapt to our own behavior. Practical and theoretical approaches to learning in cooperative settings typically assume that other agents' behaviors are stationary, or else make very specific assumptions about other agents' learning processes. The goal of this work is to understand whether we can reliably learn to cooperate with other agents without such restrictive assumptions, which are unlikely to hold in real-world applications. Our main contribution is a set of impossibility results, which show that no learning algorithm can reliably learn to cooperate with all possible adaptive partners in a repeated matrix game, even if that partner is guaranteed to cooperate with some stationary strategy. Motivated by these results, we then discuss potential alternative assumptions which capture the idea that an adaptive partner will only adapt rationally to our behavior.
    Learning Neuro-Symbolic Skills for Bilevel Planning. (arXiv:2206.10680v1 [cs.RO])
    Decision-making is challenging in robotics environments with continuous object-centric states, continuous actions, long horizons, and sparse feedback. Hierarchical approaches, such as task and motion planning (TAMP), address these challenges by decomposing decision-making into two or more levels of abstraction. In a setting where demonstrations and symbolic predicates are given, prior work has shown how to learn symbolic operators and neural samplers for TAMP with manually designed parameterized policies. Our main contribution is a method for learning parameterized policies in combination with operators and samplers. These components are packaged into modular neuro-symbolic skills and sequenced together with search-then-sample TAMP to solve new tasks. In experiments in four robotics domains, we show that our approach -- bilevel planning with neuro-symbolic skills -- can solve a wide range of tasks with varying initial states, goals, and objects, outperforming six baselines and ablations. Video: https://youtu.be/PbFZP8rPuGg Code: https://tinyurl.com/skill-learning
    Generational Differences in Automobility: Comparing America's Millennials and Gen Xers Using Gradient Boosting Decision Trees. (arXiv:2206.11056v1 [cs.LG])
    Whether Millennials are less auto-centric than previous generations has been widely discussed in the literature. Most existing studies use regression models and assume that all factors contribute linear-additively to young adults' driving behaviors. This study relaxes this assumption by applying a non-parametric statistical learning method, namely gradient boosting decision trees (GBDT). Using U.S. nationwide travel surveys for 2001 and 2017, this study examines the non-linear dose-response effects of lifecycle, socio-demographic and residential factors on the daily driving distances of Millennial and Gen-X young adults. Holding all other factors constant, Millennial young adults had shorter predicted daily driving distances than their Gen-X counterparts. Moreover, residential and economic factors explain around 50% of young adults' daily driving distances, while the collective contribution of life course events and demographics is about 33%. This study also identifies the density ranges for formulating effective land use policies aimed at reducing automobile travel demand.
    Performance Prediction Under Dataset Shift. (arXiv:2206.10697v1 [cs.LG])
    ML models deployed in production often have to face unknown domain changes, fundamentally different from their training settings. Performance prediction models carry out the crucial task of measuring the impact of these changes on model performance. We study the generalization capabilities of various performance prediction models to new domains by learning on generated synthetic perturbations. Empirical validation on a benchmark of ten tabular datasets shows that models based upon state-of-the-art shift detection metrics are not expressive enough to generalize to unseen domains, while Error Predictors bring a consistent improvement in performance prediction under shift. We additionally propose a natural and effortless uncertainty estimation of the predicted accuracy that ensures reliable use of performance predictors. Our implementation is available at https://github.com/dataiku-research/performance_prediction_under_shift.
    Multi-level Domain Adaptation for Lane Detection. (arXiv:2206.10692v1 [cs.CV])
    We focus on bridging domain discrepancy in lane detection among different scenarios to greatly reduce extra annotation and re-training costs for autonomous driving. A critical factor hindering performance improvement in cross-domain lane detection is that conventional methods only focus on pixel-wise loss while ignoring the shape and position priors of lanes. To address the issue, we propose the Multi-level Domain Adaptation (MLDA) framework, a new perspective for handling cross-domain lane detection at three complementary semantic levels: pixel, instance and category. Specifically, at the pixel level, we propose to apply cross-class confidence constraints in self-training to tackle the imbalanced confidence distribution of lane and background. At the instance level, we go beyond pixels to treat segmented lanes as instances and facilitate discriminative features in the target domain with triplet learning, which effectively rebuilds the semantic context of lanes and helps alleviate feature confusion. At the category level, we propose an adaptive inter-domain embedding module to utilize the position prior of lanes during adaptation. On two challenging datasets, i.e., TuSimple and CULane, our approach improves lane detection performance by a large margin, with gains of 8.8% on accuracy and 7.4% on F1-score respectively, compared with state-of-the-art domain adaptation algorithms.
    Sparse Kernel Gaussian Processes through Iterative Charted Refinement (ICR). (arXiv:2206.10634v1 [cs.LG])
    Gaussian Processes (GPs) are highly expressive, probabilistic models. A major limitation is their computational complexity. Naively, exact GP inference requires $\mathcal{O}(N^3)$ computations with $N$ denoting the number of modeled points. Current approaches to overcome this limitation either rely on sparse, structured or stochastic representations of data or kernel respectively and usually involve nested optimizations to evaluate a GP. We present a new, generative method named Iterative Charted Refinement (ICR) to model GPs on nearly arbitrarily spaced points in $\mathcal{O}(N)$ time for decaying kernels without nested optimizations. ICR represents long- as well as short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart. In our experiment with points whose spacings vary over two orders of magnitude, ICR's accuracy is comparable to state-of-the-art GP methods. ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU and has already been successfully applied to model a GP with $122$ billion parameters.
    Learning Continuous Rotation Canonicalization with Radial Beam Sampling. (arXiv:2206.10690v1 [cs.CV])
    Nearly all state-of-the-art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource-demanding data inflation process, predictions often generalize poorly. The inductive biases inherent to convolutional neural networks allow for translation equivariance through kernels acting in parallel to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to inherently incorporate center-rotation covariance. Together with an angle distance loss, we present a radial beam-based image canonicalization model, BIC for short. Our model allows for maximal continuous angle regression and canonicalizes arbitrary center-rotated input images. As a pre-processing model, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our end-to-end trained angle regressor is able to predict continuous rotation angles on several vision datasets, i.e. FashionMNIST, CIFAR10, COIL100, and LFW.
    TiCo: Transformation Invariance and Covariance Contrast for Self-Supervised Visual Representation Learning. (arXiv:2206.10698v1 [cs.CV])
    We present Transformation Invariance and Covariance Contrast (TiCo) for self-supervised visual representation learning. Similar to other recent self-supervised learning methods, our method is based on maximizing the agreement among embeddings of different distorted versions of the same image, which pushes the encoder to produce transformation invariant representations. To avoid the trivial solution where the encoder generates constant vectors, we regularize the covariance matrix of the embeddings from different images by penalizing low rank solutions. By jointly minimizing the transformation invariance loss and covariance contrast loss, we get an encoder that is able to produce useful representations for downstream tasks. We analyze our method and show that it can be viewed as a variant of MoCo with an implicit memory bank of unlimited size at no extra memory cost. This makes our method perform better than alternative methods when using small batch sizes. TiCo can also be seen as a modification of Barlow Twins. By connecting the contrastive and redundancy-reduction methods together, TiCo gives us new insights into how joint embedding methods work.
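    A hedged sketch of a TiCo-style objective: an invariance term pulls two views' embeddings together, while a covariance term built on an exponential moving average of the embedding covariance penalizes collapse onto low-rank solutions. The hyperparameters and exact term forms below are illustrative.

        import torch
        import torch.nn.functional as F

        def tico_loss(z1, z2, C, beta=0.9, rho=8.0):
            z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
            C = beta * C + (1 - beta) * (z1.T @ z1) / z1.shape[0]  # EMA covariance
            invariance = -(z1 * z2).sum(dim=-1).mean()    # agreement between views
            contrast = rho * ((z1 @ C) * z1).sum(dim=-1).mean()  # penalize collapse
            return invariance + contrast, C.detach()

        C = torch.zeros(32, 32)
        loss, C = tico_loss(torch.randn(64, 32), torch.randn(64, 32), C)
        print(loss)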
    Physics-informed machine learning with differentiable programming for heterogeneous underground reservoir pressure management. (arXiv:2206.10718v1 [physics.comp-ph])
    Avoiding over-pressurization in subsurface reservoirs is critical for applications like CO2 sequestration and wastewater injection. Managing the pressures by controlling injection/extraction is challenging because of complex heterogeneity in the subsurface. The heterogeneity typically requires high-fidelity physics-based models to make predictions on CO$_2$ fate. Furthermore, characterizing the heterogeneity accurately is fraught with parametric uncertainty. Accounting for both heterogeneity and uncertainty makes this a computationally-intensive problem challenging for current reservoir simulators. To tackle this, we use differentiable programming with a full-physics model and machine learning to determine the fluid extraction rates that prevent over-pressurization at critical reservoir locations. We use the DPFEHM framework, which has trustworthy physics based on the standard two-point flux finite volume discretization and is also automatically differentiable like machine learning models. Our physics-informed machine learning framework uses convolutional neural networks to learn an appropriate extraction rate based on the permeability field. We also perform a hyperparameter search to improve the model's accuracy. Training and testing scenarios are executed to evaluate the feasibility of using physics-informed machine learning to manage reservoir pressures. We constructed and tested a sufficiently accurate simulator that is 400,000 times faster than the underlying physics-based simulator, allowing for near real-time analysis and robust uncertainty quantification.
    Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression. (arXiv:2206.11049v1 [cs.SD])
    We propose a novel Dynamic Restrained Uncertainty Weighting Loss to handle the problem of balancing the contributions of multiple tasks in the ICML ExVo 2022 Challenge. The multitask setting aims to recognize expressed emotions and demographic traits from vocal bursts jointly. Our strategy combines the advantages of Uncertainty Weighting and Dynamic Weight Average by extending the weights with a restraint term to make the learning process more explainable. We use a lightweight multi-exit CNN architecture to implement our proposed loss approach. The experimental H-Mean score (0.394) shows a substantial improvement over the baseline H-Mean score (0.335).
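    For orientation, homoscedastic uncertainty weighting learns one log-variance per task and weights each task loss by its negative exponential; the sketch below adds a simple restraint term on the log-variances. The restraint form is our assumption of what a "restrained" variant could look like, not the paper's exact loss.

        import torch

        log_vars = torch.zeros(3, requires_grad=True)  # one learnable s_i per task

        def druw_loss(task_losses, lam=0.1):
            weighted = sum(torch.exp(-s) * l + s
                           for s, l in zip(log_vars, task_losses))
            restraint = lam * log_vars.abs().sum()  # keeps weights from drifting
            return weighted + restraint

        losses = [torch.tensor(1.0), torch.tensor(0.5), torch.tensor(2.0)]
        print(druw_loss(losses))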
    Learning Debiased Classifier with Biased Committee. (arXiv:2206.10843v1 [cs.LG])
    Neural networks are prone to being biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. This paper proposes a new method for training debiased classifiers with no spurious attribute labels. The key idea of the method is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlations, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms existing methods that, like ours, use no spurious attribute labels, and occasionally even surpasses those relying on bias labels.
    KiloNeuS: Implicit Neural Representations with Real-Time Global Illumination. (arXiv:2206.10885v1 [cs.CV])
    The latest trends in inverse rendering techniques for reconstruction use neural networks to learn 3D representations as neural fields. NeRF-based techniques fit multi-layer perceptrons (MLPs) to a set of training images to estimate a radiance field which can then be rendered from any virtual camera by means of volume rendering algorithms. Major drawbacks of these representations are the lack of well-defined surfaces and non-interactive rendering times, as wide and deep MLPs must be queried millions of times per single frame. Each of these limitations has recently been overcome individually, but managing to accomplish this simultaneously opens up new use cases. We present KiloNeuS, a new neural object representation that can be rendered in path-traced scenes at interactive frame rates. KiloNeuS enables the simulation of realistic light interactions between neural and classic primitives in shared scenes, and it demonstrably performs in real-time with plenty of room for future optimizations and extensions.
    $\texttt{FedBC}$: Calibrating Global and Local Models via Federated Learning Beyond Consensus. (arXiv:2206.10815v1 [cs.LG])
    In federated learning (FL), the objective of collaboratively learning a global model through aggregation of model updates across devices tends to oppose the goal of personalization via local information. In this work, we calibrate this tradeoff in a quantitative manner through a multi-criterion optimization-based framework, which we cast as a constrained program: the objective for a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop an algorithm that allows each node to minimize its local component of the Lagrangian through queries to a first-order gradient oracle. Then, the server executes Lagrange multiplier ascent steps followed by a Lagrange multiplier-weighted averaging step. We call this instantiation of the primal-dual method Federated Learning Beyond Consensus ($\texttt{FedBC}$). Theoretically, we establish that $\texttt{FedBC}$ converges to a first-order stationary point at rates that match the state of the art, up to an additional error term that depends on the tolerance parameter arising from the proximity constraints. Overall, the analysis is a novel characterization of primal-dual methods applied to non-convex saddle point problems with nonlinear constraints. Finally, we demonstrate that $\texttt{FedBC}$ balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with the state of the art.
    Supermodular $f$-divergences and bounds on lossy compression and generalization error with mutual $f$-information. (arXiv:2206.11042v1 [cs.IT])
    In this paper, we introduce super-modular $f$-divergences and provide three applications for them: (i) we introduce a Sanov-type upper bound on the tail probability of the sum of independent random variables based on a super-modular $f$-divergence and show that our generalized Sanov bound strictly improves over the ordinary one; (ii) we consider the lossy compression problem, which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences; and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds. Moreover, super-modular $f$-divergences are utilized to reduce the dimension of the problem and obtain single-letter bounds.
    POGEMA: Partially Observable Grid Environment for Multiple Agents. (arXiv:2206.10944v1 [cs.LG])
    We introduce POGEMA (https://github.com/AIRI-Institute/pogema), a sandbox for challenging partially observable multi-agent pathfinding (PO-MAPF) problems. This grid-based environment was specifically designed to be a flexible, tunable and scalable benchmark. It can be tailored to a variety of PO-MAPF settings, which can serve as an excellent testing ground for planning and learning methods and their combination, and will allow us to move towards filling the gap between AI planning and learning.
    Deep Reinforcement Learning for Turbulence Modeling in Large Eddy Simulations. (arXiv:2206.11038v1 [physics.flu-dyn])
    Over the last years, supervised learning (SL) has established itself as the state-of-the-art for data-driven turbulence modeling. In the SL paradigm, models are trained based on a dataset, which is typically computed a priori from a high-fidelity solution by applying the respective filter function, which separates the resolved and the unresolved flow scales. For implicitly filtered large eddy simulation (LES), this approach is infeasible, since here the employed discretization itself acts as an implicit filter function. As a consequence, the exact filter form is generally not known and thus the corresponding closure terms cannot be computed even if the full solution is available. The reinforcement learning (RL) paradigm can be used to avoid this inconsistency by training not on a previously obtained dataset, but instead by interacting directly with the dynamical LES environment itself. This makes it possible to incorporate the potentially complex implicit LES filter into the training process by design. In this work, we apply a reinforcement learning framework to find an optimal eddy-viscosity for implicitly filtered large eddy simulations of forced homogeneous isotropic turbulence. For this, we formulate the task of turbulence modeling as an RL task with a policy network based on convolutional neural networks that adapts the eddy-viscosity in LES dynamically in space and time based on the local flow state only. We demonstrate that the trained models can provide long-term stable simulations and that they outperform established analytical models in terms of accuracy. In addition, the models generalize well to other resolutions and discretizations. We thus demonstrate that RL can provide a framework for consistent, accurate and stable turbulence modeling, especially for implicitly filtered LES.
    Robust Universal Adversarial Perturbations. (arXiv:2206.10858v1 [cs.LG])
    Universal Adversarial Perturbations (UAPs) are imperceptible, image-agnostic vectors that cause deep neural networks (DNNs) to misclassify inputs from a data distribution with high probability. Existing methods do not create UAPs robust to transformations, thereby limiting their applicability as real-world attacks. In this work, we introduce a new concept and formulation of robust universal adversarial perturbations. Based on our formulation, we build a novel, iterative algorithm that leverages probabilistic robustness bounds for generating UAPs robust against transformations generated by composing arbitrary sub-differentiable transformation functions. We perform an extensive evaluation on the popular CIFAR-10 and ILSVRC 2012 datasets, measuring robustness under human-interpretable semantic transformations, such as rotation and contrast changes, that are common in the real world. Our results show that our generated UAPs are significantly more robust than those from baselines.
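    An illustrative sketch of the basic mechanism: the perturbation is optimized through randomly sampled sub-differentiable transformations (here a rotation), expectation-over-transformation style. This is a simplified stand-in for the paper's bound-driven algorithm; the toy model, step size and iteration count are assumptions.

        import torch
        import torchvision.transforms.functional as TF

        def robust_uap_step(model, images, labels, uap, eps=8 / 255, lr=0.01):
            angle = float(torch.empty(1).uniform_(-15, 15))  # random transformation
            uap = uap.detach().requires_grad_(True)
            perturbed = TF.rotate(images + uap, angle).clamp(0, 1)
            loss = torch.nn.functional.cross_entropy(model(perturbed), labels)
            loss.backward()  # ascend on the loss to encourage misclassification
            with torch.no_grad():
                uap = (uap + lr * uap.grad.sign()).clamp(-eps, eps)
            return uap

        model = torch.nn.Sequential(torch.nn.Flatten(),
                                    torch.nn.Linear(3 * 32 * 32, 10))
        images = torch.rand(8, 3, 32, 32)
        labels = torch.randint(0, 10, (8,))
        uap = torch.zeros(1, 3, 32, 32)
        for _ in range(3):
            uap = robust_uap_step(model, images, labels, uap)
        print(uap.abs().max())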
    Influence of uncertainty estimation techniques on false-positive reduction in liver lesion detection. (arXiv:2206.10911v1 [eess.IV])
    Deep learning techniques show success in detecting objects in medical images, but still suffer from false-positive predictions that may hinder accurate diagnosis. The estimated uncertainty of the neural network output has been used to flag incorrect predictions. We study the role played by features computed from neural network uncertainty estimates and shape-based features computed from binary predictions in reducing false positives in liver lesion detection, by developing a classification-based post-processing step for different uncertainty estimation methods. We demonstrate an improvement in the lesion detection performance of the neural network (with respect to F1-score) for all uncertainty estimation methods on two datasets, comprising abdominal MR and CT images respectively. We show that features computed from neural network uncertainty estimates tend not to contribute much toward reducing false positives. Our results show that factors like class imbalance (true over false positive ratio) and shape-based features extracted from uncertainty maps play an important role in distinguishing false positive from true positive predictions.
    Graph Neural Networks as Gradient Flows. (arXiv:2206.10991v1 [cs.LG])
    Dynamical systems minimizing an energy are ubiquitous in geometry and physics. We propose a gradient flow framework for GNNs where the equations follow the direction of steepest descent of a learnable energy. This approach allows us to explain the GNN evolution from a multi-particle perspective as learning attractive and repulsive forces in feature space via the positive and negative eigenvalues of a symmetric "channel-mixing" matrix. We perform spectral analysis of the solutions and conclude that gradient flow graph convolutional models can induce a dynamics dominated by the graph high frequencies, which is desirable for heterophilic datasets. We also describe structural constraints on common GNN architectures allowing us to interpret them as gradient flows. We perform thorough ablation studies corroborating our theoretical analysis and show competitive performance of simple and lightweight models on real-world homophilic and heterophilic datasets.
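    A minimal sketch of one such discretized flow: with a symmetrized channel-mixing matrix $W$, the Euler step below descends the energy $\frac{1}{2}\mathrm{tr}(X^\top(I-\hat{A})XW)$, and the signs of $W$'s eigenvalues set attraction versus repulsion. This is our own simplified instance of a gradient-flow update, not the paper's full parameterization; dimensions and step size are illustrative.

        import torch

        def gradient_flow_step(X, A_hat, W_raw, tau=0.2):
            W_sym = 0.5 * (W_raw + W_raw.T)  # symmetric channel-mixing matrix
            # Euler step descending E(X) = 0.5 * tr(X^T (I - A_hat) X W_sym)
            return X - tau * ((X - A_hat @ X) @ W_sym)

        n, d = 5, 8
        A = torch.rand(n, n)
        A = ((A + A.T) > 1.0).float()                        # random symmetric adjacency
        deg = A.sum(1).clamp(min=1.0)
        A_hat = A / torch.sqrt(deg[:, None] * deg[None, :])  # symmetric normalization
        X, W = torch.randn(n, d), torch.randn(d, d)
        print(gradient_flow_step(X, A_hat, W).shape)         # torch.Size([5, 8])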
    Agent-based Graph Neural Networks. (arXiv:2206.11010v1 [cs.LG])
    We present a novel graph neural network we call AgentNet, which is designed specifically for graph-level tasks. AgentNet is inspired by sublinear algorithms, featuring a computational complexity that is independent of the graph size. The architecture of AgentNet differs fundamentally from the architectures of known graph neural networks. In AgentNet, some trained \textit{neural agents} intelligently walk the graph, and then collectively decide on the output. We provide an extensive theoretical analysis of AgentNet: We show that the agents can learn to systematically explore their neighborhood and that AgentNet can distinguish some structures that are even indistinguishable by 3-WL. Moreover, AgentNet is able to separate any two graphs which are sufficiently different in terms of subgraphs. We confirm these theoretical results with synthetic experiments on hard-to-distinguish graphs and real-world graph classification tasks. In both cases, we compare favorably not only to standard GNNs but also to computationally more expensive GNN extensions.
    A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement. (arXiv:2206.11000v1 [eess.AS])
    Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies have suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we conduct a systematic comparison between different methods of incorporating phonetic information in a speech enhancement model. By conducting a series of controlled experiments, we observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance, considering both causal and non-causal models. Specifically, we evaluate three settings for injecting phonetic information, namely: i) feature conditioning; ii) perceptual supervision; and iii) regularization. Phonetic features are obtained using an intermediate layer of either a supervised pre-trained Automatic Speech Recognition (ASR) model or a pre-trained Self-Supervised Learning (SSL) model. We further observe the effect of choosing different embedding layers on performance, considering both manual and learned configurations. Results suggest that using an SSL model for phonetic features outperforms the ASR one in most cases. Interestingly, the conditioning setting performs best among the evaluated configurations.
    Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks. (arXiv:2206.10870v1 [stat.ML])
    Bilevel optimization has gained growing interest, with numerous applications in meta learning, minimax games, reinforcement learning, and nested composition optimization. This paper studies the problem of distributed bilevel optimization over a network where agents can only communicate with neighbors, including examples from multi-task, multi-agent learning and federated learning. We propose a gossip-based distributed bilevel learning algorithm that allows networked agents to solve both the inner and outer optimization problems in a single timescale and share information via network propagation. We show that our algorithm enjoys the $\mathcal{O}(\frac{1}{K \epsilon^2})$ per-agent sample complexity for general nonconvex bilevel optimization and $\mathcal{O}(\frac{1}{K \epsilon})$ for strongly convex objectives, achieving a speedup that scales linearly with the network size $K$. The sample complexities are optimal in both $\epsilon$ and $K$. We test our algorithm on the examples of hyperparameter tuning and decentralized reinforcement learning. Simulation experiments confirm that our algorithm achieves state-of-the-art training efficiency and test accuracy.
    COVYT: Introducing the Coronavirus YouTube and TikTok speech dataset featuring the same speakers with and without infection. (arXiv:2206.11045v1 [eess.AS])
    More than two years after its outbreak, the COVID-19 pandemic continues to plague medical systems around the world, putting a strain on scarce resources, and claiming human lives. From the very beginning, various AI-based COVID-19 detection and monitoring tools have been pursued in an attempt to stem the tide of infections through timely diagnosis. In particular, computer audition has been suggested as a non-invasive, cost-efficient, and eco-friendly alternative for detecting COVID-19 infections through vocal sounds. However, like all AI methods, computer audition is heavily dependent on the quantity and quality of available data, and large-scale COVID-19 sound datasets are difficult to acquire -- amongst other reasons -- due to the sensitive nature of such data. To that end, we introduce the COVYT dataset -- a novel COVID-19 dataset collected from public sources containing more than 8 hours of speech from 65 speakers. Compared to other existing COVID-19 sound datasets, the unique feature of the COVYT dataset is that it comprises both COVID-19 positive and negative samples from all 65 speakers. We analyse the acoustic manifestation of COVID-19 on the basis of these perfectly speaker-balanced `in-the-wild' data using interpretable audio descriptors, and investigate several classification scenarios that shed light on proper partitioning strategies for fair speech-based COVID-19 detection.
    Predicting Team Performance with Spatial Temporal Graph Convolutional Networks. (arXiv:2206.10720v1 [cs.LG])
    This paper presents a new approach for predicting team performance from the behavioral traces of a set of agents. This spatiotemporal forecasting problem is very relevant to sports analytics challenges such as coaching and opponent modeling. We demonstrate that our proposed model, Spatial Temporal Graph Convolutional Networks (ST-GCN), outperforms other classification techniques at predicting game score from a short segment of player movement and game features. Our proposed architecture uses a graph convolutional network to capture the spatial relationships between team members and Gated Recurrent Units to analyze dynamic motion information. An ablative evaluation was performed to demonstrate the contributions of different aspects of our architecture.
    SpA-Former: Transformer image shadow detection and removal via spatial attention. (arXiv:2206.10910v1 [cs.CV])
    In this paper, we propose an end-to-end SpA-Former to recover a shadow-free image from a single shaded image. Unlike traditional methods that require two steps for shadow detection and then shadow removal, SpA-Former unifies these steps into one: it is a one-stage network capable of directly learning the mapping function between shadows and no shadows, without requiring a separate shadow detection step. Thus, SpA-Former is adaptable to real image de-shadowing for shadows projected on different semantic regions. SpA-Former consists of a transformer layer, a series of joint Fourier transform residual blocks, and two-wheel joint spatial attention. The network is able to handle the task while achieving very fast processing efficiency. Our code is released at https://github.com/zhangbaijin/Spatial-Transformer-shadow-removal
    Robust Bayesian Recourse. (arXiv:2206.10833v1 [cs.LG])
    Algorithmic recourse aims to recommend an informative feedback to overturn an unfavorable machine learning decision. We introduce in this paper the Bayesian recourse, a model-agnostic recourse that minimizes the posterior probability odds ratio. Further, we present its min-max robust counterpart with the goal of hedging against future changes in the machine learning model parameters. The robust counterpart explicitly takes into account possible perturbations of the data in a Gaussian mixture ambiguity set prescribed using the optimal transport (Wasserstein) distance. We show that the resulting worst-case objective function can be decomposed into solving a series of two-dimensional optimization subproblems, and the min-max recourse finding problem is thus amenable to a gradient descent algorithm. Contrary to existing methods for generating robust recourses, the robust Bayesian recourse does not require a linear approximation step. The numerical experiment demonstrates the effectiveness of our proposed robust Bayesian recourse facing model shifts. Our code is available at https://github.com/VinAIResearch/robust-bayesian-recourse.
    Bregman Power k-Means for Clustering Exponential Family Data. (arXiv:2206.10860v1 [stat.ML])
    Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.
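    For intuition, the sketch below runs hard clustering under the generalized KL divergence (the Bregman divergence matching Poisson-like count data): assignments use the divergence, yet centroids remain plain means, the key property of Bregman hard clustering. The toy data and settings are our own assumptions, and the power/annealing schedule of the paper is omitted.

        import numpy as np

        def gen_kl(x, m, eps=1e-9):
            # generalized KL divergence, the Bregman divergence of sum(x log x - x)
            return (x * np.log((x + eps) / (m + eps)) - x + m).sum(-1)

        def bregman_kmeans(X, k, iters=20, seed=0):
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), k, replace=False)]
            for _ in range(iters):
                d = np.stack([gen_kl(X, c) for c in centers], axis=1)
                labels = d.argmin(1)
                centers = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            return labels, centers

        rng = np.random.default_rng(1)
        X = np.vstack([rng.poisson(3.0, (50, 4)),
                       rng.poisson(12.0, (50, 4))]).astype(float)
        labels, _ = bregman_kmeans(X, 2)
        print(np.bincount(labels))  # two clusters of roughly 50 points each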
    Learning Distribution Grid Topologies: A Tutorial. (arXiv:2206.10837v1 [math.OC])
    Unveiling feeder topologies from data is of paramount importance to advance situational awareness and proper utilization of smart resources in power distribution grids. This tutorial summarizes, contrasts, and establishes useful links between recent works on topology identification and detection schemes that have been proposed for power distribution grids under different regimes of measurement type, observability, and sampling. The primary focus is to highlight methods that overcome the limited availability of measurement devices in distribution grids, while enhancing topology estimates using conservation laws of power-flow physics and structural properties of feeders. Grid data from phasor measurement units or smart meters can be collected either passively in the traditional way, or actively, upon actuating grid resources and measuring the feeder's voltage response. Analytical claims on feeder identifiability and detectability are reviewed under disparate meter placement scenarios. Such topology learning claims can be attained exactly or approximately so via algorithmic solutions with various levels of computational complexity, ranging from least-squares fits to convex optimization problems, and from polynomial-time searches over graphs to mixed-integer programs. This tutorial aspires to provide researchers and engineers with knowledge of the current state-of-the-art in tractable distribution grid learning and insights into future directions of work.
    Diagnostic Tool for Out-of-Sample Model Evaluation. (arXiv:2206.10982v1 [stat.ML])
    Assessment of model fitness is an important step in many problems. Models are typically fitted to training data by minimizing a loss function, such as the squared-error or negative log-likelihood, and it is natural to desire low losses on future data. This letter considers the use of a test data set to characterize the out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is computationally efficient and can be interpreted as an empirical quantile. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
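    A generic sketch of the kind of finite-sample guarantee an empirical quantile can give: with $n$ i.i.d. test losses, an order statistic upper-bounds the $\alpha$-quantile of future losses with confidence derived from the Binomial distribution. This is a standard conformal-style construction, not necessarily the paper's exact tool.

        import numpy as np
        from scipy import stats

        def loss_quantile_bound(losses, alpha=0.9, confidence=0.95):
            losses = np.sort(losses)
            n = len(losses)
            # smallest k with P(Binomial(n, alpha) <= k - 1) >= confidence
            k = int(stats.binom.ppf(confidence, n, alpha)) + 1
            return losses[min(k, n) - 1]

        test_losses = np.random.default_rng(0).exponential(size=200)
        print(loss_quantile_bound(test_losses))  # bound on the 0.9 loss quantile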
    Play It Cool: Dynamic Shifting Prevents Thermal Throttling. (arXiv:2206.10849v1 [cs.LG])
    Machine learning (ML) has entered the mobile era where an enormous number of ML models are deployed on edge devices. However, running common ML models on edge devices continuously may generate excessive heat from the computation, forcing the device to "slow down" to prevent overheating, a phenomenon called thermal throttling. This paper studies the impact of thermal throttling on mobile phones: when it occurs, the CPU clock frequency is reduced, and the model inference latency may increase dramatically. This unpleasant inconsistent behavior has a substantial negative effect on user experience, but it has been overlooked for a long time. To counter thermal throttling, we propose to utilize dynamic networks with shared weights and dynamically shift between large and small ML models seamlessly according to their thermal profile, i.e., shifting to a small model when the system is about to throttle. With the proposed dynamic shifting, the application runs consistently without experiencing CPU clock frequency degradation and latency increase. In addition, we also study the resulting accuracy when dynamic shifting is deployed and show that our approach provides a reasonable trade-off between model latency and model accuracy.
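    A toy sketch of the shifting policy: choose between weight-shared "large" and "small" configurations from a temperature reading, with hysteresis so the system does not oscillate. The thresholds and the read_temperature() stand-in are assumptions; a real deployment would query the platform's thermal API instead.

        import random

        THROTTLE_C, RECOVER_C = 65.0, 55.0  # assumed thresholds with hysteresis

        def read_temperature():
            return random.uniform(40, 80)   # stand-in for a real sensor read

        def choose_model(current, temp):
            if current == "large" and temp >= THROTTLE_C:
                return "small"              # back off before the OS throttles the CPU
            if current == "small" and temp <= RECOVER_C:
                return "large"              # cool again: restore full accuracy
            return current

        model = "large"
        for step in range(5):
            temp = read_temperature()
            model = choose_model(model, temp)
            print(f"step {step}: {temp:.1f}C -> {model} model")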
    Multi-Omic Data Integration and Feature Selection for Survival-based Patient Stratification via Supervised Concrete Autoencoders. (arXiv:2206.10699v1 [cs.LG])
    Cancer is a complex disease with significant social and economic impact. Advancements in high-throughput molecular assays and the reduced cost of performing high-quality multi-omics measurements have fuelled insights through machine learning. Previous studies have shown promise in using multiple omic layers to predict survival and stratify cancer patients. In this paper, we develop a Supervised Autoencoder (SAE) model for survival-based multi-omic integration, which improves upon previous work, and report a Concrete Supervised Autoencoder model (CSAE), which uses feature selection to jointly reconstruct the input features and predict survival. Our experiments show that our models outperform or are on par with some of the most commonly used baselines, while either providing a better survival separation (SAE) or being more interpretable (CSAE). We also perform a feature selection stability analysis on our models and notice that there is a power-law relationship with features that are commonly associated with survival. The code for this project is available at: https://github.com/phcavelar/coxae
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v1 [cs.LG])
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.
    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. (arXiv:2206.10789v1 [cs.CV])
    We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.  ( 2 min )
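    The sequence-to-sequence framing can be illustrated with a toy encoder-decoder Transformer whose target vocabulary is a codebook of discrete image tokens, standing in for a ViT-VQGAN tokenizer. All vocabulary sizes and dimensions below are illustrative and nowhere near Parti's 20B-parameter scale.

    ```python
    import torch
    import torch.nn as nn

    TEXT_VOCAB, IMAGE_VOCAB, D = 1000, 8192, 128
    text_emb = nn.Embedding(TEXT_VOCAB, D)
    image_emb = nn.Embedding(IMAGE_VOCAB, D)
    transformer = nn.Transformer(d_model=D, nhead=4, num_encoder_layers=2,
                                 num_decoder_layers=2, batch_first=True)
    to_logits = nn.Linear(D, IMAGE_VOCAB)

    text = torch.randint(0, TEXT_VOCAB, (2, 16))           # (batch, text_len)
    image_tokens = torch.randint(0, IMAGE_VOCAB, (2, 64))  # from an image tokenizer
    tgt_in, tgt_out = image_tokens[:, :-1], image_tokens[:, 1:]
    mask = transformer.generate_square_subsequent_mask(tgt_in.shape[1])

    # Teacher-forced training: predict the next image token given the text.
    h = transformer(text_emb(text), image_emb(tgt_in), tgt_mask=mask)
    loss = nn.functional.cross_entropy(to_logits(h).transpose(1, 2), tgt_out)
    loss.backward()  # at inference, image tokens are decoded autoregressively
    ```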
    Jointist: Joint Learning for Multi-instrument Transcription and Its Applications. (arXiv:2206.10805v1 [cs.SD])
    In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for explicit multi-instrument functionality, while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In experiments on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful, alongside spectrograms, in solving downbeat detection, chord recognition, and key estimation.  ( 2 min )
    DaisyRec 2.0: Benchmarking Recommendation for Rigorous Evaluation. (arXiv:2206.10848v1 [cs.IR])
    Recently, one critical issue looms large in the field of recommender systems -- there are no effective benchmarks for rigorous evaluation -- which consequently leads to unreproducible evaluation and unfair comparison. We therefore conduct studies from both theoretical and experimental perspectives, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review of 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and different modes of rigorous evaluation are defined and discussed in depth accordingly. For the experimental study, we release the DaisyRec 2.0 library by integrating these hyper-factors to perform rigorous evaluation, whereby a holistic empirical study is conducted to unveil the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing the performance of ten state-of-the-art models across six evaluation metrics on six datasets as a reference for later studies. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays the foundation for further investigation.  ( 2 min )
    Automated Cancer Subtyping via Vector Quantization Mutual Information Maximization. (arXiv:2206.10801v1 [cs.LG])
    Cancer subtyping is crucial for understanding the nature of tumors and providing suitable therapy. However, existing labelling methods are medically controversial and have driven the process of subtyping away from teaching signals. Moreover, cancer genetic expression profiles are high-dimensional, scarce, and have complicated dependence, thereby posing a serious challenge to existing subtyping models in outputting sensible clusterings. In this study, we propose a novel clustering method for exploiting genetic expression profiles and distinguishing subtypes in an unsupervised manner. The proposed method adaptively learns categorical correspondence from latent representations of expression profiles to the subtypes output by the model. By maximizing the problem-agnostic mutual information between input expression profiles and output subtypes, our method can automatically decide a suitable number of subtypes. Through experiments, we demonstrate that our proposed method can refine existing controversial labels, and, by further medical analysis, this refinement is shown to have a high correlation with cancer survival rates.  ( 2 min )
    Sharp Constants in Uniformity Testing via the Huber Statistic. (arXiv:2206.10722v1 [stat.ML])
    Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\epsilon$-far distribution with $1-\delta$ probability is $n = \Theta\left(\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2}$ in the regime where this term is dominant, unlike all other existing testers.  ( 2 min )
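    The collision statistic at the heart of this comparison is simple to compute: under the uniform distribution over $[m]$ the expected pairwise collision rate is $1/m$, while for a distribution $p$ it is $\|p\|_2^2 > 1/m$. The sketch below implements the statistic with a crude midpoint threshold; the sharp-constant calibration and the Huber-loss tester are the paper's contributions, not reproduced here.

    ```python
    import numpy as np

    def collision_rate(samples, m):
        counts = np.bincount(samples, minlength=m)
        pairs = (counts * (counts - 1) // 2).sum()   # number of colliding pairs
        n = len(samples)
        return pairs / (n * (n - 1) / 2)             # expected 1/m if uniform

    rng = np.random.default_rng(0)
    m, n, eps = 1000, 5000, 0.2
    # A non-uniform alternative: boost half the support, suppress the rest,
    # giving ||p||_2^2 = (1 + eps^2)/m.
    p = np.where(np.arange(m) < m // 2, (1 + eps) / m, (1 - eps) / m)
    threshold = (1 + eps**2 / 2) / m                 # crude midpoint rule

    for name, s in [("uniform", rng.integers(0, m, size=n)),
                    ("non-uniform", rng.choice(m, size=n, p=p))]:
        rate = collision_rate(s, m)
        print(name, f"{rate:.5f}", "reject" if rate > threshold else "accept")
    ```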
    Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective. (arXiv:2206.10691v1 [cs.LG])
    The problem of out-of-distribution detection for graph classification is far from being solved. Existing models tend to be overconfident about OOD examples or completely ignore the detection task. In this work, we consider this problem from the uncertainty estimation perspective and perform a comparison of several recently proposed methods. In our experiments, we find that there is no universal approach to OOD detection, and that it is important to consider both graph representations and the predictive categorical distribution.
    TraSE: Towards Tackling Authorial Style from a Cognitive Science Perspective. (arXiv:2206.10706v1 [cs.CL])
    Stylistic analysis of text is a key task in research areas ranging from authorship attribution to forensic analysis and personality profiling. Existing approaches to stylistic analysis are plagued by issues like topic influence, lack of discriminability for a large number of authors, and the requirement for large amounts of diverse data. In this paper, the sources of these issues are identified, along with the necessity of a cognitive perspective on authorial style in addressing them. A novel feature representation, called Trajectory-based Style Estimation (TraSE), is introduced to support this purpose. Authorship attribution experiments with over 27,000 authors and 1.4 million samples in a cross-domain scenario resulted in 90% attribution accuracy, suggesting that the feature representation is immune to such negative influences and is an excellent candidate for stylistic analysis. Finally, a qualitative analysis is performed on TraSE using physical human characteristics, like age, to validate its claim of capturing cognitive traits.
    Imitation Learning for Generalizable Self-driving Policy with Sim-to-real Transfer. (arXiv:2206.10797v1 [cs.LG])
    Imitation Learning uses the demonstrations of an expert to uncover the optimal policy and it is suitable for real-world robotics tasks as well. In this case, however, the training of the agent is carried out in a simulation environment due to safety, economic and time constraints. Later, the agent is applied in the real-life domain using sim-to-real methods. In this paper, we apply Imitation Learning methods that solve a robotics task in a simulated environment and use transfer learning to apply these solutions in the real-world environment. Our task is set in the Duckietown environment, where the robotic agent has to follow the right lane based on the input images of a single forward-facing camera. We present three Imitation Learning and two sim-to-real methods capable of achieving this task. A detailed comparison is provided on these techniques to highlight their advantages and disadvantages.  ( 2 min )
    Federated Latent Class Regression for Hierarchical Data. (arXiv:2206.10783v1 [cs.LG])
    Federated Learning (FL) allows a number of agents to participate in training a global machine learning model without disclosing locally stored data. Compared to traditional distributed learning, the heterogeneity (non-IID) of the agents slows down the convergence in FL. Furthermore, many datasets, being too noisy or too small, are easily overfitted by complex models, such as deep neural networks. Here, we consider the problem of using FL regression on noisy, hierarchical and tabular datasets in which user distributions are significantly different. Inspired by Latent Class Regression (LCR), we propose a novel probabilistic model, Hierarchical Latent Class Regression (HLCR), and its extension to Federated Learning, FEDHLCR. FEDHLCR consists of a mixture of linear regression models, allowing better accuracy than simple linear regression, while at the same time maintaining its analytical properties and avoiding overfitting. Our inference algorithm, being derived from Bayesian theory, provides strong convergence guarantees and good robustness to overfitting. Experimental results show that FEDHLCR offers fast convergence even in non-IID datasets.  ( 2 min )
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v1 [cs.LG])
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.  ( 2 min )
    Efficient Interdependent Systems Recovery Modeling with DeepONets. (arXiv:2206.10829v1 [cs.LG])
    Modeling the recovery of interdependent critical infrastructure is a key component of quantifying and optimizing societal resilience to disruptive events. However, simulating the recovery of large-scale interdependent systems under random disruptive events is computationally expensive. Therefore, in this paper we propose the application of Deep Operator Networks (DeepONets) to accelerate the recovery modeling of interdependent systems. DeepONets are ML architectures that identify mathematical operators from data. The form of the governing equations that DeepONets identify is similar to that of the interdependent systems recovery model; we therefore hypothesize that DeepONets can efficiently model interdependent systems recovery with little training data. We applied DeepONets to a simple case of four interdependent systems with sixteen states. Overall, DeepONets performed satisfactorily in predicting the recovery of these interdependent systems on out-of-training-sample data when compared to reference results.  ( 2 min )
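    A DeepONet itself is compact: a branch network encodes a function sampled at fixed sensor locations, a trunk network encodes a query point, and the output is their inner product, G(u)(y) ~ <b(u), t(y)>. The sketch below uses random toy data purely to show shapes and a training step; the architecture sizes are illustrative, not those used in the paper.

    ```python
    import torch
    import torch.nn as nn

    class DeepONet(nn.Module):
        def __init__(self, n_sensors=50, p=32):
            super().__init__()
            self.branch = nn.Sequential(nn.Linear(n_sensors, 64), nn.Tanh(),
                                        nn.Linear(64, p))
            self.trunk = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                                       nn.Linear(64, p))

        def forward(self, u_sensors, y):
            b = self.branch(u_sensors)   # (batch, p): encodes the input function
            t = self.trunk(y)            # (batch, p): encodes the query location
            return (b * t).sum(dim=-1, keepdim=True)

    net = DeepONet()
    u = torch.randn(128, 50)             # function samples at sensor points
    y = torch.rand(128, 1)               # query locations
    target = torch.randn(128, 1)         # stand-in operator outputs G(u)(y)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss = nn.functional.mse_loss(net(u, y), target)
    loss.backward()
    opt.step()
    ```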
    BiometricBlender: Ultra-high dimensional, multi-class synthetic data generator to imitate biometric feature space. (arXiv:2206.10747v1 [cs.LG])
    The lack of freely available (real-life or synthetic) high or ultra-high dimensional, multi-class datasets may hamper the rapidly growing research on feature screening, especially in the field of biometrics, where the usage of such datasets is common. This paper reports a Python package called BiometricBlender, which is an ultra-high dimensional, multi-class synthetic data generator to benchmark a wide range of feature screening methods. During the data generation process, the overall usefulness and the intercorrelations of blended features can be controlled by the user, thus the synthetic feature space is able to imitate the key properties of a real biometric dataset.  ( 2 min )
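    The blending idea can be sketched as follows (a hypothetical generator in the spirit of the package, not BiometricBlender's actual API): a small number of hidden class-informative factors are mixed into many observed features, with a user-controlled usefulness knob trading signal against noise.

    ```python
    import numpy as np

    def blended_dataset(n=1000, n_classes=20, n_hidden=10,
                        n_features=5000, usefulness=0.8, seed=0):
        rng = np.random.default_rng(seed)
        y = rng.integers(0, n_classes, size=n)
        class_means = rng.normal(size=(n_classes, n_hidden))
        # Hidden "useful" factors carry the class signal.
        hidden = class_means[y] + 0.5 * rng.normal(size=(n, n_hidden))
        # Each observed feature is a random mixture of hidden factors plus noise,
        # which induces the intercorrelations a real biometric space exhibits.
        mix = rng.normal(size=(n_hidden, n_features))
        signal = hidden @ mix
        noise = rng.normal(size=(n, n_features))
        X = usefulness * signal + (1 - usefulness) * noise
        return X, y

    X, y = blended_dataset()
    print(X.shape, y.shape)   # (1000, 5000) (1000,)
    ```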
    Quantum-Enhanced Selection Operators for Evolutionary Algorithms. (arXiv:2206.10743v1 [quant-ph])
    Genetic algorithms have unique properties which are useful when applied to black box optimization. Using selection, crossover, and mutation operators, candidate solutions may be obtained without the need to calculate a gradient. In this work, we study results obtained from using quantum-enhanced operators within the selection mechanism of a genetic algorithm. Our approach frames the selection process as a minimization of a binary quadratic model with which we encode fitness and distance between members of a population, and we leverage a quantum annealing system to sample low energy solutions for the selection mechanism. We benchmark these quantum-enhanced algorithms against classical algorithms over various black-box objective functions, including the OneMax function, and functions from the IOHProfiler library for black-box optimization. We observe a performance gain in the average number of generations to convergence for the quantum-enhanced elitist selection operator in comparison to its classical counterpart on the OneMax function. We also find that the quantum-enhanced selection operator with non-elitist selection outperforms benchmarks on functions with fitness perturbation from the IOHProfiler library. Additionally, we find that in the case of elitist selection, the quantum-enhanced operators outperform classical benchmarks on functions with varying degrees of dummy variables and neutrality.  ( 2 min )
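    The binary quadratic model can be written down directly: linear terms reward fitness, quadratic terms reward pairwise distance (diversity), and a penalty softly enforces selecting exactly k parents. The sketch below solves a toy instance by brute force; the paper instead samples low-energy solutions on a quantum annealer, and all weights here are illustrative.

    ```python
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n, k, lam, mu = 8, 3, 0.5, 10.0
    pop = rng.normal(size=(n, 5))                     # 8 candidate parents
    fitness = -np.sum(pop**2, axis=1)                 # toy fitness (higher = better)
    dist = np.linalg.norm(pop[:, None] - pop[None, :], axis=-1)

    def energy(x):
        x = np.asarray(x, dtype=float)
        pair = sum(dist[i, j] * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
        return (-fitness @ x                # reward fit individuals
                - lam * pair                # reward mutual distance (diversity)
                + mu * (x.sum() - k) ** 2)  # softly enforce exactly k picks

    best = min(itertools.product([0, 1], repeat=n), key=energy)
    print("selected parents:", [i for i, b in enumerate(best) if b])
    ```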
    Imitate then Transcend: Multi-Agent Optimal Execution with Dual-Window Denoise PPO. (arXiv:2206.10736v1 [cs.LG])
    A novel framework for solving the optimal execution and placement problems using reinforcement learning (RL) with imitation was proposed. The RL agents trained from the proposed framework consistently outperformed the industry benchmark time-weighted average price (TWAP) strategy in execution cost and showed great generalization across out-of-sample trading dates and tickers. The impressive performance was achieved from three aspects. First, our RL network architecture called Dual-window Denoise PPO enabled efficient learning in a noisy market environment. Second, a reward scheme with imitation learning was designed, and a comprehensive set of market features was studied. Third, our flexible action formulation allowed the RL agent to tackle optimal execution and placement collectively resulting in better performance than solving individual problems separately. The RL agent's performance was evaluated in our multi-agent realistic historical limit order book simulator in which price impact was accurately assessed. In addition, ablation studies were also performed, confirming the superiority of our framework.  ( 2 min )
    Beyond Uniform Lipschitz Condition in Differentially Private Optimization. (arXiv:2206.10713v1 [cs.LG])
    Most prior convergence results on differentially private stochastic gradient descent (DP-SGD) are derived under the simplistic assumption of uniform Lipschitzness, i.e., the per-sample gradients are uniformly bounded. This assumption is unrealistic in many problems, e.g., linear regression with Gaussian data. We relax uniform Lipschitzness by instead assuming that the per-sample gradients have sample-dependent upper bounds, i.e., per-sample Lipschitz constants, which themselves may be unbounded. We derive new convergence results for DP-SGD on both convex and nonconvex functions when the per-sample Lipschitz constants have bounded moments. Furthermore, we provide principled guidance on choosing the clip norm in DP-SGD for convex settings satisfying our relaxed version of Lipschitzness, without making distributional assumptions on the Lipschitz constants. We verify the effectiveness of our recommendation via experiments on benchmark datasets.  ( 2 min )
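    For reference, the mechanism whose clip-norm choice is analyzed is standard DP-SGD with per-sample gradient clipping and Gaussian noise. A minimal single-step sketch on least squares (illustrative hyperparameters, not the paper's recommended clip norm):

    ```python
    import torch

    torch.manual_seed(0)
    w = torch.zeros(5, requires_grad=True)
    X, y = torch.randn(32, 5), torch.randn(32)
    clip_C, sigma, lr = 1.0, 1.0, 0.1      # clip norm, noise multiplier, step size

    per_sample = []
    for xi, yi in zip(X, y):
        loss = (xi @ w - yi) ** 2
        (g,) = torch.autograd.grad(loss, w)
        # Clip each sample's gradient to norm at most clip_C.
        scale = torch.clamp(clip_C / (g.norm() + 1e-12), max=1.0)
        per_sample.append(g * scale)

    # Sum clipped gradients, add calibrated Gaussian noise, then average.
    noisy_sum = torch.stack(per_sample).sum(0) + sigma * clip_C * torch.randn(5)
    with torch.no_grad():
        w -= lr * noisy_sum / len(X)
    print(w)
    ```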
    Generative Pretraining for Black-Box Optimization. (arXiv:2206.10786v1 [cs.LG])
    Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose Black-box Optimization Transformer (BOOMER), a generative framework for pretraining black-box optimizers using offline datasets. In BOOMER, we train an autoregressive model to imitate trajectory runs of implicit black-box function optimizers. Since these trajectories are unavailable by default, we develop a simple randomized heuristic to synthesize trajectories by sorting random points from offline data. We show theoretically that this heuristic induces trajectories that mimic transitions from diverse low-fidelity (exploration) to high-fidelity (exploitation) samples. Further, we introduce mechanisms to control the rate at which a trajectory transitions from exploration to exploitation, and use them to generalize outside the offline data at test time. Empirically, we instantiate BOOMER using a causally masked Transformer and evaluate it on Design-Bench, where it ranks best on average, outperforming state-of-the-art baselines.  ( 2 min )
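    The randomized trajectory-synthesis heuristic is easy to state in code: draw random points from the offline dataset and sort them by function value, so the synthetic trajectory moves from low- to high-fidelity designs, as an optimizer's run would. A minimal sketch on a toy offline dataset (names and the toy objective are illustrative):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    offline_x = rng.uniform(-3, 3, size=(10_000, 4))   # offline designs
    offline_y = -np.sum(offline_x**2, axis=1)          # their recorded scores

    def synthesize_trajectory(length=16):
        idx = rng.choice(len(offline_x), size=length, replace=False)
        order = np.argsort(offline_y[idx])             # worst -> best
        idx = idx[order]
        return offline_x[idx], offline_y[idx]

    xs, ys = synthesize_trajectory()
    print(ys[:4], "...", ys[-4:])   # scores increase along the trajectory
    ```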
    Multi-Resolution, Multi-Horizon Distributed Solar PV Power Forecasting with Forecast Combinations. (arXiv:2206.10795v1 [cs.LG])
    Distributed, small-scale solar photovoltaic (PV) systems are being installed at a rapidly increasing rate. This can cause major impacts on distribution networks and energy markets. As a result, there is a significant need for improved forecasting of the power generation of these systems at different time resolutions and horizons. However, the performance of forecasting models depends on the resolution and horizon. Forecast combinations (ensembles), that combine the forecasts of multiple models into a single forecast may be robust in such cases. Therefore, in this paper, we provide comparisons and insights into the performance of five state-of-the-art forecast models and existing forecast combinations at multiple resolutions and horizons. We propose a forecast combination approach based on particle swarm optimization (PSO) that will enable a forecaster to produce accurate forecasts for the task at hand by weighting the forecasts produced by individual models. Furthermore, we compare the performance of the proposed combination approach with existing forecast combination approaches. A comprehensive evaluation is conducted using a real-world residential PV power data set measured at 25 houses located in three locations in the United States. The results across four different resolutions and four different horizons show that the PSO-based forecast combination approach outperforms the use of any individual forecast model and other forecast combination counterparts, with an average Mean Absolute Scaled Error reduction by 3.81% compared to the best performing individual model. Our approach enables a solar forecaster to produce accurate forecasts for their application regardless of the forecast resolution or horizon.  ( 3 min )
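    A minimal sketch of the PSO-based combination idea follows: each particle is a weight vector over the individual models' forecasts, fitness is the mean absolute error on held-out data, and the weights are clipped to be nonnegative and normalized to sum to one. The toy forecasts and all hyperparameters are illustrative, not the paper's setup.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_models, T = 5, 200
    truth = np.sin(np.linspace(0, 20, T))
    forecasts = truth[None, :] + 0.3 * rng.normal(size=(n_models, T))

    def mae(w):
        w = np.clip(w, 0, None)
        w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1 / len(w))
        return np.abs(w @ forecasts - truth).mean()

    n_particles, iters = 30, 100
    pos = rng.uniform(size=(n_particles, n_models))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([mae(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()

    for _ in range(iters):
        r1, r2 = rng.uniform(size=(2, n_particles, n_models))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([mae(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()

    print("combined MAE:", mae(gbest),
          "best single-model MAE:", min(np.abs(f - truth).mean() for f in forecasts))
    ```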
    Meta Reinforcement Learning with Finite Training Tasks -- a Density Estimation Approach. (arXiv:2206.10716v1 [cs.LG])
    In meta reinforcement learning (meta RL), an agent learns from a set of training tasks how to quickly solve a new task, drawn from the same task distribution. The optimal meta RL policy, a.k.a. the Bayes-optimal behavior, is well defined, and guarantees optimal reward in expectation, taken with respect to the task distribution. The question we explore in this work is how many training tasks are required to guarantee approximately optimal behavior with high probability. Recent work provided the first such PAC analysis for a model-free setting, where a history-dependent policy was learned from the training tasks. In this work, we propose a different approach: directly learn the task distribution, using density estimation techniques, and then train a policy on the learned task distribution. We show that our approach leads to bounds that depend on the dimension of the task distribution. In particular, in settings where the task distribution lies in a low-dimensional manifold, we extend our analysis to use dimensionality reduction techniques and account for such structure, obtaining significantly better bounds than previous work, which strictly depend on the number of states and actions. The key of our approach is the regularization implied by the kernel density estimation method. We further demonstrate that this regularization is useful in practice, when "plugged into" the state-of-the-art VariBAD meta RL algorithm.  ( 2 min )
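    The density-estimation step itself is standard: fit a kernel density estimator to the training tasks' parameters and draw new tasks from the smoothed distribution for policy training. A minimal sketch, assuming tasks are parameterized by a 2-D goal position (an illustrative choice, not the paper's benchmarks):

    ```python
    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    # Each training task is parameterized here by a 2-D goal position.
    train_task_params = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(20, 2))

    kde = KernelDensity(kernel="gaussian", bandwidth=0.2)
    kde.fit(train_task_params)

    # Sample tasks from the *learned* task distribution for policy training;
    # the KDE bandwidth provides the regularization the analysis relies on.
    sampled_tasks = kde.sample(1000, random_state=0)
    print(sampled_tasks.mean(axis=0))
    ```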
    Efficient and effective training of language and graph neural network models. (arXiv:2206.10781v1 [cs.LG])
    Can we combine heterogeneous graph structure with text to learn high-quality semantic and behavioural representations? Graph neural networks (GNNs) encode numerical node attributes and graph structure to achieve impressive performance in a variety of supervised learning tasks. Current GNN approaches are challenged by textual features, which typically need to be encoded to a numerical vector before being provided to the GNN, which may incur some information loss. In this paper, we put forth an efficient and effective framework termed language model GNN (LM-GNN) to jointly train large-scale language models and graph neural networks. The effectiveness of our framework is achieved by applying stage-wise fine-tuning of the BERT model, first with heterogeneous graph information and then with a GNN model. Several system and design optimizations are proposed to enable scalable and efficient training. LM-GNN accommodates node and edge classification as well as link prediction tasks. We evaluate the LM-GNN framework on different datasets and showcase the effectiveness of the proposed approach. LM-GNN provides competitive results in an Amazon query-purchase-product application.  ( 2 min )
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v3 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.  ( 2 min )
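    The diagnostic behind this observation is easy to reproduce: compute the eigenspectrum of the input second-moment matrix and inspect how many decades it spans. The sketch below uses random Gaussian data purely as a stand-in; on a real image dataset the spectrum would show the sharp initial drop followed by the exponentially wide, uniformly populated tail described above.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 784))          # stand-in for flattened images
    X = X - X.mean(axis=0)
    corr = X.T @ X / len(X)                   # input second-moment matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted descending

    log_eigs = np.log10(np.maximum(eigvals, 1e-20))
    print("decades spanned by the spectrum:", log_eigs[0] - log_eigs[-1])
    ```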
    Explain to Not Forget: Defending Against Catastrophic Forgetting with XAI. (arXiv:2205.01929v4 [cs.LG] UPDATED)
    The ability to continuously process and retain new information like we do naturally as humans is a feat that is highly sought after when training neural networks. Unfortunately, traditional optimization algorithms often require large amounts of data to be available during training, and updates with respect to new data are difficult after the training process has been completed. In fact, when new data or tasks arise, previous progress may be lost, as neural networks are prone to catastrophic forgetting. Catastrophic forgetting describes the phenomenon in which a neural network completely forgets previous knowledge when given new information. We propose a novel training algorithm called training by explaining, in which we leverage Layer-wise Relevance Propagation in order to retain the information a neural network has already learned in previous tasks when training on new data. The method is evaluated on a range of benchmark datasets as well as more complex data. Our method not only successfully retains the knowledge of old tasks within the neural networks but does so more resource-efficiently than other state-of-the-art solutions.  ( 3 min )
    Derivative-Informed Neural Operator: An Efficient Framework for High-Dimensional Parametric Derivative Learning. (arXiv:2206.10745v1 [math.NA])
    Neural operators have gained significant attention recently due to their ability to approximate high-dimensional parametric maps between function spaces. At present, only parametric function approximation has been addressed in the neural operator literature. In this work we investigate incorporating parametric derivative information in neural operator training; this information can improve function approximations, and additionally it can be used to improve the approximation of the derivative with respect to the parameter, which is often the key to scalable solution of high-dimensional outer-loop problems (e.g. Bayesian inverse problems). Parametric Jacobian information is formally intractable to incorporate due to its high dimensionality; to address this concern, we propose strategies based on reduced SVD, randomized sketching, and the use of reduced basis surrogates. All of these strategies require only $O(r)$ Jacobian actions to construct sample Jacobian data, and allow us to reduce the linear algebra and memory costs associated with the Jacobian training from the product of the input and output dimensions down to $O(r^2)$, where $r$ is the dimensionality associated with the dimension reduction technique. Numerical results for parametric PDE problems demonstrate that the addition of derivative information to the training problem can significantly improve the parametric map approximation, particularly given few data. When Jacobian actions are inexpensive compared to the parametric map, this information can be economically substituted for parametric map data. Additionally we show that Jacobian error approximations improve significantly with the introduction of Jacobian training data. This result opens the door to the use of derivative-informed neural operators (DINOs) in outer-loop algorithms, where they can amortize the additional training data cost via repeated evaluations.
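    The randomized-sketching strategy can be illustrated with forward-mode Jacobian-vector products: instead of forming the full Jacobian, probe it with $r$ random directions, costing $O(r)$ Jacobian actions. A minimal sketch with a stand-in parametric map (illustrative, not the paper's PDE setting):

    ```python
    import torch
    from torch.autograd.functional import jvp

    def f(m):  # stand-in parametric map, R^20 -> R^15
        A = torch.arange(300.).reshape(15, 20) / 300.
        return torch.tanh(A @ m)

    m = torch.randn(20)
    r = 4
    omega = torch.randn(20, r) / r**0.5   # random probe directions
    # Each jvp call is one Jacobian action; r of them give the sketch J(m) @ Omega.
    sketch = torch.stack([jvp(f, (m,), (omega[:, i],))[1] for i in range(r)],
                         dim=1)
    print(sketch.shape)                   # (15, r): a low-rank Jacobian sketch
    ```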
    MASER: Multi-Agent Reinforcement Learning with Subgoals Generated from Experience Replay Buffer. (arXiv:2206.10607v1 [cs.LG])
    In this paper, we consider cooperative multi-agent reinforcement learning (MARL) with sparse reward. To tackle this problem, we propose a novel method named MASER: MARL with subgoals generated from the experience replay buffer. Under the widely-used assumption of centralized training with decentralized execution and consistent Q-value decomposition for MARL, MASER automatically generates proper subgoals for multiple agents from the experience replay buffer by considering both individual Q-values and the total Q-value. Then, MASER designs an individual intrinsic reward for each agent based on actionable representations relevant to Q-learning so that the agents reach their subgoals while maximizing the joint action value. Numerical results show that MASER significantly outperforms other state-of-the-art MARL algorithms on the StarCraft II micromanagement benchmark.  ( 2 min )
    A Survey on Computational Intelligence-based Transfer Learning. (arXiv:2206.10593v1 [cs.AI])
    The goal of transfer learning (TL) is to provide a framework for exploiting knowledge acquired from source data on target data. Compared to traditional machine learning approaches, transfer learning approaches are capable of better modeling the data patterns of the current domain. However, vanilla TL still stands to gain from the performance improvements offered by computational intelligence. This paper studies computational intelligence-based transfer learning techniques and categorizes them into neural network-based, evolutionary algorithm-based, swarm intelligence-based and fuzzy logic-based transfer learning.  ( 2 min )
    Differentially Private Maximal Information Coefficients. (arXiv:2206.10685v1 [cs.CR])
    The Maximal Information Coefficient (MIC) is a powerful statistic to identify dependencies between variables. However, it may be applied to sensitive data, and publishing it could leak private information. As a solution, we present algorithms to approximate MIC in a way that provides differential privacy. We show that the natural application of the classic Laplace mechanism yields insufficient accuracy. We therefore introduce the MICr statistic, which is a new MIC approximation that is more compatible with differential privacy. We prove MICr is a consistent estimator for MIC, and we provide two differentially private versions of it. We perform experiments on a variety of real and synthetic datasets. The results show that the private MICr statistics significantly outperform direct application of the Laplace mechanism. Moreover, experiments on real-world datasets show accuracy that is usable when the sample size is at least moderately large.  ( 2 min )
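    For context, the Laplace-mechanism baseline that the paper starts from adds Laplace($\Delta/\epsilon$) noise to a statistic with sensitivity $\Delta$. The sketch below is purely illustrative: the `sensitivity` value is a hypothetical placeholder, since bounding (and taming) MIC's sensitivity is exactly where the paper's MICr construction comes in.

    ```python
    import numpy as np

    def laplace_release(stat_value, sensitivity, epsilon, rng):
        # Classic Laplace mechanism: epsilon-DP release of a real-valued statistic.
        return stat_value + rng.laplace(scale=sensitivity / epsilon)

    rng = np.random.default_rng(0)
    epsilon = 1.0
    mic_value = 0.42        # pretend this came from an MIC computation (MIC in [0, 1])
    sensitivity = 0.01      # hypothetical placeholder; a real bound needs analysis
    print(laplace_release(mic_value, sensitivity, epsilon, rng))
    ```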
    Can Foundation Models Talk Causality?. (arXiv:2206.10591v1 [cs.AI])
    Foundation models are subject to an ongoing heated debate, leaving open the question of progress towards AGI and dividing the community into two camps: the ones who see the arguably impressive results as evidence for the scaling hypothesis, and the others who are worried about the lack of interpretability and reasoning capabilities. By investigating the extent to which causal representations might be captured by these large-scale language models, we make a humble effort towards resolving the ongoing philosophical conflicts.  ( 2 min )
    Asymmetric Learned Image Compression with Multi-Scale Residual Block, Importance Map, and Post-Quantization Filtering. (arXiv:2206.10618v1 [eess.IV])
    Recently, deep learning-based image compression has made significant progress, and has achieved better rate-distortion (R-D) performance than the latest traditional method, H.266/VVC, in both subjective metrics and the more challenging objective metrics. However, a major problem is that many leading learned schemes cannot maintain a good trade-off between performance and complexity. In this paper, we propose an efficient and effective image coding framework, which achieves similar R-D performance with lower complexity than the state of the art. First, we develop an improved multi-scale residual block (MSRB) that can expand the receptive field and more easily obtain global information. It can further capture and reduce the spatial correlation of the latent representations. Second, a more advanced importance map network is introduced to adaptively allocate bits to different regions of the image. Third, we apply a 2D post-quantization filter (PQF) to reduce the quantization error, motivated by the Sample Adaptive Offset (SAO) filter in video coding. Moreover, we find that the complexity of the encoder and decoder have different effects on image compression performance. Based on this observation, we design an asymmetric paradigm, in which the encoder employs three stages of MSRBs to improve the learning capacity, whereas the decoder only needs one stage of MSRB to yield satisfactory reconstruction, thereby reducing the decoding complexity without sacrificing performance. Experimental results show that compared to the state-of-the-art method, the encoding and decoding times of the proposed method are about 17 times faster, and the R-D performance is reduced by less than 1% on both the Kodak and Tecnick datasets, which is still better than H.266/VVC (4:4:4) and other recent learning-based methods. Our source code is publicly available at https://github.com/fengyurenpingsheng.  ( 3 min )
    On the Maximum Hessian Eigenvalue and Generalization. (arXiv:2206.10654v1 [cs.LG])
    The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remains a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch-normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.  ( 2 min )
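    For readers reproducing such measurements, $\lambda_{max}$ is typically estimated by power iteration on Hessian-vector products, which autograd provides without ever forming the Hessian. A minimal sketch on a tiny least-squares model (sizes and iteration count are illustrative):

    ```python
    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(10, 1)
    X, y = torch.randn(64, 10), torch.randn(64, 1)
    params = list(model.parameters())

    def hvp(v):
        """Hessian-vector product of the loss at the current parameters."""
        loss = torch.nn.functional.mse_loss(model(X), y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat = torch.cat([g.reshape(-1) for g in grads])
        gv = (flat * v).sum()
        hv = torch.autograd.grad(gv, params)
        return torch.cat([h.reshape(-1) for h in hv])

    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    for _ in range(50):          # power iteration
        v = hvp(v)
        v = v / v.norm()
    lam_max = (v * hvp(v)).sum().item()   # Rayleigh quotient at convergence
    print("estimated lambda_max:", lam_max)
    ```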
    Artificial intelligence system based on multi-value classification of fully connected neural network for construction management. (arXiv:2206.10604v1 [cs.LG])
    This study is devoted to the problem of determining the professional adaptive capabilities of construction management staff using artificial intelligence systems. A fully connected feed-forward neural network architecture is proposed, and empirical modeling was performed to create a dataset. The model of the artificial intelligence system allows evaluating the processes in a fully connected feed-forward neural network during the execution of multi-value classification of professional areas. A method has been developed for the training process of a machine learning model, which reflects the internal connections between the components of an artificial intelligence system that allow it to learn from training data. To train the neural network, a dataset of 35 input parameters and 29 output parameters was used; the amount of data in the set is 936 data lines. For neural network training, the data were split in proportions of 10% and 90%, respectively. The results of this study can be used to further improve the knowledge and skills necessary for successful professional realization.  ( 2 min )
    Generating Diverse Indoor Furniture Arrangements. (arXiv:2206.10608v1 [cs.LG])
    We present a method for generating arrangements of indoor furniture from human-designed furniture layout data. Our method creates arrangements that target specified diversity, such as the total price of all furniture in the room and the number of pieces placed. To generate realistic furniture arrangements, we train a generative adversarial network (GAN) on human-designed layouts. To target specific diversity in the arrangements, we optimize the latent space of the GAN via a quality diversity algorithm to generate a diverse arrangement collection. Experiments show our approach discovers a set of arrangements that are similar to human-designed layouts but vary in price and number of furniture pieces.  ( 2 min )
    Epicasting: An Ensemble Wavelet Neural Network (EWNet) for Forecasting Epidemics. (arXiv:2206.10696v1 [cs.LG])
    Infectious diseases remain among the top contributors to human illness and death worldwide, and many of them produce epidemic waves of infection. The unavailability of specific drugs and ready-to-use vaccines to prevent most of these epidemics makes the situation worse. This forces public health officials, health care providers, and policymakers to rely on early warning systems generated by reliable and accurate forecasts of epidemics. Accurate forecasts of epidemics can assist stakeholders in tailoring countermeasures, such as vaccination campaigns, staff scheduling, and resource allocation, to the situation at hand, which could translate to reductions in the impact of a disease. Unfortunately, most of these past epidemics (e.g., dengue, malaria, hepatitis, influenza, and most recently, Covid-19) exhibit nonlinear and non-stationary characteristics due to their spreading fluctuations based on seasonal-dependent variability and the nature of these epidemics. We analyze a wide variety of epidemic time series datasets using a maximal overlap discrete wavelet transform (MODWT) based autoregressive neural network and call it EWNet. MODWT techniques effectively characterize non-stationary behavior and seasonal dependencies in the epidemic time series and improve the forecasting scheme of the autoregressive neural network in the proposed ensemble wavelet network framework. From a nonlinear time series viewpoint, we explore the asymptotic stationarity of the proposed EWNet model to show the asymptotic behavior of the associated Markov chain. We also theoretically investigate the effect of learning stability and the choice of hidden neurons in the proposed EWNet model. From a practical perspective, we compare our proposed EWNet framework with several statistical, machine learning, and deep learning models that have been previously used for epidemic forecasting.  ( 3 min )
    ConTraNet: A single end-to-end hybrid network for EEG-based and EMG-based human machine interfaces. (arXiv:2206.10677v1 [q-bio.NC])
    Objective: Electroencephalography (EEG) and electromyography (EMG) are two non-invasive bio-signals, which are widely used in human machine interface (HMI) technologies (the EEG-HMI and EMG-HMI paradigms) for the rehabilitation of physically disabled people. Successful decoding of EEG and EMG signals into the respective control commands is a pivotal step in the rehabilitation process. Recently, several convolutional neural network (CNN) based architectures have been proposed that directly map the raw time-series signal into decision space, with the processes of meaningful feature extraction and classification performed simultaneously. However, these networks are tailored to learn the expected characteristics of the given bio-signal and are limited to a single paradigm. In this work, we address the question of whether we can build a single architecture that is able to learn distinct features from different HMI paradigms and still successfully classify them. Approach: We introduce a single hybrid model called ConTraNet, which is based on CNN and Transformer architectures and is equally useful for the EEG-HMI and EMG-HMI paradigms. ConTraNet uses a CNN block to introduce inductive bias in the model and learn local dependencies, whereas the Transformer block uses the self-attention mechanism to learn the long-range dependencies in the signal, which are crucial for the classification of EEG and EMG signals. Main results: We evaluated and compared ConTraNet with state-of-the-art methods on three publicly available datasets which belong to the EEG-HMI and EMG-HMI paradigms. ConTraNet outperformed its counterparts in all the different category tasks (2-class, 3-class, 4-class, and 10-class decoding tasks). Significance: The results suggest that ConTraNet is robust in learning distinct features from different HMI paradigms and generalizes well compared to the current state-of-the-art algorithms.  ( 3 min )
    Demystifying the Base and Novel Performances for Few-shot Class-incremental Learning. (arXiv:2206.10596v1 [cs.LG])
    Few-shot class-incremental learning (FSCIL) addresses challenging real-world scenarios where unseen novel classes continually arrive with few samples. In these scenarios, it is required to develop a model that recognizes the novel classes without forgetting prior knowledge. In other words, FSCIL aims to maintain the base performance and improve the novel performance simultaneously. However, there has been little study investigating the two performances separately. In this paper, we first decompose the entire model into four types of parameters and demonstrate that the tendency of the two performances varies greatly with the updated parameters when novel classes appear. Based on this analysis, we propose a simple method for FSCIL, coined NoNPC, which uses normalized prototype classifiers without further training for incremental novel classes. It is shown that our straightforward method has comparable performance with the sophisticated state-of-the-art algorithms.  ( 2 min )
    The Right Tool for the Job: Open-Source Auditing Tools in Machine Learning. (arXiv:2206.10613v1 [cs.LG])
    In recent years, discussions about fairness in machine learning, AI ethics and algorithm audits have increased. Many entities have developed framework guidance to establish a baseline rubric for fairness and accountability. However, in spite of increased discussions and multiple frameworks, algorithm and data auditing still remain difficult to execute in practice. Many open-source auditing tools are available, but users aren't always aware of the tools, what they are useful for, or how to access them. Model auditing and evaluation are not frequently emphasized skills in machine learning. There are also legal reasons for the proactive adoption of these tools that extend beyond the desire for greater fairness in machine learning. There are positive social issues of public perception and goodwill that matter in our highly connected global society. Greater awareness of these tools and the reasons for actively utilizing them may be helpful to the entire continuum of programmers, data scientists, engineers, researchers, users and consumers of AI and machine learning products. It is important for everyone to better understand the input and output differentials, how they are occurring, and what can be done to promote FATE (fairness, accountability, transparency, and ethics) in machine- and deep learning. The ability to freely access open-source auditing tools removes barriers to fairness assessment at the most basic levels of machine learning. This paper aims to reinforce the urgent need to actually use these tools and provides motivations for doing so. The exemplary tools highlighted herein are open-source with software or code-base repositories available that can be used immediately by anyone worldwide.  ( 3 min )
    CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework. (arXiv:2206.10620v1 [cs.LG])
    There is a growing demand for shifting the delivery of AI capability from data centers on the cloud to edge or end devices, exemplified by the fast emerging real-time AI-based apps running on smartphones, AR/VR devices, autonomous vehicles, and various IoT devices. The shift has however been seriously hampered by the large growing gap between DNN computing demands and the computing power on edge or end devices. This article presents the design of XGen, an optimizing framework for DNN designed to bridge the gap. XGen takes cross-cutting co-design as its first-order consideration. Its full-stack AI-oriented optimizations consist of a number of innovative optimizations at every layer of the DNN software stack, all designed in a cooperative manner. The unique technology makes XGen able to optimize various DNNs, including those with an extreme depth (e.g., BERT, GPT, other transformers), and generate code that runs several times faster than those from existing DNN frameworks, while delivering the same level of accuracy.  ( 2 min )
    Good Time to Ask: A Learning Framework for Asking for Help in Embodied Visual Navigation. (arXiv:2206.10606v1 [cs.LG])
    In reality, it is often more efficient to ask for help than to search the entire space to find an object with an unknown location. We present a learning framework that enables an agent to actively ask for help in such embodied visual navigation tasks, where the feedback informs the agent of where the goal is in its view. To emulate the real-world scenario that a teacher may not always be present, we propose a training curriculum where feedback is not always available. We formulate an uncertainty measure of where the goal is and use empirical results to show that through this approach, the agent learns to ask for help effectively while remaining robust when feedback is not available.  ( 2 min )
    Identifying Electrocardiogram Abnormalities Using a Handcrafted-Rule-Enhanced Neural Network. (arXiv:2206.10592v1 [cs.AI])
    A large number of people suffer from life-threatening cardiac abnormalities, and electrocardiogram (ECG) analysis is beneficial to determining whether an individual is at risk of such abnormalities. Automatic ECG classification methods, especially the deep learning based ones, have been proposed to detect cardiac abnormalities using ECG records, showing good potential to improve clinical diagnosis and help early prevention of cardiovascular diseases. However, the predictions of the known neural networks still do not satisfactorily meet the needs of clinicians, and this phenomenon suggests that some information used in clinical diagnosis may not be well captured and utilized by these methods. In this paper, we introduce some rules into convolutional neural networks, which help present clinical knowledge to deep learning based ECG analysis, in order to improve automated ECG diagnosis performance. Specifically, we propose a Handcrafted-Rule-enhanced Neural Network (called HRNN) for ECG classification with standard 12-lead ECG input, which consists of a rule inference module and a deep learning module. Experiments on two large-scale public ECG datasets show that our new approach considerably outperforms existing state-of-the-art methods. Further, our proposed approach not only can improve the diagnosis performance, but also can assist in detecting mislabelled ECG samples. Our codes are available at https://github.com/alwaysbyx/ecg_processing.  ( 2 min )
    Autoencoder-based Attribute Noise Handling Method for Medical Data. (arXiv:2206.10609v1 [cs.LG])
    Medical datasets are particularly subject to attribute noise, that is, missing and erroneous values. Attribute noise is known to be largely detrimental to learning performance. To maximize future learning performance, it is essential to deal with attribute noise before any inference. We propose a simple autoencoder-based preprocessing method that can correct mixed-type tabular data corrupted by attribute noise. No other method currently exists to handle attribute noise in tabular data. We experimentally demonstrate that our method outperforms both state-of-the-art imputation methods and noise correction methods on several real-world medical datasets.  ( 2 min )
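    A minimal sketch of the idea, under simplifying assumptions (purely numeric, correlated synthetic columns; zeros as the missing-value sentinel; a known corruption mask at training time, which the paper's method does not require in this form): train an autoencoder to reconstruct the trusted entries and use its output to fill in the corrupted ones.

    ```python
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n, d = 500, 12
    clean = torch.randn(n, 3) @ torch.randn(3, d)   # correlated tabular columns
    mask = torch.rand(n, d) < 0.2                   # 20% attribute noise
    noisy = torch.where(mask, torch.zeros_like(clean), clean)

    ae = nn.Sequential(nn.Linear(d, 6), nn.ReLU(), nn.Linear(6, d))
    opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        recon = ae(noisy)
        loss = ((recon - noisy)[~mask] ** 2).mean()  # learn from trusted entries
        loss.backward()
        opt.step()

    # Replace the corrupted entries with the autoencoder's reconstructions.
    corrected = torch.where(mask, ae(noisy).detach(), noisy)
    err_before = ((noisy - clean)[mask] ** 2).mean()
    err_after = ((corrected - clean)[mask] ** 2).mean()
    print(f"corrupted-entry MSE: {err_before:.3f} -> {err_after:.3f}")
    ```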
    Metareview-informed Explainable Cytokine Storm Detection during CAR-T cell Therapy. (arXiv:2206.10612v1 [q-bio.QM])
    Cytokine release syndrome (CRS), also known as cytokine storm, is one of the most consequential adverse effects of chimeric antigen receptor therapies that have shown promising results in cancer treatment. When emerging, CRS can be identified by the analysis of specific cytokine and chemokine profiles that tend to exhibit similarities across patients. In this paper, we exploit these similarities using machine learning algorithms and set out to pioneer a meta-review-informed method for the identification of CRS based on specific cytokine peak concentrations and evidence from previous clinical studies. We argue that such methods could support clinicians in analyzing suspect cytokine profiles by matching them against CRS knowledge from past clinical studies, with the ultimate aim of swift CRS diagnosis. During evaluation with real-world CRS clinical data, we emphasize the potential of our proposed method to produce interpretable results, in addition to being effective in identifying the onset of cytokine storm.  ( 2 min )
    Neural Activation Patterns (NAPs): Visual Explainability of Learned Concepts. (arXiv:2206.10611v1 [cs.LG])
    A key to deciphering the inner workings of neural networks is understanding what a model has learned. Promising methods for discovering learned features are based on analyzing activation values, whereby current techniques focus on analyzing high activation values to reveal interesting features on a neuron level. However, analyzing high activation values limits layer-level concept discovery. We present a method that instead takes into account the entire activation distribution. By extracting similar activation profiles within the high-dimensional activation space of a neural network layer, we find groups of inputs that are treated similarly. These input groups represent neural activation patterns (NAPs) and can be used to visualize and interpret learned layer concepts. We release a framework with which NAPs can be extracted from pre-trained models and provide a visual introspection tool that can be used to analyze NAPs. We tested our method with a variety of networks and show how it complements existing methods for analyzing neural network activation values.  ( 2 min )
    Deep Inverse Reinforcement Learning for Route Choice Modeling. (arXiv:2206.10598v1 [cs.LG])
    Route choice modeling, i.e., the process of estimating the likely path that individuals follow during their journeys, is a fundamental task in transportation planning and demand forecasting. Classical methods generally adopt the discrete choice model (DCM) framework with linear utility functions and high-level route characteristics. While several recent studies have started to explore the applicability of deep learning for travel choice modeling, they are all path-based with relatively simple model architectures and cannot take advantage of detailed link-level features. Existing link-based models, while theoretically promising, are generally not as scalable or flexible enough to account for the destination characteristics. To address these issues, this study proposes a general deep inverse reinforcement learning (IRL) framework for link-based route choice modeling, which is capable of incorporating high-dimensional features and capturing complex relationships. Specifically, we adapt an adversarial IRL model to the route choice problem for efficient estimation of destination-dependent reward and policy functions. Experiment results based on taxi GPS data from Shanghai, China validate the improved performance of the proposed model over conventional DCMs and other imitation learning baselines, even for destinations unseen in the training data. We also demonstrate the model interpretability using explainable AI techniques. The proposed methodology provides a new direction for future development of route choice models. It is general and should be adaptable to other route choice problems across different modes and networks.  ( 2 min )
    Stop ordering machine learning algorithms by their explainability! A user-centered investigation of performance and explainability. (arXiv:2206.10610v1 [cs.LG])
    Machine learning algorithms enable advanced decision making in contemporary intelligent systems. Research indicates that there is a tradeoff between their model performance and explainability. Machine learning models with higher performance are often based on more complex algorithms and therefore lack explainability, and vice versa. However, there is little to no empirical evidence of this tradeoff from an end user perspective. We aim to provide empirical evidence by conducting two user experiments. Using two distinct datasets, we first measure the tradeoff for five common classes of machine learning algorithms. Second, we address the problem of end user perceptions of explainable artificial intelligence augmentations aimed at increasing the understanding of the decision logic of high-performing complex models. Our results diverge from the widespread assumption of a tradeoff curve and indicate that the tradeoff between model performance and explainability is much less gradual in the end user's perception. This is a stark contrast to the assumed inherent model interpretability. Further, we found the tradeoff to be situational, for example due to data complexity. The results of our second experiment show that while explainable artificial intelligence augmentations can be used to increase explainability, the type of explanation plays an essential role in end user perception.  ( 2 min )
    $C^*$-algebra Net: A New Approach Generalizing Neural Network Parameters to $C^*$-algebra. (arXiv:2206.09513v2 [stat.ML] UPDATED)
    We propose a new framework that generalizes the parameters of neural network models to $C^*$-algebra-valued ones. $C^*$-algebra is a generalization of the space of complex numbers. A typical example is the space of continuous functions on a compact space. This generalization enables us to combine multiple models continuously and use tools for functions such as regression and integration. Consequently, we can learn features of data efficiently and adapt the models to problems continuously. We apply our framework to practical problems such as density estimation and few-shot learning and show that our framework enables us to learn features of data even with a limited number of samples. Our new framework highlights the potential possibility of applying the theory of $C^*$-algebra to general neural network models.  ( 2 min )
    Tree-Guided Rare Feature Selection and Logic Aggregation with Electronic Health Records Data. (arXiv:2206.09107v1 [cs.LG] CROSS LISTED)
    Statistical learning with a large number of rare binary features is commonly encountered in analyzing electronic health records (EHR) data, especially in the modeling of disease onset with prior medical diagnoses and procedures. Dealing with the resulting highly sparse and large-scale binary feature matrix is notoriously challenging, as conventional methods may suffer from a lack of power in testing and inconsistency in model fitting, while machine learning methods may suffer from the inability to produce interpretable results or clinically meaningful risk factors. To improve EHR-based modeling and utilize the natural hierarchical structure of disease classification, we propose a tree-guided feature selection and logic aggregation approach for large-scale regression with rare binary features, in which dimension reduction is achieved through not only a sparsity pursuit but also an aggregation promoter with the logic operator of ``or''. We convert the combinatorial problem into a convex linearly-constrained regularized estimation, which enables scalable computation with theoretical guarantees. In a suicide risk study with EHR data, our approach is able to select and aggregate prior mental health diagnoses as guided by the diagnosis hierarchy of the International Classification of Diseases. By balancing the rarity and specificity of the EHR diagnosis records, our strategy improves both prediction and model interpretation. We identify important higher-level categories and subcategories of mental health conditions and simultaneously determine the level of specificity needed for each of them in predicting suicide risk.
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v2 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.
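    A minimal sketch of the combination idea, assuming held-out in-distribution data for calibration. The Bonferroni correction below is a simple stand-in for the paper's multiple-testing procedure, and `score_fns` is a hypothetical list of OOD statistics (larger meaning more anomalous):

    ```python
    import numpy as np

    def conformal_pvalue(calib_scores, test_score):
        """Conformal p-value: valid (super-uniform) under the null that the
        test point is exchangeable with the calibration set."""
        return (1 + np.sum(calib_scores >= test_score)) / (len(calib_scores) + 1)

    def is_ood(calib_points, test_point, score_fns, alpha=0.05):
        """Combine several statistics; Bonferroni keeps the probability of
        flagging an in-distribution sample below alpha."""
        pvals = []
        for score in score_fns:
            calib_scores = np.array([score(x) for x in calib_points])
            pvals.append(conformal_pvalue(calib_scores, score(test_point)))
        return min(pvals) * len(score_fns) < alpha   # True => declare OOD
    ```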
    Inference of Multiscale Gaussian Graphical Model. (arXiv:2202.05775v2 [stat.ML] UPDATED)
    Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, and psychometrics. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGMs is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method that simultaneously infers a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused-type lasso penalty. Results on real and synthetic data are presented.
    Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. (arXiv:2108.02717v2 [cs.LG] UPDATED)
    The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible -- there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity -- yielding a complexity which scales with the suboptimality gaps and the "reachability" of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.
    Nonparametric Multi-shape Modeling with Uncertainty Quantification. (arXiv:2206.09127v2 [stat.ML] UPDATED)
    The modeling and uncertainty quantification of closed curves is an important problem in the field of shape analysis, and can have significant ramifications for subsequent statistical tasks. Many of these tasks involve collections of closed curves, which often exhibit structural similarities at multiple levels. Modeling multiple closed curves in a way that efficiently incorporates such between-curve dependence remains a challenging problem. In this work, we propose and investigate a multiple-output (a.k.a. multi-output), multi-dimensional Gaussian process modeling framework. We illustrate the proposed methodological advances, and demonstrate the utility of meaningful uncertainty quantification, on several curve and shape-related tasks. This model-based approach not only addresses the problem of inference on closed curves (and their shapes) with kernel constructions, but also opens doors to nonparametric modeling of multi-level dependence for functional objects in general.
    Private and polynomial time algorithms for learning Gaussians and beyond. (arXiv:2111.11320v3 [stat.ML] UPDATED)
    We present a fairly general framework for reducing $(\varepsilon, \delta)$ differentially private (DP) statistical estimation to its non-private counterpart. As the main application of this framework, we give a polynomial time and $(\varepsilon,\delta)$-DP algorithm for learning (unrestricted) Gaussian distributions in $\mathbb{R}^d$. The sample complexity of our approach for learning the Gaussian up to total variation distance $\alpha$ is $\widetilde{O}(d^2/\alpha^2 + d^2\sqrt{\ln(1/\delta)}/\alpha \varepsilon + d\ln(1/\delta) / \alpha \varepsilon)$ matching (up to logarithmic factors) the best known information-theoretic (non-efficient) sample complexity upper bound due to Aden-Ali, Ashtiani, and Kamath (ALT'21). In an independent work, Kamath, Mouzakis, Singhal, Steinke, and Ullman (arXiv:2111.04609) proved a similar result using a different approach and with $O(d^{5/2})$ sample complexity dependence on $d$. As another application of our framework, we provide the first polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of (unrestricted) Gaussians with sample complexity $\widetilde{O}(d^{3.5})$. In another independent work, Kothari, Manurangsi, and Velingker (arXiv:2112.03548) also provided a polynomial time $(\varepsilon, \delta)$-DP algorithm for robust learning of Gaussians with sample complexity $\widetilde{O}(d^8)$.
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v2 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    MMD Aggregated Two-Sample Test. (arXiv:2110.15073v2 [stat.ML] UPDATED)
    We propose a novel nonparametric two-sample test based on the Maximum Mean Discrepancy (MMD), which is constructed by aggregating tests with different kernel bandwidths. This aggregation procedure, called MMDAgg, ensures that test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We work in the non-asymptotic framework, and prove that our aggregated test is minimax adaptive over Sobolev balls. Our guarantees are not restricted to a specific kernel, but hold for any product of one-dimensional translation invariant characteristic kernels which are absolutely and square integrable. Moreover, our results apply for popular numerical procedures to determine the test threshold, namely permutations and the wild bootstrap. Through numerical experiments on both synthetic and real-world datasets, we demonstrate that MMDAgg outperforms alternative state-of-the-art approaches to MMD kernel adaptation for two-sample testing.
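    A rough sketch of the aggregation idea above, with a Bonferroni-corrected level standing in for MMDAgg's weighted aggregation (an assumption on my part; the paper's procedure is more refined):

    ```python
    import numpy as np

    def mmd2_biased(X, Y, bw):
        """Biased MMD^2 estimate with a Gaussian kernel of bandwidth bw."""
        Z = np.vstack([X, Y])
        sq = np.sum(Z**2, axis=1)
        K = np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * bw**2))
        n = len(X)
        return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

    def mmdagg_sketch(X, Y, bandwidths, alpha=0.05, n_perm=200, seed=0):
        """Reject H0: P = Q if any bandwidth's statistic exceeds its
        permutation quantile at level alpha / len(bandwidths)."""
        rng = np.random.default_rng(seed)
        n, Z = len(X), np.vstack([X, Y])
        for bw in bandwidths:
            stat = mmd2_biased(X, Y, bw)
            perm = [mmd2_biased(Z[p[:n]], Z[p[n:]], bw)
                    for p in (rng.permutation(len(Z)) for _ in range(n_perm))]
            if stat > np.quantile(perm, 1 - alpha / len(bandwidths)):
                return True
        return False
    ```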
    Efficient Online Linear Control with Stochastic Convex Costs and Unknown Dynamics. (arXiv:2203.01170v2 [math.OC] UPDATED)
    We consider the problem of controlling an unknown linear dynamical system under a stochastic convex cost and full feedback of both the state and cost function. We present a computationally efficient algorithm that attains an optimal $\sqrt{T}$ regret-rate compared to the best stabilizing linear controller in hindsight. In contrast to previous work, our algorithm is based on the Optimism in the Face of Uncertainty paradigm. This results in a substantially improved computational complexity and a simpler analysis.
    Regression-based projection for learning Mori-Zwanzig operators. (arXiv:2205.05135v2 [math.DS] UPDATED)
    We propose to adopt statistical regression as the projection operator to enable data-driven learning of the operators in the Mori--Zwanzig formalism. We present a principled method to extract the Markov and memory operators for any regression model. We show that the choice of linear regression results in a recently proposed data-driven learning algorithm based on Mori's projection operator, which is a higher-order approximate Koopman learning method. We show that more expressive nonlinear regression models naturally fill in the gap between the highly idealized and computationally efficient Mori projection operator and the optimal yet computationally infeasible Zwanzig projection operator. We perform numerical experiments and extract the operators for an array of regression-based projections, including linear, polynomial, spline, and neural-network-based regressions, showing a progressive improvement as the complexity of the regression model increases. Our proposition provides a general framework to extract memory-dependent corrections and can be readily applied to an array of data-driven learning methods for stationary dynamical systems in the literature.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v3 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
    Least Squares Estimation Using Sketched Data with Heteroskedastic Errors. (arXiv:2007.07781v3 [stat.ML] UPDATED)
    Researchers may perform regressions using a sketch of data of size $m$ instead of the full sample of size $n$ for a variety of reasons. This paper considers the case when the regression errors do not have constant variance and heteroskedasticity robust standard errors would normally be needed for test statistics to provide accurate inference. We show that estimates using data sketched by random projections will behave `as if' the errors were homoskedastic. Estimation by random sampling would not have this property. The result arises because the sketched estimates in the case of random projections can be expressed as degenerate $U$-statistics, and under certain conditions, these statistics are asymptotically normal with homoskedastic variance. We verify that the conditions hold not only in the case of least squares regression when the covariates are exogenous, but also in instrumental variables estimation when the covariates are endogenous. The result implies that inference, including first-stage F tests for instrument relevance, can be simpler than the full sample case if the sketching scheme is appropriately chosen.
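    A small simulation of the headline claim (all dimensions and the noise model are arbitrary choices for the demo): sketch a heteroskedastic regression with a Gaussian random projection and compare the coefficient estimates.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m = 10_000, 5, 500

    X = rng.normal(size=(n, d))
    beta = np.arange(1.0, d + 1)
    e = rng.normal(size=n) * np.linalg.norm(X, axis=1)   # heteroskedastic errors
    y = X @ beta + e

    S = rng.normal(scale=1 / np.sqrt(m), size=(m, n))    # random-projection sketch
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    beta_sketch = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]
    print(beta_full)    # both are close to beta; per the result above, the
    print(beta_sketch)  # sketched errors behave 'as if' homoskedastic
    ```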
    Minimax Semiparametric Learning With Approximate Sparsity. (arXiv:1912.12213v4 [math.ST] UPDATED)
    This paper is about the feasibility and means of root-n consistently estimating linear, mean-square continuous functionals of a high dimensional, approximately sparse regression. Such objects include a wide variety of interesting parameters such as regression coefficients, average derivatives, and the average treatment effect. We give lower bounds on the convergence rate of estimators of a regression slope and an average derivative and find that these bounds are substantially larger than in a low dimensional, semiparametric setting. We also give debiased machine learners that are root-n consistent under either a minimal approximate sparsity condition or rate double robustness. These estimators improve on existing estimators in being root-n consistent under more general conditions than previously known.
    Discriminative Bayesian filtering lends momentum to the stochastic Newton method for minimizing log-convex functions. (arXiv:2104.12949v2 [stat.ML] UPDATED)
    To minimize the average of a set of log-convex functions, the stochastic Newton method iteratively updates its estimate using subsampled versions of the full objective's gradient and Hessian. We contextualize this optimization problem as sequential Bayesian inference on a latent state-space model with a discriminatively-specified observation process. Applying Bayesian filtering then yields a novel optimization algorithm that considers the entire history of gradients and Hessians when forming an update. We establish matrix-based conditions under which the effect of older observations diminishes over time, in a manner analogous to Polyak's heavy ball momentum. We illustrate various aspects of our approach with an example and review other relevant innovations for the stochastic Newton method.
    Algorithms that get old: the case of generative deep neural networks. (arXiv:2202.03008v2 [stat.ML] UPDATED)
    Generative deep neural networks used in machine learning, like Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs), produce new objects each time they are asked to do so, under the constraint that the new objects remain similar to some list of examples given as input. However, this behavior is unlike that of human artists, who change their style as time goes by and seldom return to their initial creations. We investigate a situation where VAEs are used to sample from a probability measure described by some empirical dataset. Based on recent works on Radon-Sobolev statistical distances, we propose a numerical paradigm, to be used in conjunction with a generative algorithm, that satisfies the two following requirements: the objects created do not repeat, and they evolve to fill the entire target probability measure.
    Noisy $\ell^{0}$-Sparse Subspace Clustering on Dimensionality Reduced Data. (arXiv:2206.11079v1 [stat.ML])
    Sparse subspace clustering methods with sparsity induced by $\ell^{0}$-norm, such as $\ell^{0}$-Sparse Subspace Clustering ($\ell^{0}$-SSC)~\citep{YangFJYH16-L0SSC-ijcv}, are demonstrated to be more effective than their $\ell^{1}$ counterparts such as Sparse Subspace Clustering (SSC)~\citep{ElhamifarV13}. However, the theoretical analysis of $\ell^{0}$-SSC is restricted to clean data that lie exactly in subspaces. Real data often suffer from noise and may lie close to subspaces. In this paper, we show that an optimal solution to the optimization problem of noisy $\ell^{0}$-SSC achieves the subspace detection property (SDP), a key element with which data from different subspaces are separated, under the deterministic and semi-random models. Our results provide a theoretical guarantee on the correctness of noisy $\ell^{0}$-SSC in terms of SDP on noisy data for the first time, which reveals the advantage of noisy $\ell^{0}$-SSC in terms of a much less restrictive condition on subspace affinity. In order to improve the efficiency of noisy $\ell^{0}$-SSC, we propose Noisy-DR-$\ell^{0}$-SSC, which provably recovers the subspaces on dimensionality reduced data. Noisy-DR-$\ell^{0}$-SSC first projects the data onto a lower dimensional space by random projection, then performs noisy $\ell^{0}$-SSC on the projected data for improved efficiency. Experimental results demonstrate the effectiveness of Noisy-DR-$\ell^{0}$-SSC.
    Langevin Monte Carlo for Contextual Bandits. (arXiv:2206.11254v1 [cs.LG])
    We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrates that directly sampling from the posterior is both computationally efficient and competitive in performance.
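    A minimal sketch of the LMC-TS idea for a linear contextual bandit with Gaussian rewards (step size, step count, and prior precision below are placeholder choices, not the paper's): run a few noisy gradient steps on the negative log posterior, then act greedily with respect to the resulting sample.

    ```python
    import numpy as np

    def lmc_ts_action(arms, X_hist, r_hist, n_steps=50, eta=0.01,
                      sigma2=1.0, lam=1.0, seed=0):
        """arms: (K, d) feature matrix; X_hist/r_hist: past features and rewards.
        Each iteration is one Langevin step targeting the posterior over w."""
        rng = np.random.default_rng(seed)
        w = np.zeros(arms.shape[1])
        for _ in range(n_steps):
            grad = lam * w                                   # Gaussian prior term
            if len(r_hist) > 0:
                grad += X_hist.T @ (X_hist @ w - r_hist) / sigma2
            # noisy gradient descent update = one Langevin step
            w = w - eta * grad + np.sqrt(2 * eta) * rng.normal(size=w.shape)
        return int(np.argmax(arms @ w))   # greedy w.r.t. the posterior sample
    ```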
    Model-free Representation Learning and Exploration in Low-rank MDPs. (arXiv:2102.07035v2 [cs.LG] UPDATED)
    The low rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.
    Scaling and Scalability: Provable Nonconvex Low-Rank Tensor Estimation from Incomplete Measurements. (arXiv:2104.14526v3 [cs.LG] UPDATED)
    Tensors, which provide a powerful and flexible model for representing multi-attribute data and multi-way interactions, play an indispensable role in modern data science across various fields in science and engineering. A fundamental task is to faithfully recover the tensor from highly incomplete measurements in a statistically and computationally efficient manner. Harnessing the low-rank structure of tensors in the Tucker decomposition, this paper develops a scaled gradient descent (ScaledGD) algorithm to directly recover the tensor factors with tailored spectral initializations, and shows that it provably converges at a linear rate independent of the condition number of the ground truth tensor for two canonical problems -- tensor completion and tensor regression -- as soon as the sample size is above the order of $n^{3/2}$ ignoring other parameter dependencies, where $n$ is the dimension of the tensor. This leads to an extremely scalable approach to low-rank tensor estimation compared with prior art, which suffers from at least one of the following drawbacks: extreme sensitivity to ill-conditioning, high per-iteration costs in terms of memory and computation, or poor sample complexity guarantees. To the best of our knowledge, ScaledGD is the first algorithm that achieves near-optimal statistical and computational complexities simultaneously for low-rank tensor completion with the Tucker decomposition. Our algorithm highlights the power of appropriate preconditioning in accelerating nonconvex statistical estimation, where the iteration-varying preconditioners promote desirable invariance properties of the trajectory with respect to the underlying symmetry in low-rank tensor factorization.
    Optimal transport meets noisy label robust loss and MixUp regularization for domain adaptation. (arXiv:2206.11180v1 [cs.CV])
    It is common in computer vision to be confronted with domain shift: images which have the same class but different acquisition conditions. In domain adaptation (DA), one wants to classify unlabeled target images using labeled source images. Unfortunately, deep neural networks trained on a source training set perform poorly on target images which do not belong to the training domain. One strategy to improve performance is to align the source and target image distributions in an embedded space using optimal transport (OT). However, OT can cause negative transfer, i.e. aligning samples with different labels, which leads to overfitting, especially in the presence of label shift between domains. In this work, we explain negative alignment as a noisy label assignment to target images and mitigate its effect by appropriate regularization. We propose to couple the MixUp regularization \citep{zhang2018mixup} with a loss that is robust to noisy labels in order to improve domain adaptation performance. We show in an extensive ablation study that the combination of the two techniques is critical to achieve improved performance. Finally, we evaluate our method, called \textsc{mixunbot}, on several benchmarks and real-world DA problems.
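    A sketch of the two ingredients, with the generalized cross-entropy loss of Zhang and Sabuncu (2018) standing in for "a loss that is robust to noisy labels" (the paper's exact choice may differ):

    ```python
    import torch
    import torch.nn.functional as F

    def mixup(x, y_onehot, alpha=0.2):
        """Standard MixUp: convex combinations of inputs and one-hot labels."""
        lam = torch.distributions.Beta(alpha, alpha).sample()
        perm = torch.randperm(x.size(0))
        return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

    def gce_loss(logits, y_soft, q=0.7):
        """Generalized cross-entropy: interpolates between cross-entropy
        (q -> 0) and the noise-robust MAE (q = 1)."""
        p = F.softmax(logits, dim=1)
        py = (p * y_soft).sum(dim=1).clamp_min(1e-6)
        return ((1 - py.pow(q)) / q).mean()
    ```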
    Active Learning with Safety Constraints. (arXiv:2206.11183v1 [cs.LG])
    Active learning methods have shown great promise in reducing the number of samples necessary for learning. As automated learning systems are adopted into real-time, real-world decision-making pipelines, it is increasingly important that such algorithms are designed with safety in mind. In this work we investigate the complexity of learning the best safe decision in interactive environments. We reduce this problem to a constrained linear bandits problem, where our goal is to find the best arm satisfying certain (unknown) safety constraints. We propose an adaptive experimental design-based algorithm, which we show efficiently trades off between the difficulty of showing an arm is unsafe vs suboptimal. To our knowledge, our results are the first on best-arm identification in linear bandits with safety constraints. In practice, we demonstrate that this approach performs well on synthetic and real world datasets.
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v1 [cs.LG])
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler--Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v1 [cs.LG])
    Missing values are unavoidable in many applications of machine learning and present a challenge both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, independent models do not make efficient use of all available data. Conversely, fitting a shared model to the full data set typically relies on imputation, which may be suboptimal when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which makes predictions that a) are robust to missing values at test time, b) maintain or improve the predictive power of pattern submodels, and c) have a short description, enabling improved interpretability. We identify cases where sharing is provably optimal, even when missingness itself is predictive and when the prediction target depends on unobserved variables. Classification and regression experiments on synthetic data and two healthcare data sets demonstrate that our models achieve a favorable trade-off between pattern specialization and information sharing.
    A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta. (arXiv:2206.11124v1 [cs.LG])
    Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and sizes of batches. Our key idea is to describe the loss value sequence in terms of its generating function, which can be written in a compact form assuming a diagonal approximation for the second moments of model weights. By analyzing this generating function, we deduce various conclusions on the convergence conditions, phase structure of the model, and optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase, at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate, however its value must be limited for any finite batch size to avoid divergence; 3) optimal convergence rate can be achieved at a negative momentum. We verify our theoretical predictions by extensive experiments with MNIST and synthetic problems, and find a good quantitative agreement.
    Discussion of `Multiscale Fisher's Independence Test for Multivariate Dependence'. (arXiv:2206.11142v1 [stat.ME])
    We discuss how MultiFIT, the Multiscale Fisher's Independence Test for Multivariate Dependence proposed by Gorsky and Ma (2022), compares to existing linear-time kernel tests based on the Hilbert-Schmidt independence criterion (HSIC). We highlight the fact that the levels of the kernel tests at any finite sample size can be controlled exactly, as it is the case with the level of MultiFIT. In our experiments, we observe some of the performance limitations of MultiFIT in terms of test power.
    Agent-based Graph Neural Networks. (arXiv:2206.11010v1 [cs.LG])
    We present a novel graph neural network we call AgentNet, which is designed specifically for graph-level tasks. AgentNet is inspired by sublinear algorithms, featuring a computational complexity that is independent of the graph size. The architecture of AgentNet differs fundamentally from the architectures of known graph neural networks. In AgentNet, some trained \textit{neural agents} intelligently walk the graph, and then collectively decide on the output. We provide an extensive theoretical analysis of AgentNet: We show that the agents can learn to systematically explore their neighborhood and that AgentNet can distinguish some structures that are even indistinguishable by 3-WL. Moreover, AgentNet is able to separate any two graphs which are sufficiently different in terms of subgraphs. We confirm these theoretical results with synthetic experiments on hard-to-distinguish graphs and real-world graph classification tasks. In both cases, we compare favorably not only to standard GNNs but also to computationally more expensive GNN extensions.
    Bregman Power k-Means for Clustering Exponential Family Data. (arXiv:2206.10860v1 [stat.ML])
    Recent progress in center-based clustering algorithms combats poor local minima by implicit annealing, using a family of generalized means. These methods are variations of Lloyd's celebrated $k$-means algorithm, and are most appropriate for spherical clusters such as those arising from Gaussian data. In this paper, we bridge these algorithmic advances to classical work on hard clustering under Bregman divergences, which enjoy a bijection to exponential family distributions and are thus well-suited for clustering objects arising from a breadth of data generating mechanisms. The elegant properties of Bregman divergences allow us to maintain closed form updates in a simple and transparent algorithm, and moreover lead to new theoretical arguments for establishing finite sample bounds that relax the bounded support assumption made in the existing state of the art. Additionally, we consider thorough empirical analyses on simulated experiments and a case study on rainfall data, finding that the proposed method outperforms existing peer methods in a variety of non-Gaussian data settings.
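    The closed-form updates rest on a classical fact (Banerjee et al., 2005): under any Bregman divergence, the divergence-minimizing centroid of a cluster is its plain mean. A bare-bones Lloyd loop under the KL (relative-entropy) divergence, suitable for positive count-like data; this sketch ignores the paper's annealing scheme and empty-cluster handling:

    ```python
    import numpy as np

    def kl_div(x, mu):
        """Bregman divergence generated by negative entropy (positive data)."""
        return np.sum(x * np.log(x / mu) - x + mu, axis=-1)

    def bregman_kmeans(X, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = np.stack([kl_div(X, c) for c in centers], axis=1)  # (n, k)
            labels = d.argmin(axis=1)
            # centroid update is the cluster mean for *any* Bregman divergence
            centers = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
        return labels, centers
    ```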
    Sharp Constants in Uniformity Testing via the Huber Statistic. (arXiv:2206.10722v1 [stat.ML])
    Uniformity testing is one of the most well-studied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on $m$ elements from any $\epsilon$-far distribution with $1-\delta$ probability is $n = \Theta\left(\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$, which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to $0$ or $\infty$. We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and non-uniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of $(1 + o(1))\frac{\sqrt{m \log (1/\delta)}}{\epsilon^2}$ in the regime where this term is dominant, unlike all other existing testers.
    On the Maximum Hessian Eigenvalue and Generalization. (arXiv:2206.10654v1 [cs.LG])
    The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remain a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $\lambda_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $\lambda_{max}$ and generalization. In this paper, we present findings that call $\lambda_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $\lambda_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $\lambda_{max}$ without affecting generalization; (3) while SAM produces smaller $\lambda_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $\lambda_{max}$; and (5) while batch-normalization does not consistently produce smaller $\lambda_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $\lambda_{max}$'s ability to explain generalization in neural networks.
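    For reference, $\lambda_{max}$ is typically estimated without ever forming the Hessian, via power iteration on Hessian-vector products; a sketch (assuming `loss` was computed from `params` in the current graph):

    ```python
    import torch

    def hessian_max_eigenvalue(loss, params, n_iter=20):
        """Power iteration for the largest Hessian eigenvalue using only
        Hessian-vector products (double backprop)."""
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]
        for _ in range(n_iter):
            hv = torch.autograd.grad(grads, params, grad_outputs=v,
                                     retain_graph=True)   # hv = H v
            norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
            v = [h / norm for h in hv]
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        return sum((h * u).sum() for h, u in zip(hv, v))   # Rayleigh quotient
    ```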
    SoccerCPD: Formation and Role Change-Point Detection in Soccer Matches Using Spatiotemporal Tracking Data. (arXiv:2206.10926v1 [stat.AP])
    In fluid team sports such as soccer and basketball, analyzing team formation is one of the most intuitive ways to understand tactics from domain participants' point of view. However, existing approaches either assume that team formation is consistent throughout a match or assign formations frame-by-frame, which disagree with real situations. To tackle this issue, we propose a change-point detection framework named SoccerCPD that distinguishes tactically intended formation and role changes from temporary changes in soccer matches. We first assign roles to players frame-by-frame and perform two-step change-point detections: (1) formation change-point detection based on the sequence of role-adjacency matrices and (2) role change-point detection based on the sequence of role permutations. The evaluation of SoccerCPD using the ground truth annotated by domain experts shows that our method accurately detects the points of tactical changes and estimates the formation and role assignment per segment. Lastly, we introduce practical use-cases that domain participants can easily interpret and utilize.
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v1 [cs.LG])
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.
    List-Decodable Covariance Estimation. (arXiv:2206.10942v1 [cs.DS])
    We give the first polynomial time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n\geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha)= (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu},\hat{\Sigma})$ such that the total variation distance $TV(\mathcal{N}(\mu_*,\Sigma_*),\mathcal{N}(\hat{\mu},\hat{\Sigma}))<1-O_{\alpha}(1)$. This is the statistically strongest notion of distance and implies multiplicative spectral and relative Frobenius distance approximation for parameters with dimension independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree 2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020) and Bakshi and Kothari (2020). These results need superpolynomial time for obtaining any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery that allows, in particular, to obtain $2^{-\mathsf{poly}(d)}$ error in polynomial-time. Our result also implies an improved algorithm for clustering non-spherical mixtures.
    Information Geometry of Dropout Training. (arXiv:2206.10936v1 [stat.ML])
    Dropout is one of the most popular regularization techniques in neural network training. Because of its power and the simplicity of its idea, dropout has been analyzed extensively and many variants have been proposed. In this paper, several properties of dropout are discussed in a unified manner from the viewpoint of information geometry. We show that dropout flattens the model manifold and that its regularization performance depends on the amount of curvature. We then show that dropout essentially corresponds to a regularization that depends on the Fisher information, and we support this result with numerical experiments. Such a theoretical analysis of the technique from a different perspective is expected to greatly assist in the understanding of neural networks, which are still in their infancy.
    Diagnostic Tool for Out-of-Sample Model Evaluation. (arXiv:2206.10982v1 [stat.ML])
    Assessment of model fitness is an important step in many problems. Models are typically fitted to training data by minimizing a loss function, such as the squared-error or negative log-likelihood, and it is natural to desire low losses on future data. This letter considers the use of a test data set to characterize the out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is computationally efficient and can be interpreted as an empirical quantile. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
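    One standard distribution-free construction in the spirit of the proposed tool (my own illustration, not necessarily the paper's exact statement): an order statistic of the test losses upper-bounds a chosen quantile of the out-of-sample loss with finite-sample probability $1-\delta$, using only the binomial distribution of ranks.

    ```python
    import numpy as np
    from scipy.stats import binom

    def loss_quantile_bound(test_losses, q=0.9, delta=0.05):
        """Return an upper bound on the q-quantile of the out-of-sample loss
        that holds with probability >= 1 - delta, assuming only i.i.d. test
        samples (no distributional assumptions)."""
        m = len(test_losses)
        k = int(binom.ppf(1 - delta, m, q)) + 1   # smallest valid rank
        if k > m:
            raise ValueError("not enough test samples for this (q, delta)")
        return np.sort(test_losses)[k - 1]
    ```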
    Beyond Uniform Lipschitz Condition in Differentially Private Optimization. (arXiv:2206.10713v1 [cs.LG])
    Most prior convergence results on differentially private stochastic gradient descent (DP-SGD) are derived under the simplistic assumption of uniform Lipschitzness, i.e., the per-sample gradients are uniformly bounded. This assumption is unrealistic in many problems, e.g., linear regression with Gaussian data. We relax uniform Lipschitzness by instead assuming that the per-sample gradients have \textit{sample-dependent} upper bounds, i.e., per-sample Lipschitz constants, which themselves may be unbounded. We derive new convergence results for DP-SGD on both convex and nonconvex functions when the per-sample Lipschitz constants have bounded moments. Furthermore, we provide principled guidance on choosing the clip norm in DP-SGD for convex settings satisfying our relaxed version of Lipschitzness, without making distributional assumptions on the Lipschitz constants. We verify the effectiveness of our recommendation via experiments on benchmarking datasets.
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v1 [cs.LG])
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.
    Sparse Kernel Gaussian Processes through Iterative Charted Refinement (ICR). (arXiv:2206.10634v1 [cs.LG])
    Gaussian Processes (GPs) are highly expressive, probabilistic models. A major limitation is their computational complexity. Naively, exact GP inference requires $\mathcal{O}(N^3)$ computations with $N$ denoting the number of modeled points. Current approaches to overcome this limitation either rely on sparse, structured or stochastic representations of data or kernel respectively and usually involve nested optimizations to evaluate a GP. We present a new, generative method named Iterative Charted Refinement (ICR) to model GPs on nearly arbitrarily spaced points in $\mathcal{O}(N)$ time for decaying kernels without nested optimizations. ICR represents long- as well as short-range correlations by combining views of the modeled locations at varying resolutions with a user-provided coordinate chart. In our experiment with points whose spacings vary over two orders of magnitude, ICR's accuracy is comparable to state-of-the-art GP methods. ICR outperforms existing methods in terms of computational speed by one order of magnitude on the CPU and GPU and has already been successfully applied to model a GP with $122$ billion parameters.
    Concentration inequalities and optimal number of layers for stochastic deep neural networks. (arXiv:2206.11241v1 [cs.LG])
    We state concentration and martingale inequalities for the output of the hidden layers of a stochastic deep neural network (SDNN), as well as for the output of the whole SDNN. These results allow us to introduce an expected classifier (EC) and to give a probabilistic upper bound for the classification error of the EC. We also state the optimal number of layers for the SDNN via an optimal stopping procedure. We apply our analysis to a stochastic version of a feedforward neural network with a ReLU activation function.
    Does the Data Induce Capacity Control in Deep Learning?. (arXiv:2110.14163v3 [cs.LG] UPDATED)
    We show that the input correlation matrix of typical classification datasets has an eigenspectrum where, after a sharp initial drop, a large number of small eigenvalues are distributed uniformly over an exponentially large range. This structure is mirrored in a network trained on this data: we show that the Hessian and the Fisher Information Matrix (FIM) have eigenvalues that are spread uniformly over exponentially large ranges. We call such eigenspectra "sloppy" because sets of weights corresponding to small eigenvalues can be changed by large magnitudes without affecting the loss. Networks trained on atypical datasets with non-sloppy inputs do not share these traits and deep networks trained on such datasets generalize poorly. Inspired by this, we study the hypothesis that sloppiness of inputs aids generalization in deep networks. We show that if the Hessian is sloppy, we can compute non-vacuous PAC-Bayes generalization bounds analytically. By exploiting our empirical observation that training predominantly takes place in the non-sloppy subspace of the FIM, we develop data-distribution dependent PAC-Bayes priors that lead to accurate generalization bounds using numerical optimization.
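    The claimed input eigenspectrum is easy to check on a typical dataset; a quick look using the `digits` data (any standard image dataset should show the same qualitative shape):

    ```python
    import numpy as np
    from sklearn.datasets import load_digits

    X = load_digits().data
    X = X - X.mean(axis=0)
    eig = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]  # input correlation spectrum
    eig = eig[eig > 1e-10]
    print(eig[:5] / eig[0])   # sharp initial drop
    print(eig[0] / eig[-1])   # remaining eigenvalues span many orders of magnitude
    ```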
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v3 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art adaptive KSD-based goodness-of-fit testing procedures.
    From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses. (arXiv:2205.07704v2 [stat.ML] UPDATED)
    We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $\Omega(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).
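    The quantile idea is easiest to see in the original bandit setting of Kaufmann et al. (2012), which Bayes-UCBVI lifts to the Q-function posterior; a sketch for Bernoulli arms (the quantile level $1 - 1/t$ is a simplification of the paper's schedule):

    ```python
    import numpy as np
    from scipy.stats import beta

    def bayes_ucb(means, horizon, seed=0):
        """Play the arm whose posterior upper quantile is largest."""
        rng = np.random.default_rng(seed)
        k = len(means)
        succ, fail = np.ones(k), np.ones(k)          # Beta(1, 1) priors
        for t in range(1, horizon + 1):
            ucb = beta.ppf(1 - 1.0 / t, succ, fail)  # posterior quantiles
            a = int(np.argmax(ucb))
            r = float(rng.random() < means[a])       # Bernoulli reward
            succ[a] += r
            fail[a] += 1 - r
        return succ / (succ + fail)                  # posterior means per arm
    ```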
    Cold Posteriors through PAC-Bayes. (arXiv:2206.11173v1 [cs.LG])
    We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter $\lambda$ which is not restricted to be $\lambda=1$. For both regression and classification tasks, in the case of isotropic Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures the cold posterior effect.
    Decentralized Gossip-Based Stochastic Bilevel Optimization over Communication Networks. (arXiv:2206.10870v1 [stat.ML])
    Bilevel optimization has gained growing interest, with numerous applications in meta learning, minimax games, reinforcement learning, and nested composition optimization. This paper studies distributed bilevel optimization over a network where agents can only communicate with neighbors, including examples from multi-task learning, multi-agent learning, and federated learning. We propose a gossip-based distributed bilevel learning algorithm that allows networked agents to solve both the inner and outer optimization problems in a single timescale and to share information via network propagation. We show that our algorithm enjoys the $\mathcal{O}(\frac{1}{K \epsilon^2})$ per-agent sample complexity for general nonconvex bilevel optimization and $\mathcal{O}(\frac{1}{K \epsilon})$ for strongly convex objectives, achieving a speedup that scales linearly with the network size. The sample complexities are optimal in both $\epsilon$ and $K$. We test our algorithm on the examples of hyperparameter tuning and decentralized reinforcement learning. Simulated experiments confirm that our algorithm achieves state-of-the-art training efficiency and test accuracy.
    Graph Neural Networks as Gradient Flows. (arXiv:2206.10991v1 [cs.LG])
    Dynamical systems minimizing an energy are ubiquitous in geometry and physics. We propose a gradient flow framework for GNNs where the equations follow the direction of steepest descent of a learnable energy. This approach allows us to explain the GNN evolution from a multi-particle perspective as learning attractive and repulsive forces in feature space via the positive and negative eigenvalues of a symmetric "channel-mixing" matrix. We perform a spectral analysis of the solutions and conclude that gradient flow graph convolutional models can induce a dynamics dominated by the graph high frequencies, which is desirable for heterophilic datasets. We also describe structural constraints on common GNN architectures allowing them to be interpreted as gradient flows. We perform thorough ablation studies corroborating our theoretical analysis and show competitive performance of simple and lightweight models on real-world homophilic and heterophilic datasets.
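    A minimal sketch of such a layer (the residual form and step size are my own simplifications): with a symmetric channel-mixing matrix $W$, the update $X \leftarrow X + \tau \hat{A} X W$ is a gradient-descent step on the energy $E(X) = -\tfrac{1}{2}\mathrm{tr}(X^\top \hat{A} X W)$, and the signs of the eigenvalues of $W$ set attraction vs. repulsion in feature space.

    ```python
    import torch
    import torch.nn as nn

    class GradientFlowLayer(nn.Module):
        def __init__(self, dim, tau=0.1):
            super().__init__()
            self.B = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
            self.tau = tau

        def forward(self, x, a_hat):
            """x: (n, dim) node features; a_hat: (n, n) normalized adjacency."""
            w = 0.5 * (self.B + self.B.T)   # symmetric channel-mixing matrix
            return x + self.tau * (a_hat @ x @ w)
    ```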


    Summary Papers in RL [D]
    I'm new to RL research and I find reading papers incredibly inefficient - I don't know if anyone else agrees. If you're new to the topic, a lot of the time the "Preliminaries" section isn't detailed enough for you to fully understand the problem. Other times, a new architecture idea is introduced, a bunch of (toy) experiments are run, and for someone with limited experience it's hard to gain any insight other than to say: maybe this architecture is indeed better, maybe it's due to the data/hyperparameters, or the improvement is marginal at best. Sometimes you see theoretical results that are very complicated, but it's hard to see their impact. It feels like a calculus student not being told the importance of the Fundamental Theorem of Calculus or Stokes' Theorem, and having to discern it for …
    Does the value of the reward matter?
    Hello, I'm just wondering: what effect does the value of the reward have on the learning process? For example, suppose the agent gets a reward of 100 if it solves a maze. How would the learning be affected if the reward were 1 or 1000000 instead? submitted by /u/AhmedNizam_ [link] [comments]
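    A quick way to see part of the answer to the question above: in tabular Q-learning with zero initialization, scaling every reward by a constant c scales every Q-value by c and leaves the greedy policy untouched; the scale starts to matter with function approximation, where it interacts with the learning rate, initialization, and any clipping. A toy chain MDP illustrating the tabular case (all details of the MDP are made up for the demo):

    ```python
    import numpy as np

    def q_learning(reward_scale, n_steps=5000, alpha=0.1, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        Q = np.zeros((5, 2))                 # 5-state chain, 2 actions
        s = 0
        for _ in range(n_steps):
            a = int(rng.integers(2))         # uniform exploration
            s2 = min(s + 1, 4) if a == 1 else 0
            r = reward_scale * (1.0 if s2 == 4 else 0.0)
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = 0 if s2 == 4 else s2
        return Q

    print(q_learning(1.0)[0])      # Q-values at the start state
    print(q_learning(100.0)[0])    # exactly 100x the above; same greedy policy
    ```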
    Value-based rl with advantage function in actor-critic setting
    Hi, I wonder why value-based actor-critic algorithms don't use the advantage function. I understand that the advantage function lowers the variance in the actor-critic setting, but why is it not usually adopted in value-based algorithms? If possible, could you point me to an RL algorithm that uses the advantage function? Thanks for reading. submitted by /u/Spiritual_Fig3632 [link] [comments]
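    For reference, the usual actor-critic use of the advantage, as a sketch: the one-step advantage $A = r + \gamma V(s') - V(s)$ replaces the raw return as the policy-gradient weight, which lowers variance, while the critic is trained on the same TD error.

    ```python
    import torch

    def a2c_losses(log_prob, value, next_value, reward, gamma=0.99):
        """One-step advantage actor-critic losses for a batch of transitions."""
        td_target = reward + gamma * next_value.detach()
        advantage = (td_target - value).detach()     # no gradient through critic
        policy_loss = -(advantage * log_prob).mean()
        value_loss = (td_target - value).pow(2).mean()
        return policy_loss, value_loss
    ```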
    Question on Score Function in Policy Gradient, looking for help on this question I had in r/learnmachinelearning
    submitted by /u/100M-900 [link] [comments]
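    For anyone landing here, the score-function (REINFORCE) trick in one line of PyTorch: since $\nabla_\theta \mathbb{E}[G] = \mathbb{E}[G \, \nabla_\theta \log \pi_\theta(a|s)]$, minimizing $-G \log \pi_\theta(a|s)$ gives an unbiased policy-gradient estimate. A sketch:

    ```python
    import torch

    def reinforce_loss(logits, actions, returns):
        """logits: (B, A); actions: (B,) long; returns: (B,) sampled returns G."""
        log_probs = torch.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        return -(returns * chosen).mean()   # gradient = score-function estimator
    ```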
    How to train the DRL model for Unmanned aerial vehicles?
    We train deep reinforcement learning models for IoT devices/unmanned aerial vehicles on GPUs, where we have enough resources. But what if we have to train the model on the IoT devices/UAVs themselves? Is it possible for a UAV to compute that model? submitted by /u/ShSalmanHassan [link] [comments]

    Exploring Octoparse for Data Preparations and Product Assessment
    In this article, let’s discuss one of the trendy and handy web-scraping tools, Octoparse: its key features and how to use it for our data-driven solutions. I hope you are all familiar with web scraping techniques, where the captured data is used to further analyze business insights. If you look at the end-to-end process…

    [P] less stress for the NHS
    Hi everyone, I am an intensive care nurse and I work for public health in the UK. I have recently taken up a job that involves aiding the transition to a new software system called Epic Healthcare. I should mention upfront that I have a clinical background but little to no experience with AI and machine learning, so sorry if I butcher terms and concepts in the following post. I am getting involved in the digital side of patient care, which I find really fascinating, although I realise it is a couple of decades behind in terms of technology and user interface. One of the problems I am facing at the moment is device integration. A lot of devices are used, especially in intensive care, and their parameters are fed into the electronic health record (EHR) at an hourly cadence in order to …
    [D] What is the current SOTA for open-source AutoML?
    I've never really used AutoML--I prefer to code up my models and data engineering by hand, but I'm beginning to wonder if I can use AutoML as a starting point, e.g., the built-in hyper-parameter optimization or NAS finds good neural network hyper-params/architectures for me, and I can build on that. With that in mind, what's the SOTA right now? Ideally, it would be as white-box as possible, telling me the models it tries, what worked and didn't, etc. Alternatively, what has worked best for you in your workflows? submitted by /u/FlyingQuokka [link] [comments]
    [P] Multidimensional array batch indexing for pytorch and numpy
Batch indexing into multidimensional tensors/arrays is kind of tricky. I made this project explaining the built-in syntax, and also made wrappers for simplifying the interface, with additional features for underlying coordinate-grid data (like signed distance functions) that needs to be indexed by coordinate value rather than by integer indices directly: https://github.com/LemonPi/multidim_indexing submitted by /u/LemonByte [link] [comments]  ( 83 min )
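For readers who haven't hit this before, the built-in syntax the project wraps looks roughly like this, a small sketch picking one (row, col) per batch element:

```python
import torch

B, H, W = 4, 5, 6
x = torch.randn(B, H, W)
rows = torch.tensor([0, 2, 4, 1])
cols = torch.tensor([5, 0, 3, 3])

# advanced indexing: one index tensor per dimension, broadcast together
picked = x[torch.arange(B), rows, cols]           # shape (B,)

# the same thing with gather, handy when indices carry extra dimensions
flat = x.view(B, -1)                               # (B, H*W)
picked2 = flat.gather(1, (rows * W + cols).unsqueeze(1)).squeeze(1)
assert torch.allclose(picked, picked2)
```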
    [P] Building a Source of Truth for Inventory with Disparate Data Sources
One of the most challenging shifts from food delivery to grocery is managing inventory. Although restaurant menu items can sometimes go out of stock, grocery store inventories have far more SKUs and many different ways to track their inventory levels. This complexity makes it a lot harder to ensure the items customers buy are actually available. Knowing what the ground truth is, so that customers can order groceries with confidence, is the subject of a new engineering blog post I wrote, "Building a Source of Truth for an Inventory with Disparate Data Sources". The article explains how we crowd-sourced our inventory data from a number of different sources, which enabled us to predict which items are likely still on the shelves when customers place an order. Take a look and let me know what you think submitted by /u/Relative_Collection1 [link] [comments]  ( 84 min )
    [R] Announcing DAMP 2.0: Allowing SOTA Anomaly Detection in Massive Time Series Datasets
Dear Colleagues We are happy to announce the release of DAMP 2.0 [a]. DAMP (Discord Aware Matrix Profile) is an anomaly detection framework that allows you to search datasets with millions or billions of datapoints, all on a conventional machine [b]. We are not normally so vainglorious as to announce the publication of a paper, however: 1) The code comes bundled with some great new anomaly detection datasets, and there is a real dearth of good datasets in the community (see [c]). 2) Some researchers are working on problems that use anomaly detection as a subroutine, and that is their main computational bottleneck. Because DAMP can be up to 10,000 times faster than other approaches, this may be of interest to the community. Best wishes, Yue [a] Matrix Profile XXIV: Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. Yue Lu, Renjie Wu, Abdullah Mueen, Maria A. Zuluaga and Eamonn Keogh. ACM SIGKDD 2022. https://www.cs.ucr.edu/~eamonn/DAMP_long_version.pdf [b] https://sites.google.com/view/discord-aware-matrix-profile [c] Irrational Exuberance: Why we should not believe 95% of papers on Time Series Anomaly Detection. https://www.youtube.com/watch?v=Vg1p3DouX8w [d] https://drive.google.com/file/d/1hEgOKtoTuHGPMqR1wty8ff_jes93ra9a/view submitted by /u/ylu175 [link] [comments]  ( 85 min )
    [D] Is audio style transfer a thing ?
So we have image style transfer; there are a lot of good papers and implementations. Is there such a thing as audio style transfer, where one song keeps its lyrics and melody but gets the other song's style? E.g. pop music with a rock style? If yes - can you please share a link? submitted by /u/keremidk0 [link] [comments]  ( 84 min )
    [R] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (Google - Parti)
Google published results from a seq2seq transformer model for autoregressive image generation. Website: https://parti.research.google/ Paper: https://gweb-research-parti.web.app/parti_paper.pdf submitted by /u/htrp [link] [comments]  ( 85 min )
    [R] Black box adversarial attacks that do not require output labels
    For those who specialize in adversarial machine learning, are there any black box attacks that do not require the model's output labels when generating adversarial images? I can't seem to find any submitted by /u/berimbolo21 [link] [comments]  ( 83 min )
    [D] Questions about the Fastformer
Yannic made a video about it, and the Fastformer was discussed on reddit before, so I figured I'd ask here: https://preview.redd.it/s12egeoh37791.png?width=528&format=png&auto=webp&s=1a033fbf6e01353aee463f7768fc49048fd44791 Do I understand it correctly that they are just measuring the attention bit, and not the whole layer's performance (as the Y-axis label implies)? Is the Fastformer appropriate at all for the kinds of tasks that are the bread and butter of the Transformer, like language models and translation? Has anyone here tried the Fastformer on those? submitted by /u/we_are_mammals [link] [comments]  ( 84 min )
    [D] Any way to speed up simple mathematical functions without implementing cuda kernels for pytorch?
I am working on a PyTorch project and I have a custom computation that I am so far unable to express as a combination of pre-defined PyTorch functions (because it's essentially some loops around conv2d calls where I juggle some indices in a 5-d tensor). So currently I use Python loops with some smart padding, but that's not the fastest. The only way to speed this up would be, I think, to implement custom CUDA kernels. While the computation is not that trivial, it is simple in a mathematical way: it can be defined in a single line using lots of indices and sums. I wonder whether there is really nothing I can do? What I am thinking of is something like tensor-comprehensions, but that's deprecated and I didn't get it to install. Is there any modern alternative to tensor-comprehensions, or should I switch the language to e.g. Julia? Is it possible to define a slightly different conv2d there and have it run natively on the GPU? I don't expect performance comparable to the handwritten conv2d kernels, but the Python loops are just quite slow. submitted by /u/LeanderKu [link] [comments]  ( 86 min )
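One low-effort thing worth trying before custom kernels is TorchScript, which can compile the Python loop away and fuse some pointwise operations; gains depend heavily on the actual computation. A hedged sketch with a toy stand-in for the index-juggling loop:

```python
import torch

@torch.jit.script
def shifted_sums(x: torch.Tensor, shifts: int) -> torch.Tensor:
    # toy stand-in for "loops juggling indices around conv2d calls";
    # the scripted loop avoids Python interpreter overhead per iteration
    out = torch.zeros_like(x)
    for s in range(shifts):
        out = out + torch.roll(x, shifts=s, dims=-1)
    return out

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 3, 32, 32, device=device)
y = shifted_sums(x, 4)
```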
[R] 🔎 How I found external data for a #1 Private Leaderboard solution on Kaggle
Intro Competition: TPS January 2022, SMAPE as a target metric 📚 In this notebook we'll use: Upgini - a low-code feature search and enrichment library for supervised machine learning applications. GitHub The baseline model in this notebook is based on the u/ambrosm notebook (first place) with some minor changes: the feature engineering part was slightly changed, so we can prepare main features and external features separately; SimpleImputer was added to the dataprep pipeline to deal with missing values while adding new external features; the constant scaling factor for the test predictions was removed. How can external data & features help on Kaggle? Kaggle is always about learning and leaderboard progress (hopefully from learning, not cheating ;-)) And every Kaggler wants to progress as f…  ( 94 min )
[D] How to compare model performance when you add data with label noise?
Let's say I'm trying to categorize vendors based on their description using some NLP technique. I have a limited dataset of vendors with high-quality (low-noise) labels. I split into train/test, and score say 90% accuracy. I then get hold of a dataset of third-party vendors, which will have much noisier (but still useful) data. Now when I train the model I get 89% accuracy. How do I interpret this? The noisier data will also go in the test split, and the model is expected to perform worse on those, so even if it's exactly as good as the prior model on the old data, it should have a worse average performance on the new dataset. It could even be better, say scoring 91% on the old data but 85% on the new data, so the average accuracy looks lower even though you have a better model. Testing the old model on the new test set I guess would settle this? Just curious if there are any best practices. submitted by /u/bandalorian [link] [comments]  ( 88 min )
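Evaluating the clean and noisy test subsets separately (and comparing the old vs. new model on the same clean test set) is the usual way to settle this. A minimal sketch; the toy arrays here are stand-ins for your data:

```python
import numpy as np

# tag every test example with its source so accuracy can be reported
# per subset as well as overall
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
source = np.array(["clean", "clean", "clean", "noisy", "noisy", "noisy"])

for s in ("clean", "noisy"):
    m = source == s
    print(s, "accuracy:", (y_true[m] == y_pred[m]).mean())
print("overall accuracy:", (y_true == y_pred).mean())
```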
    [P] Bottom-up look at the new Lightning Framework for building anything from production-ready ML systems to research demos
The open-source lightning.ai framework just launched last week, introducing the concept of Lightning Apps. It's basically meant for building anything from production-ready ML systems running on multi-node GPU clusters in the cloud to simple research demos. Starting with a simple use case, a research demo, I wrote a "short" article about it to explain how it roughly works under the hood: Sharing Deep Learning Research Models with Lightning Part 1: Building A Super Resolution App Looking forward to hearing your feedback. I am planning to put together more "substantial" examples, but I was thinking of doing that one step at a time. I will be attending a conference in 3 weeks and am planning to create a research demo alongside the paper I will be presenting, and I was wondering, besides Gradio/Dash, what are your typical tools and workflows for making research demos? Any cool examples for inspiration? Disclaimer: I recently joined Lightning when I saw an early prototype. As someone who has spent most of my time on research models, I was always intrigued by putting ML models into production. However, I was also always turned off by the tooling that it involved. submitted by /u/seraschka [link] [comments]  ( 86 min )
    [D] Have you ever been asked to work on a software project you found unethical? We’d like to hear from you!
    We are researchers at Carnegie Mellon University studying how software developers identify and act on ethical concerns at work. If you’re interested in helping us advance research in software ethics, please fill out this survey and we’ll reach out to you for a quick interview! P.S. You can check out this Stack Overflow blog post to read more about the direction of our research. Anything you disclose to us during the survey / interview may appear in our study but will not be traceable to you. submitted by /u/curious_cow_99 [link] [comments]  ( 88 min )
    [R] Breaking Down Out-of-Distribution Detection
    TL;DR: Many OOD detectors that are trained with samples from an (unrelated) OOD dataset can be understood by isolating a binary discriminator between in-distribution and OOD. We just published it on arXiv and will present it at ICML 2022. Questions and discussion are very welcome! Full title: Breaking Down Out-of-Distribution Detection: Many Methods Based on OOD Training Data Estimate a Combination of the Same Core Quantities by Julian Bitterwolf, Alexander Meinke, Maximilian Augustin, Matthias Hein. submitted by /u/JBitterwolf [link] [comments]  ( 84 min )
    [D][R] Is there any benchmark task set for computer vision?
I know that in NLP, there are some benchmark task sets like GLUE, SuperGLUE, etc. I wonder whether there is any similar benchmark task set for computer vision, so that we can easily test many tasks in a unified way? submitted by /u/singularpanda [link] [comments]  ( 84 min )
    [R][P] Best Approach to do Image Inpainting in Video Files (Image Timeseries)
First time posting here. I am working with image timeseries of satellite images. These are essentially 1-hour-long video files with an image size of 384 x 384 pix. The images have chunks of data missing, say 20 x 20 pix at different parts of the image. I would say that the missing part of the image is roughly 20%-25%. Now I have the ground truths to train a neural network, but what I am struggling with is what primary architecture I should begin with: CNN, LSTM, CNN-LSTM, U-Net? I found this literature: https://arxiv.org/abs/2112.09262 - which exploits a U-Net autoencoder architecture to solve the image inpainting problem, but I am not sure how robust this is for 3D (x,y,t) image cubes. Is there anyone experienced here who has worked on image inpainting on video files? Can you please share your experience? If you can point me towards reliable literature, that would be a big help! submitted by /u/bahauddin_onar [link] [comments]  ( 86 min )
    [Discussion] Iteration of Machine Learning Systems
Engineering systems progress by addressing use cases of increasing levels of complexity. For example, you start with a 'minimum viable product' and then slowly add features or complexity as things progress. However, this is not how machine learning systems progress. You don't start with 10 positive/negative samples and then iteratively add more. It's not even wise to start with one (or a few) 'tasks' and then add new ones as things progress. Clearly, iteration (or progress) in machine learning systems does not follow the same pattern as in traditional engineering systems. Is there another way to think about iteration? submitted by /u/TheFibo1123 [link] [comments]  ( 89 min )
    [R] EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine
    submitted by /u/hardmaru [link] [comments]  ( 83 min )
  • Open

    Visual inspection automation using Amazon SageMaker JumpStart
    According to Gartner, hyperautomation is the number one trend in 2022 and will continue advancing in future. One of the main barriers to hyperautomation is in areas where we’re still struggling to reduce human involvement. Intelligent systems have a hard time matching human visual recognition abilities, despite great advancements in deep learning in computer vision. […]  ( 7 min )
  • Open

    BCI Controlled Robot Arm For Amputee | Breakthrough 3D Printing Tech Builds Robot In 1 Step
    submitted by /u/getrich_or_diemining [link] [comments]  ( 82 min )
    New Tutorial Disco Diffusion video
Just finished part 1 of my new tutorial series on video/animation with Disco Diffusion. The first one just covers the basics of 2D/3D mode, and I also show how to use prompt weights and keyframes to change the scene, like changing from summer to winter in this video: https://www.youtube.com/watch?v=HbPz2K40e_k https://reddit.com/link/vibqmd/video/d543o3bus7791/player submitted by /u/prfitofthesngularity [link] [comments]  ( 82 min )
    Deploy, run, and monitor ML/AI models for free with Modzy Basic+
    MLOps for free: with Modzy Basic+, you can deploy, run, integrate, and monitor up to five of your own ML/AI models at scale. With Modzy Basic+, you gain access to an enterprise-grade MLOps platform, without the price. Deploy up to five of your own models that can run on a CPU and 4GB of RAM. From there, easily integrate your models into web apps, mobile apps, pipelines or any other tool using our APIs and SDKs, and run up to 10,000 inferences per day. Finally, monitor your models in production to ensure peak performance. To get started running your AI models at scale, sign up for Modzy Basic+ today. The Modzy platform accelerates the deployment, integration, and governance of production-ready AI. With integrations for the leading data science and DevOps tools, teams count on Modzy to quickly and easily build AI-enabled applications in standard, repeatable, and secure ways. By leveraging Modzy as a central location for monitoring all AI across the enterprise or at the edge, teams can establish governance and security while generating higher returns from AI. Get started running your AI at scale with Modzy Basic+ today. submitted by /u/modzykirsten [link] [comments]  ( 83 min )
    New Pathways Text to Image model
    submitted by /u/manOnPavementWaving [link] [comments]  ( 82 min )
    Amazon AI Researchers Open-Source ‘Syne Tune’: A Novel Python Library For Distributed HPO With An Emphasis On Enabling Reproducible Machine Learning Research
Deep learning models with billions of parameters are trained through gradient-based stochastic optimization, thanks to powerful algorithms, systems, and hardware advancements. These algorithms include several hyperparameters that are essential for effective performance. Hyperparameter adjustment is required to control the behavior of a machine learning model: if the hyperparameters are not correctly set, the learned model parameters will not minimize the loss function, resulting in poor results, in practice a worse accuracy or confusion matrix. Many hyperparameters exist, like the learning rate, the regularisation type and degree, and the size of neural network layers. Automating the setting of these hyperparameters and accelerating the training of neural network weights are necessary if domain experts and industry practitioners are to benefit from the most recent deep learning technologies. Even for specialists, tuning them takes a lot of time and effort; choosing the best hyperparameter configuration frequently depends on factors like cost or latency. Continue reading | Checkout the paper, github submitted by /u/Embarrassed-Fee5513 [link] [comments]  ( 83 min )
    8 Famous Definitions of Artificial Intelligence
    submitted by /u/Philo167 [link] [comments]  ( 82 min )
    It’s data visualisation of NYTimes articles from 1851 until now
    submitted by /u/galacticfarthole [link] [comments]  ( 82 min )
    Nvidia 3D MoMa: Neural Inverse Rendering turns photos into 3D objects within an hour
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    A beautiful sunset over a colourful and detailed tropical landscape created on Pixelz.ai
    submitted by /u/pixelz_ai [link] [comments]  ( 82 min )
    A celebrated AI has learned a new trick: How to do chemistry
    submitted by /u/estasfuera [link] [comments]  ( 82 min )
    [Research] Data Labeling Research
Hi! I'm doing some market research for a data labeling product and want to ask the actual people in the industry (you) for opinions on what the reality of the industry actually is. Any/all responses are super helpful, so thank you in advance if you answer/are able to answer my questions.
- Does your company use a data labeling tool? If so, what? If not a specific tool, how do you label your data?
- Who actually does the labeling? Is it engineers? Outsourced? Someone on Fiverr?
- Are you aware of data labeling tools that exist on the market? If so, can you name a company or two that comes to mind?
- What is the single greatest issue/missing functionality of a current tool you use (if you have one)? Feel free to mention the tool, if that helps add context to the data medium (text, audio, video, image).
- I'm currently trying to determine what the most important product features are for text/audio labeling; what would those be for you? (e.g. a specific use case, UI/UX functionality, integrations, automation, etc.)
- What do you think is a fair price for a tool to do data labeling? (specifically text/audio)
Even if you can only answer one or several questions, all responses are extremely helpful! Again, thank you so much for your time and for the help. submitted by /u/AnGrAnHo [link] [comments]  ( 83 min )
    GOLDEN STATE WARRIORS | 2022 NBA WORLD CHAMPIONS | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Why Data Scientists Are Increasingly Quitting Their Jobs: Lack of Skills or Different Expectations?
    submitted by /u/saik2363 [link] [comments]  ( 83 min )
What are the skills that AIs already have, such as talking to each other, writing texts, and generating images? Is there a website that lists these skills?
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 82 min )
    What are the best chat AIs like LaMDA which I can use?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
What is the best AI chatbot to talk to?
    submitted by /u/NextDream [link] [comments]  ( 82 min )
    Weight in image?
Is it possible for an A.I to figure out what something weighs from the image itself, with no external data? Thanks submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    "Islands" 🏝️ created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "Space in a jar" 🌌 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "The End - Los Angeles" 🌆 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
  • Open

    Brain Computer Interface + AI Controlled Limbs For Amputees | New Neuromorphic AI Chip
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
    AlexNet paper architecture
Why, in the paper, is the architecture split into two equal-size stacks after the first step? I.e. instead of having a single 55x55 stack with 96 filters after the first step, it has two 55x55 stacks with 48 filters each. Correct me if I'm wrong, but I believe they divided it because a single GPU didn't have enough memory/computational power, right? submitted by /u/PlentyRadiant4191 [link] [comments]  ( 82 min )
    non-programmer theorizing on multithreaded neural subnetworks
    I am currently taking courses in python, but I won't be up to attempting this for... I don't know how long. But one of my goals with python is to create an evolution simulator similar to r/TheBibites, and while considering some limitations with creatures not being able to tell a prey item apart from their own child, I theorycrafted this as a way to give creatures more information about the things they're looking at. I can't find any sources about something like this being done before, but I don't know how to search for those sources given that others probably wouldn't name this the same way I did. So my main question is "does this sound like anything that you already know about?" with the follow up question, "does this sound like it would work?" - - - - - - So the idea is that the creatu…  ( 84 min )
  • Open

    Quantum Advantage in Learning from Experiments
    Posted by Jarrod McClean, Staff Research Scientist, Google Quantum AI, and Hsin-Yuan Huang, Graduate Student, Caltech In efforts to learn about the quantum world, scientists face a big obstacle: their classical experience of the world. Whenever a quantum system is measured, the act of this measurement destroys the “quantumness” of the state. For example, if the quantum state is in a superposition of two locations, where it can seem to be in two places at the same time, once it is measured, it will randomly appear either ”here” or “there”, but not both. We only ever see the classical shadows cast by this strange quantum world. A growing number of experiments are implementing machine learning (ML) algorithms to aid in analyzing data, but these have the same limitations as the people they a…  ( 28 min )
    Mapping Urban Trees Across North America with the Auto Arborist Dataset
    Posted by Sara Beery, Student Researcher, and Jonathan Huang, Research Scientist, Google Research, Perception Team Over four billion people live in cities around the globe, and while most people interact daily with others — at the grocery store, on public transit, at work — they may take for granted their frequent interactions with the diverse plants and animals that comprise fragile urban ecosystems. Trees in cities, called urban forests, provide critical benefits for public health and wellbeing and will prove integral to urban climate adaptation. They filter air and water, capture stormwater runoff, sequester atmospheric carbon dioxide, and limit erosion and drought. Shade from urban trees reduces energy-expensive cooling costs and mitigates urban heat islands. In the US alone, urban fo…  ( 27 min )
  • Open

    Meet the Omnivore: Director of Photography Revs Up NVIDIA Omniverse to Create Sleek Car Demo
    A camera begins in the sky, flies through some trees and smoothly exits the forest, all while precisely tracking a car driving down a dirt path. This would be all but impossible in the real world, according to film and photography director Brett Danton. The post Meet the Omnivore: Director of Photography Revs Up NVIDIA Omniverse to Create Sleek Car Demo appeared first on NVIDIA Blog.  ( 6 min )
    Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery With NVIDIA GPUs
    It may seem intuitive that AI and deep learning can speed up workflows — including novel drug discovery, a typically years-long and several-billion-dollar endeavor. But professors Artem Cherkasov and Olexandr Isayev were surprised to find that no recent academic papers provided a comprehensive, global research review of how deep learning and GPU-accelerated computing impact drug Read article > The post Artem Cherkasov and Olexandr Isayev on Democratizing Drug Discovery With NVIDIA GPUs appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Is the healthcare sector reaping the benefits of RPA?
    Robotics Process Automation (RPA) is all about incorporating solutions that handle repetitive tasks faster and more efficiently. These…  ( 9 min )
  • Open

    Numerically evaluating a theta function
    Theta functions pop up throughout pure and applied mathematics. For example, they’re common in analytic number theory, and they’re solutions to the heat equation. Theta functions are analogous in some ways to trigonometric functions, and like trigonometric functions they satisfy a lot of identities. This post will comment briefly on an identity that makes a […] Numerically evaluating a theta function first appeared on John D. Cook.  ( 5 min )

  • Open

    Relevant XKCD (make sure to read the alt-text)
    submitted by /u/webbitor [link] [comments]  ( 82 min )
    General AI Sentience
    submitted by /u/PrincePaulSMamakos [link] [comments]  ( 82 min )
    What do you have to say for yourselves now, flat-earthers?
    submitted by /u/Strawberrwies [link] [comments]  ( 82 min )
    Sam Harris on the Dangers of AI With Superhuman Intelligence - "It is a failure of imagination to think that being in relationship to something more intelligent than yourself isn't, in most cases, a circumstance of real peril." (short audio clip)
    submitted by /u/biohacker045 [link] [comments]  ( 86 min )
    Do you think Imagen is really as good as it looks like in the promo images?
Just looking at them makes you believe everything is shopped, because ALL of those images are just way too detailed and can't possibly be that much on point. submitted by /u/ghostryder333 [link] [comments]  ( 82 min )
    HumanNeRF can render people in 3D from a regular video - using just a single camera perspective
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    Global Skills Report 2022
    submitted by /u/awsconsultant [link] [comments]  ( 82 min )
    Well, I would say that this AI is not at all accomplished!
A prediction of today's 6-figure EuroMillions draw submitted by /u/StantheBrain [link] [comments]  ( 82 min )
    This iteration of the weekly AI digest newsletter focuses on Dalle mini, a free, open-source AI that produces amazing images from text inputs. Here’s how it works and some commentary by our AI ethicist Lauren Keegan
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    How To Reduce Bias in Machine Learning
    Researchers and engineers have already applied several positive practices to reduce ML bias. This article covers each step in the machine learning project pipeline and discusses how to reduce machine learning bias at each stage. https://www.toolbox.com/tech/artificial-intelligence/guest-article/how-to-reduce-bias-in-machine-learning/ submitted by /u/lklimusheuskaja [link] [comments]  ( 82 min )
Is there an AI that searches a list for similar sentences? I want to copy all the jokes off the internet into a list and see which ones are duplicates.
    submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 83 min )
Is there an AI to mash 2 people together to get a similar picture like this?
    submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    THE VATICAN | FAST MODE! DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
Best image generator AI for stylized characters: no realism, but not solely anime either
I'm building out a comic project and would love to use a generator that can give me a stylized base to work off of for characters. I don't want an anime-only one, though those seem cool as well. What would be the ones that can match this? Am I asking for too much? I'm still a beginner; I liked what I've used, but the tools seem mostly for environments so far. submitted by /u/ChrisMFerguson [link] [comments]  ( 82 min )
    16 Funny Insurance Memes That We Can All Relate To
    submitted by /u/flipsis [link] [comments]  ( 82 min )
    /g/ - Is this really the future of AI? - Technology (GPT-4chan generated greentext)
    submitted by /u/Aspie96 [link] [comments]  ( 83 min )
  • Open

    [Discussion] Tired of cleaning data?
    If so, we open-sourced a data cleaning tool (https://github.com/mage-ai/mage-ai) that will help you easily identify issues, quickly improve data quality, and repeat the process in any environment. Would love to get some feedback and hop on Zoom call if you have any questions/ help setting up. Feel free to join our slack: https://www.mage.ai/chat Thanks, appreciate it! submitted by /u/ollie_wollie_rocks [link] [comments]  ( 84 min )
[D] Techniques for dealing with classic statistical data gathering problems: selection bias, differential attrition, experimenter bias, etc. in Machine Learning?
Can anyone suggest papers or techniques in ML to deal with some of the statistical bias problems outlined in the title (selection bias, differential attrition, experimenter bias, etc.)? submitted by /u/Upstairs-Jicama-8347 [link] [comments]  ( 83 min )
    [N] What do you think of Andrew Ng's new Machine Learning Specialization that launched last week on Coursera?
    Specialization Intro video: https://youtu.be/g7dv-Lnuor4 Specialization on Coursera: https://www.coursera.org/specializations/machine-learning-introduction submitted by /u/manocormen [link] [comments]  ( 84 min )
    [R] - Call For Participants SocialDisNER (SMM4H@COLING 2022) on Detection of Disease Mentions in Social Media
CFP - SocialDisNER track: Detection of Disease Mentions in Social Media (SMM4H Shared Task at COLING 2022) https://temu.bsc.es/socialdisner/ Despite the high impact & practical relevance of detecting diseases automatically from social media for a diversity of applications, few manually annotated corpora generated by healthcare practitioners to train/evaluate advanced entity recognition tools are currently available. Developing disease recognition tools for social media is critical for:
- Real-time disease outbreak surveillance/monitoring
- Characterization of patient-reported symptoms
- Post-market drug safety
- Epidemiology and population health
- Public opinion mining & sentiment analysis of diseases
- Detection of hate speech/exclusion of sick people
- Prevalence of work-associated d…  ( 86 min )
[N] [D] OpenAI, who runs DALL-E 2, allegedly threatened the creator of DALL-E Mini
    Trying to cross-post what I think is a discussion that is relevant to this community. This is my third attempt, I hope I'm doing it correctly this time: https://www.reddit.com/r/dalle2/comments/vgtgdc/openai_who_runs_dalle2_alleged_threatened_creator/ EDIT: here are the original pre-prints for added context: DALL-E: Zero-Shot Text-to-Image Generation - The only place the term "DALL-E" appears is the URL to the github repo. Dall-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents - They consistently refer to the first paper as "DALL-E", but refer to the work being described in the new paper as "unCLIP" and are careful to only use 'DALL-E 2' in the context of a product description, e.g. "DALL·E 2 Preview platform (the first deployment of an unCLIP model)" submitted by /u/DigThatData [link] [comments]  ( 91 min )
    [D] Get input required of a neural network for a given output
Hello Folks! I'm gathering information on how to obtain the set of inputs (there can be more than one) that produce a given output in a simple neural network. Let's suppose I'm using a vanilla 1-hidden-layer fully connected network with a non-linear activation function. I've come across a few options, like numerically solving the inverse equation (given its non-linearity, I'm not sure how one would solve it analytically, though with ReLUs we can analytically end up with multiple equations), or using backpropagation with a cost defined on a small perturbation from the desired output. So, I wanted to know if you guys know of any literature on this, or have opinions or tricks or anything that might prove useful! Thanks in advance! submitted by /u/FlavorfulArtichoke [link] [comments]  ( 86 min )
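The backprop option the post mentions is straightforward to prototype: freeze the trained network, treat the input as the parameter, and descend on a loss between the output and the target. A hedged sketch with a stand-in network; note the inverse is generally one-to-many, so different initializations recover different (approximate) preimages:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # stand-in
for p in net.parameters():
    p.requires_grad_(False)            # freeze the trained weights

target = torch.tensor([[1.0, -0.5]])
x = torch.randn(1, 4, requires_grad=True)   # the input is the only parameter
opt = torch.optim.Adam([x], lr=0.05)

for _ in range(500):
    opt.zero_grad()
    loss = ((net(x) - target) ** 2).sum()
    loss.backward()                    # gradients flow to x, not the weights
    opt.step()

print(net(x).detach(), "vs target", target)  # x is now an approximate preimage
```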
    [R] DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models
    Abs: We introduce DoWhy-GCM, an extension of the DoWhy Python library, that leverages graphical causal models. Unlike existing causality libraries, which mainly focus on effect estimation questions, with DoWhy-GCM, users can ask a wide range of additional causal questions, such as identifying the root causes of outliers and distributional changes, causal structure learning, attributing causal influences, and diagnosis of causal structures. To this end, DoWhy-GCM users first model cause-effect relations between variables in a system under study through a graphical causal model, fit the causal mechanisms of variables next, and then ask the causal question. All these steps take only a few lines of code in DoWhy-GCM. Paper: https://arxiv.org/abs/2206.06821 Code: https://github.com/py-why/dowhy submitted by /u/bikeskata [link] [comments]  ( 84 min )
    [D] How to best extract product benefits/problems from customer reviews using NLP?
I am working on a prototype that takes in a list of customer reviews about a specific product and returns a list of (unique) benefits and problems from these reviews. These should be non-generic, e.g. for a camera, a benefit might be "great for panoramic photos" and not just "good quality". My initial idea was to go about this in two steps: (1) use NER to identify phrases describing benefits or problems, and (2) use text summarization to create the final output. When starting to create some NER labels, I realized that benefits and problems are often mixed, spread across multiple sentences, or mentioned cryptically or indirectly, making it extremely hard to come up with concise labeling instructions. Therefore I assume that the model will also have quite a hard time correctly extracting benefits and problems. Does anyone have an idea of how to tackle this in a different, more promising way? Any kind of feedback is more than welcome 🙏 submitted by /u/AdPlenty6685 [link] [comments]  ( 86 min )
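One alternative to token-level NER that sidesteps the labeling problem: classify whole sentences as benefit/problem/neutral with a zero-shot model, then deduplicate or summarize each bucket. A hedged sketch using the Hugging Face zero-shot pipeline (default model, no fine-tuning; treat it as a baseline, not a benchmark):

```python
from transformers import pipeline

clf = pipeline("zero-shot-classification")   # downloads the default NLI model
labels = ["product benefit", "product problem", "neutral"]

sentences = [
    "Great for panoramic photos.",
    "The battery dies after an hour in cold weather.",
]
for s in sentences:
    out = clf(s, candidate_labels=labels)
    print(out["labels"][0], "->", s)         # labels are sorted by score
```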
    [D] Running experiments, tuning, analysing results, how do you organise your time on this?
Hi people, I would like to ask you how you organise yourselves for running experiments, tuning your models, and analysing your results. Do you run a massive grid search and then analyse everything at the end? Do you run one/a few experiments, see how it went, and repeat the process? Have you learned any insights on how to do this efficiently? I often find myself running several searches over one or a couple of parameters at a time, based on the premise that some regions of a big grid search may be completely useless and a waste of time. The downside of this is that for every search I need to analyse its results and, based on them, try to pick a good set of hyperparams for the next one; with a massive grid search over all of the possible hyperparams, I would just pick the best model once it is done. I would like to hear what you do! submitted by /u/juanigp [link] [comments]  ( 85 min )
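One middle ground between many small searches and one giant grid: hand the staging to a sampler with pruning, so there is a single study to analyse. A hedged sketch with Optuna; train_and_eval is a hypothetical stand-in for your training code:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    wd = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    layers = trial.suggest_int("layers", 1, 4)
    # train_and_eval is a hypothetical stand-in returning a validation score
    return train_and_eval(lr=lr, weight_decay=wd, layers=layers)

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=100)   # the sampler does the coarse-to-fine
print(study.best_params)
```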
    [D] NVlabs finally released the code for EG3D, but no inversion script?
Hi, so we can finally play around with the cool NVlabs EG3D, but they refuse to release the inversion script. Has anyone had success passing an image in and reconstructing a face in this project? I am not having success when trying to do this, so I would greatly appreciate it if anyone could share how to do it, or if you know of an existing fork? submitted by /u/mobani [link] [comments]  ( 84 min )
    [D] Machine learning books for free offered with full source document (LaTeX)
Top-quality machine learning papers and books, not only for free, but offered with full LaTeX source, bib file, and raw figures, so that anyone can easily incorporate parts of these books (formulas, tables, pictures, text, references, etc.) into their PhD thesis, articles, or reports. The user could even fix any typo they find and then print an enhanced version of the book, for private (or public) use. Sounds like a dream? I am actually thinking of offering this, with my numerous papers/books. My question is this: is it a good idea? Should I charge a fee (in other words: would you pay for it)? I understand some will use the material for plagiarism, but I am not too concerned about it - or should I be? My first candidate book for this is the following: https://mltechniques.com/2022/03/22/book-stochastic-processes-and-simulations/. I just finished converting all the Perl code into Python, and will soon publish the 2nd edition, this time in Python [if it comes with LaTeX code, it means that the user can easily extract the Python code from the book, though it is also on GitHub]. submitted by /u/MLRecipes [link] [comments]  ( 88 min )
  • Open

    Accelerate your career with ML skills through the AWS Machine Learning Engineer Scholarship
    Amazon Web Services and Udacity are partnering to offer free services to educate developers of all skill levels on machine learning (ML) concepts with the AWS Machine Learning Engineer Scholarship program. The program offers free enrollment to the AWS Machine Learning Foundations course and 325 scholarships awarded to the AWS Machine Learning Engineer Nanodegree, a […]  ( 5 min )
    Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 2
    Mangrove forests are an import part of a healthy ecosystem, and human activities are one of the major reasons for their gradual disappearance from coastlines around the world. Using a machine learning (ML) model to identify mangrove regions from a satellite image gives researchers an effective way to monitor the size of the forests over […]  ( 10 min )
    Identify mangrove forests using satellite image features using Amazon SageMaker Studio and Amazon SageMaker Autopilot – Part 1
    The increasing ubiquity of satellite data over the last two decades is helping scientists observe and monitor the health of our constantly changing planet. By tracking specific regions of the Earth’s surface, scientists can observe how regions like forests, water bodies, or glaciers change over time. One such region of interest for geologists is mangrove […]  ( 14 min )
  • Open

    How To Build Multi-Layer Perceptron Neural Network Models with Keras
    The Keras Python library for deep learning focuses on the creation of models as a sequence of layers. In this post you will discover the simple components that you can use to create neural networks and simple deep learning models using Keras from TensorFlow. Let’s get started. May 2016: First version Update Mar/2017: Updated example […] The post How To Build Multi-Layer Perceptron Neural Network Models with Keras appeared first on Machine Learning Mastery.  ( 18 min )
  • Open

    CodaLab - Competition
[ML competition announcement] Improve aerial navigation by determining the camera pose of aerial images (in 6D: x, y, z coordinates and aerial camera angle). Prize: 10'000 CHF. Dataset: over 16,000 HD aerial images are available for training. Timeline: starts 21 June 2022, ends 21 December 2022. Link: https://codalab.lisn.upsaclay.fr/competitions/5481 Happy coding! submitted by /u/Kindly_Toe_440 [link] [comments]  ( 82 min )
    Looking for Papers/Conferences on solving moral problems (as opposed to social/ethical problems)
RL seems to be a go-to for solving these sorts of "philosophical" problems - I've personally seen it applied as Sequential Social Dilemmas (SSDs) and Markov Game SDs (MGSDs). I am intrigued to know if the same level of concern is placed on moral problems. Considering I can't find nearly as much research on this subject, my initial feeling is 'no', though by the same consideration I can't say this for sure. Are there any good conferences (similar to AIES/ACM conferences), papers, or even non-profit research hubs that could be a good starting point for diving into this sort of research? (N.B. This is a personal interest, so if there are less formal articles/sites like GitHub repos and what-not, feel free to mention them too!) submitted by /u/Background-Cable-491 [link] [comments]  ( 83 min )
    Convergence of Loss and MAE in Deep Q Network
Hello everyone! I have been learning about RL and DQNs and wanted to apply these to a simple custom environment. I've been able to achieve decent results, but I have noticed the following and was hoping someone could help me understand it better: the loss and MAE values grow indefinitely without converging, even when the agent has reached the optimal value while training. Is there an issue with the agent or the environment? I checked for resources related to this specifically but could not find anything. Is convergence of loss and MAE not necessary for a DQN to function? I have also noticed that the agent diverges from the optimal value when I increase the number of steps to larger values. Any particular reason for this to happen? Thanks in advance! submitted by /u/AakashK12 [link] [comments]  ( 84 min )
    Resources for reinforcement learning?
    What would be the minimum hardware expectation to train a 3D model to learn parkour using reinforcement learning? Any free hardware resources for a research project? University doesn’t provide hardware resources and I have a GTX 1650ti mobile GPU. Edit: The environment would be a static simulated physics environment of around 2 to 3 blocks. The agent would be a 3D walker with hand and feet movement. submitted by /u/Live-Pass-7157 [link] [comments]  ( 85 min )
  • Open

    Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability
    Early last year, our research team from the Visual Computing Group introduced Swin Transformer, a Transformer-based general-purpose computer vision architecture that for the first time beat convolutional neural networks on the important vision benchmark of COCO object detection and did so by a large margin. Convolutional neural networks (CNNs) have long been the architecture of […] The post Swin Transformer supports 3-billion-parameter vision models that can train with higher-resolution images for greater task applicability appeared first on Microsoft Research.  ( 14 min )
  • Open

    Towards Ethical AI
    Implications of Becoming One with the Machine Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Google JAX vs PyTorch vs TensorFlow: Which is the best framework for machine learning?
    Google JAX is a powerful framework for machine learning that offers many benefits over other popular frameworks such as PyTorch and…  ( 10 min )
  • Open

Can someone explain, in simple terms, what the term "age" means in a neural gas network?
I am studying the topic of neural gas. At the beginning, the process starts with two neurons connected by an edge, displayed in an n-dimensional crossplot, with each axis of the crossplot representing an attribute/feature (for example house price, square feet, etc). That edge is assigned an "age" of 0 that changes with time (as the process adds new neurons to the network according to certain parameters), but I don't quite understand the concept of the "age of an edge", except that if it reaches a certain value, the edge is cut to form different clusters of data. submitted by /u/marveloustom [link] [comments]  ( 82 min )
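For what it's worth, in growing neural gas the age lives on the edge, not the neuron: it counts how long it has been since the edge's two endpoints were jointly the closest pair of units to some input. A minimal sketch of that bookkeeping (after Fritzke's 1995 formulation):

```python
import numpy as np

def gng_step(units, edges, x, max_age=50):
    """units: (n, d) array of neuron positions; edges: dict {(i, j): age}, i < j."""
    dists = np.linalg.norm(units - x, axis=1)
    s1, s2 = (int(i) for i in np.argsort(dists)[:2])   # winner and runner-up
    for e in list(edges):
        if s1 in e:
            edges[e] += 1                   # age every edge touching the winner
    edges[tuple(sorted((s1, s2)))] = 0      # refresh (or create) the s1-s2 edge
    for e, age in list(edges.items()):
        if age > max_age:                   # an old edge means its endpoints
            del edges[e]                    # stopped co-winning: data moved on
    return s1, s2
```

Pruning stale edges is what lets the network split into separate clusters when the data has separate modes.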
    Deep Learning on Edge Devices (Jetson Nano and TX2). Help!
Hi everyone, This is my first time seeking deep learning help on forums but Im desperate so plz help out! I decided to create a face recognition system and deploy it on two edge devices. For this purpose, I used the FaceBoxes model for face detection and the FaceNet model for creating 128-D embeddings of the detected faces. For classification, I used an MLP classifier which I trained on Google Colab. I took the trained Colab MLP model and deployed it on a Jetson Nano and a Jetson TX2. All the major packages (Python, OpenCV, Tensorflow, Numpy etc) used the same versions on both devices. Even the Jetpack on both devices was the same (4.4.1). The recognition results on each device, individually, were constant: if I ran face recognition on a video on the Jetson Nano, it would always give the same accuracy, 98%; same for the Jetson TX2, a constant accuracy of 99%. BUT I have to justify in my course why the two devices show different accuracy results on the SAME TEST VIDEO, using the SAME MODEL, trained on COLAB. Unfortunately, I am not a hardware expert. I thought maybe it could be a difference of quantization or FP16/FP32 or something, but I don't even know what these terms mean. So some help in justifying why the accuracies are different on the two platforms would be HIGHLY APPRECIATED. Please guide me. Thanks! BTW, I used the scikit-learn library for my implementation of the MLP classifier, and Tensorflow 2.3.1 for running the models. submitted by /u/Tired__Engineer [link] [comments]  ( 84 min )
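One plausible contributor, sketched below: if the two boards end up executing parts of the pipeline at different floating-point precisions (e.g. FP16 vs FP32), the same arithmetic accumulates different rounding error, and borderline similarity scores can land on different sides of the decision threshold. A toy illustration of the precision gap:

```python
import numpy as np

def naive_sum(dtype, n=10_000, step=0.0001):
    # accumulate step n times at the given precision
    acc = dtype(0.0)
    for _ in range(n):
        acc = dtype(acc + dtype(step))
    return acc

print("float32:", naive_sum(np.float32))   # close to 1.0
print("float16:", naive_sum(np.float16))   # stalls well short of 1.0
```

The float16 accumulator stops growing once the step is smaller than half the gap between representable numbers, which is exactly the kind of effect that makes nominally identical pipelines disagree by a point of accuracy.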
    GoogleNet from scratch
I have been trying to use the pre-trained model in PyTorch to do some classification, but with only 10 classes. However, I don't know how to change the last layer and train the model again. Now I am considering creating the model from scratch. However, in the paper it seems like they have 3 softmax heads to ensure that some classification is done after certain layers; these are only used for training. Can I get away with not adding the 3 softmax heads and only keeping one for training, or won't it be as good? submitted by /u/Capable-Effective-93 [link] [comments]  ( 83 min )
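On the practical question of swapping the last layer: with torchvision's GoogLeNet you only need to replace the final fc. The three softmaxes in the paper are the main head plus two auxiliary classifiers that were a training-time aid against vanishing gradients; fine-tuning with just the main head is common and usually works fine. A minimal sketch:

```python
import torch.nn as nn
from torchvision import models

# keep the pretrained backbone, swap only the classifier for 10 classes
model = models.googlenet(pretrained=True)   # ImageNet weights

for p in model.parameters():                # optional: freeze the backbone
    p.requires_grad = False

# the new head is freshly created, hence trainable
model.fc = nn.Linear(model.fc.in_features, 10)
```

Then train as usual, optimizing only the parameters with requires_grad=True.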
I wanna ask your opinion on whether I have gathered enough data or should gather more.
Hello, I wanna create a neural network that will read "DMG dealt" fields from pictures like this and output them. So far I have 1,677 pictures (they mostly have 3 fields, but some have 2 or 1). Do you think that's enough to label, or should I gather more? And one more question: is it a good idea to try to train it on these full pictures, or should I split the pictures so each one is an individual "damage dealt" field? submitted by /u/buxA_ [link] [comments]  ( 83 min )
  • Open

    Researchers release open-source photorealistic simulator for autonomous driving
    MIT scientists unveil the first open-source simulation engine capable of constructing realistic environments for deployable training and testing of autonomous vehicles.  ( 7 min )
  • Open

    Google at CVPR 2022
    Posted by Shaina Mehta and Kristen Borg, Program Managers This week marks the beginning of the premier annual Computer Vision and Pattern Recognition conference (CVPR 2022), held both in-person in New Orleans, LA and virtually. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence across CVPR 2022 with over 80 papers being presented at the main conference and active involvement in a number of conference workshops and tutorials. If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively exploring the latest machine learning techniques for application to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including on-device M…  ( 34 min )
  • Open

    AI in the Big Easy: NVIDIA Research Lets Content Creators Improvise With 3D Objects
    Jazz is all about improvisation — and NVIDIA is paying tribute to the genre with AI research that could one day enable graphics creators to improvise with 3D objects created in the time it takes to hold a jam session. The method, NVIDIA 3D MoMa, could empower architects, designers, concept artists and game developers to Read article > The post AI in the Big Easy: NVIDIA Research Lets Content Creators Improvise With 3D Objects appeared first on NVIDIA Blog.  ( 6 min )
    NVIDIA Joins Forum to Help Lay the Foundation of the Metaverse
    The metaverse is the next big step in the evolution of the internet — the 3D web — which presents a major opportunity for every industry from entertainment to automotive to manufacturing, robotics and beyond. That’s why NVIDIA is joining our partners in the Metaverse Standards Forum, an open venue for all interested parties to Read article > The post NVIDIA Joins Forum to Help Lay the Foundation of the Metaverse appeared first on NVIDIA Blog.  ( 6 min )
    3D Artist Jae Solina Goes Cyberpunk This Week ‘In the NVIDIA Studio’
    3D artist Jae Solina, who goes by the stage name JSFILMZ, steps In the NVIDIA Studio this week to share his unique 3D creative workflow in the making of Cyberpunk Short Film — a story shrouded in mystery with a tense exchange between two secretive contacts. The post 3D Artist Jae Solina Goes Cyberpunk This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 6 min )
    NVIDIA Accelerates Open Data Center Innovation
    NVIDIA today became a founding member of the Linux Foundation’s Open Programmable Infrastructure (OPI) project, while making its NVIDIA DOCA networking software APIs widely available to foster innovation in the data center. Businesses are embracing open data centers, which require applications and services that are easily integrated with other solutions for simplified, lower-cost and sustainable Read article > The post NVIDIA Accelerates Open Data Center Innovation appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    AI and Blockchain Cloud Services Orchestrate Digital Business Transformation
    The growing ubiquity of IoT and AI has left no industry untouched. Businesses have unlocked their transformational value in meeting the modern needs of consumers, with cloud computing posing as the key enabler and accelerator. Evidently, we are witnessing the action in a panoply of applications. Most evident are in supply chain innovation, healthcare IT,… Read More »AI and Blockchain Cloud Services Orchestrate Digital Business Transformation The post AI and Blockchain Cloud Services Orchestrate Digital Business Transformation appeared first on Data Science Central.  ( 19 min )

  • Open

    "Edge of the universe" 🌌 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    Last Week in AI: Controversy over Google's "sentient" chatbot, DALL-E Mini goes viral, Reddit bans deepfakes sub, AI to improve video calls, and more!
    submitted by /u/regalalgorithm [link] [comments]  ( 82 min )
    In this article, we showcase how to automate your data labeling using transformer models.
    submitted by /u/UBIAI [link] [comments]  ( 82 min )
    VQGAN+CLIP Resource for Text Prompts.
    I've been doing some art lately turning my abstract ink drawings into AI art using VQGAN+CLIP. Does anyone know a resource on how to structure the prompts like targeting a style vs a rendering type or using a specific artist style? Thanks. https://preview.redd.it/bmxcoz0y0u691.jpg?width=3000&format=pjpg&auto=webp&s=1480917b0c307d445b9ad14883e34ca886d8de35 submitted by /u/toaster_artist [link] [comments]  ( 82 min )
    AI Dream 57 - Incredible Cosmic Dream - vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 82 min )
    AI: Respectfully, I can take Batman.
    submitted by /u/Ania_IntelligentAF [link] [comments]  ( 82 min )
Is there an AI which I can use to create rap songs?
It would be amazing, because I love to rap and I am interested in whether an AI could help me write some songs. submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 82 min )
    Salesforce AI Open-Sources ‘OmniXAI’: A Python-based Machine Learning Library That Provides One-Stop Explainable AI (XAI) Solution To analyze, Debug, And Interprets AI Models
    Salesforce has built an open-source machine learning framework called OmniXAI, which stands for Omni eXplainable AI. This library takes an “omni-directional” approach to XAI, with extensive interpretable ML features that address many problems with explaining ML model decisions in reality. OmniXAI is a one-stop comprehensive library that makes explainable AI accessible to academics requiring explanations for each stage of the machine learning process. This is not limited to data exploration, feature engineering, model development, evaluation, decision making, etc. 🚦 A one-stop solution for analyzing different stages in a standard ML pipeline in real-world applications. 🚦 Two types of explanations — local and global 🚦 Includes most popular explanation methods, such as feature-attribution/importance explanation (LIME [1], SHAP [2], Integrated Gradients (IG) [3], Grad-CAM [4], L2X), counterfactual explanation (MACE [5]), partial dependence plots (PDP), and model-specific methods (linear and tree models) 🚦 Can be applied on tabular, vision, NLP, and time-series tasks. Continue reading | Checkout the paper, article, github, dashboard submitted by /u/No_Coffee_4638 [link] [comments]  ( 83 min )
    Chills… simply beautiful xpost r/singularity
    submitted by /u/ViperOrel23 [link] [comments]  ( 82 min )
    Budgeted reinforcement learning problem
Consider a budgeted sequential decision problem where we want to maximize the cumulative reward R over a finite horizon H by deciding how much of a budget B we allocate to channel x and channel y per timestep t. We can think of R as sales. The horizon is set to 30 days. The cumulative spent budget must not exceed the set budget B. At each timestep, we decide how much budget we want to allocate to each of the channels, and at each timestep we see the amount of sales the allocation generated. We cannot see how much in sales one channel alone generated, but only the total sales both of the channels generated. We can also retrieve some contextual variables that could be thought of as a state/observation for each channel; let's call them exogenous variables = {exog1, exog2, exog3 .... exog 10…  ( 87 min )
    Need help upscaling an image 5x using Gigapixel
Hi, I'm looking to 5x the image to print a playmat for my board game, but the original resolution isn't high enough for its size. I've tried a bunch of online tools but none seems good enough. Any help is highly appreciated submitted by /u/Rodcy [link] [comments]  ( 82 min )
Using machine learning in the travel industry - CHALLENGE
Hello everyone! I am from tryp.com, a travel-tech startup that is using AI to create complex travel itineraries on the go, from minimal user constraints. Trips are created in <15s for a defined time search range and start location. Currently we are embracing a new challenge to improve our offering: creating an AI, trained from screen recordings of purchases on 100s of websites, that can purchase travel tickets from any website, in any language. Has anyone worked on a similar challenge? We are looking to form a team to tackle it! submitted by /u/arangel96 [link] [comments]  ( 82 min )
    Quasi - A platform where people use AI to create with zero code
    submitted by /u/roblox22y [link] [comments]  ( 82 min )
    FLYING THRU SPACE AT 432HZ | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    “Sentience” is the wrong discussion to have on AI right now
    submitted by /u/bendee983 [link] [comments]  ( 84 min )
    I need help with my major project
I have a data set of 500 graphs. I want to compare an input graph with the 500 graphs I already have. Is there a way to do it? I really need this to be done. If the graph thing isn't possible, is there a way to compare the coordinates or parameters used to construct the graph? submitted by /u/sexyhoooman_hmu [link] [comments]  ( 82 min )
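If each "graph" is really a curve y = f(x), one simple route is to resample every curve onto a shared x grid and rank the 500 stored curves by distance to the query vector. A hedged sketch with synthetic stand-in curves (assumes x values are sorted increasing; DTW libraries are an option if curves can shift or stretch in x):

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)

def to_vector(x, y, grid):
    return np.interp(grid, x, y)        # resample onto the shared grid

# synthetic stand-in for the 500 stored graphs: sines with random phase
stored = np.stack([np.sin(2 * np.pi * (grid + p)) for p in rng.random(500)])
query = np.sin(2 * np.pi * (grid + 0.25))   # stand-in for the input graph

dists = np.linalg.norm(stored - query, axis=1)
print("closest stored graphs:", np.argsort(dists)[:5])
```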
    I think language models can't be sentient, but the creatures they write about can be.
Ok guys, I write this opinion often in comments, but I think it deserves a separate post. I think most of us agree that language models can't be sentient, because all in all LMs are just mathematical constructs that describe the probability of some combination of letters occurring in text. But I believe that the characters described in the generated texts can fulfill any definition of what is "sentient", if the language model is good enough. Look: Can they react to events that happen in their universe? Yes. Can they make plans in their universe? Yes. Can they express feelings in their imaginary universe? Yes. Will they avoid pain in their imaginary universe? Most of them will. Will they seek pleasure in their imaginary universe? Most of them will. Whatever criterion you can come up with, a good enough LM can write a text with a character that satisfies that criterion. And those characters are sentient, just not in our universe but in their own universes, which are born in the imagination of the combined human+computer system when we read the generated texts. If we view those characters from this perspective, then we can also answer the question of which moral rules we should apply to artificially sentient beings: since those beings exist in the imagination of some sort of system, we should apply the same moral standards as we apply to any other imaginary creatures. submitted by /u/Arqwer [link] [comments]  ( 85 min )
    How well does upscaling AI really work?
    I want to upscale some images from the internet and make some posters out of them. What is the best way to do it? submitted by /u/xXLisa28Xx [link] [comments]  ( 82 min )
    Joe Biden falling off a bicycle . (A.I generation)
    submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    Memory requirements for tabular Q-learning vs deep neural network?
    I want to compare the space complexity/memory requirements of tabular Q-learning vs. a deep Q-network (DQN). I think DQN would be faster and a Q-table is at a disadvantage at large table sizes, but consider the following case. A Q-table has the size 14 states × 169 actions = 2366 entries, and (say) a fully connected DNN's parameter count comes out to something like >8000. Space complexity/memory-wise, isn't storing a 2366-entry lookup Q-table better than storing 8000 neural-net parameters? I have never implemented a DNN before, so I have no idea how much space neural-net parameters take. Please give your opinions on this scenario. Moreover, do you think a 2366-entry Q-table is large by the norms people use for Q-learning? I couldn't find any rule of thumb... submitted by /u/Simple-Soil-230 [link] [comments]  ( 83 min )
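    At the common 4 bytes (float32) per stored value, the arithmetic does favor the table in this particular case; a quick back-of-the-envelope sketch (the 8000-parameter figure is taken straight from the question):

    import numpy as np

    ENTRY_BYTES = np.dtype(np.float32).itemsize   # 4 bytes per value

    q_table_entries = 14 * 169                    # 2,366 Q-values
    dqn_params = 8000                             # rough parameter count of the small DNN

    print(f"Q-table : {q_table_entries * ENTRY_BYTES / 1024:.1f} KiB")   # ~9.2 KiB
    print(f"DQN     : {dqn_params * ENTRY_BYTES / 1024:.1f} KiB")        # ~31.3 KiB

    The table only loses once the state-action space grows beyond what can be enumerated; the network's advantage is generalization across states, not storage.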
    FLYING THRU SPACE AT 432HZ | FAST MODE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    A news script generated by InferKit/Talk to Transformer (the words in bold are what I typed)
    Breaking News, a man attempted to steal thousands of Nintendo games from all stores to sell in Eastern Europe. He left many owners out of pocket. The man was spotted by a member of staff in one of the shops. He then jumped on the counter and grabbed the game bundles. This was spotted by one of the customers and the member of staff proceeded to chase him down and hold him until the police arrived. In a statement Nintendo said the man had just finished his shift at a nearby shop and decided to take the games as he had them with him anyway. Nintendo called the incident ‘unusual’ submitted by /u/Wat3rb0t [link] [comments]  ( 83 min )
    Is it fair to describe a human as a system running 2 mandatory functions
    Is it fair to describe a human as a system based on the same instructions as all systems above us, including the universe itself: 2 functions always set to ON, those functions being Self-Correct (survive, adapt) and Self-Duplicate (procreate), in that order, too (obviously), because that allows the parent to be faced with novel challenges and threats to overcome, updating their DNA, then procreating and releasing the new "patch" along with another person's updated DNA. So: constant progression. If I described any form of "life", including a human, this way, would you say I am incorrect? submitted by /u/PrimalJohnStone [link] [comments]  ( 87 min )
  • Open

    How to determine the receptive fields of various layers in CNN?
    submitted by /u/__hy23__ [link] [comments]  ( 82 min )
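    For a plain feed-forward stack of convolutions and poolings, the receptive field can be computed analytically with a well-known recurrence rather than empirically; a minimal sketch (the layer configuration in the example is illustrative):

    def receptive_fields(layers):
        """layers: list of (kernel_size, stride) tuples, input-to-output order.
        Returns the receptive field after each layer using the standard recurrence
        r_out = r_in + (k - 1) * j_in,  j_out = j_in * s,
        where j is the cumulative stride ("jump") between adjacent features."""
        r, j, out = 1, 1, []
        for k, s in layers:
            r = r + (k - 1) * j
            j = j * s
            out.append(r)
        return out

    # e.g. three conv layers: 3x3/stride 1, 3x3/stride 2, 3x3/stride 1
    print(receptive_fields([(3, 1), (3, 2), (3, 1)]))  # [3, 5, 9]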
    Having trouble implementing the derivative of softMax function.
    I'm pretty bad at math, but I'm trying to make my own neural network, and on the output layer I use the softmax function. The problem is that I've looked at most guides, StackOverflow posts, and GitHub repositories, and I just cannot figure out how to implement it in code. All my weights + biases + nodes + activated nodes are stored in matrices. (I'm not looking for math explanations, just an implementation of how to get the deltaOutputWeights and biases.) submitted by /u/uvuvwevwevwe_osas2 [link] [comments]  ( 82 min )
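    For the common case where the softmax output layer is paired with a cross-entropy loss, the full softmax Jacobian collapses and the output delta is simply the probabilities minus the one-hot targets. A minimal NumPy sketch under that assumption (names like a_prev stand in for the asker's matrices; with a different loss you need the full Jacobian p_i*(delta_ij - p_j)):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=0, keepdims=True)   # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=0, keepdims=True)

    # Forward pass: z = W @ a_prev + b ; p = softmax(z)
    # With cross-entropy loss, the output delta simplifies to:
    #   dL/dz = p - y        (y is the one-hot target matrix)
    def output_layer_grads(p, y, a_prev):
        delta = p - y                          # shape (n_out, batch)
        dW = delta @ a_prev.T / p.shape[1]     # gradient w.r.t. output weights
        db = delta.mean(axis=1, keepdims=True) # gradient w.r.t. output biases
        return dW, db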
    Salesforce AI Open-Sources ‘OmniXAI’: A Python-based Machine Learning Library That Provides A One-Stop Explainable AI (XAI) Solution To Analyze, Debug, And Interpret AI Models
    Salesforce has built an open-source machine learning framework called OmniXAI, which stands for Omni eXplainable AI. This library takes an “omni-directional” approach to XAI, with extensive interpretable ML features that address many of the problems with explaining ML model decisions in practice. OmniXAI is a comprehensive one-stop library that makes explainable AI accessible to academics requiring explanations at each stage of the machine learning process, spanning data exploration, feature engineering, model development, evaluation, decision-making, and more. 🚦 A one-stop solution for analyzing different stages in a standard ML pipeline in real-world applications. 🚦 Two types of explanations — local and global 🚦 Includes the most popular explanation methods, such as feature-attribution/importance explanation (LIME [1], SHAP [2], Integrated Gradients (IG) [3], Grad-CAM [4], L2X), counterfactual explanation (MACE [5]), partial dependence plots (PDP), and model-specific methods (linear and tree models) 🚦 Can be applied to tabular, vision, NLP, and time-series tasks. Continue reading | Checkout the paper, article, github, dashboard submitted by /u/No_Coffee_4638 [link] [comments]  ( 83 min )
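    For readers who want a feel for one of the wrapped feature-attribution methods, here is a minimal sketch calling the shap library directly on a toy tabular model. This is deliberately not OmniXAI's own API (which is documented in the linked repo); it just shows the kind of explanation OmniXAI unifies:

    import shap
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = RandomForestClassifier(n_estimators=100).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X.iloc[:50])  # local, per-sample attributions
    # For a classifier, shap_values holds one attribution array per class,
    # each of shape (n_samples, n_features).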
    Could we create a computer that works like the human brain? 🤔
    submitted by /u/tnkrbel2954o8 [link] [comments]  ( 82 min )
  • Open

    [D] Two flaws in discussions surrounding the recent LaMDA controversy: it's not stateless, and it is dual process; but whether it's sentient is far less important than how it would edit Wikipedia
    I'm sure everyone here has heard about the LaMDA sentience controversy by now, so in addition to linking to its arxiv full text ("LaMDA: Language Models for Dialog Applications" by Thoppilan, et al., 2022), I'd also like to correct a few points that I see most people getting wrong. First, unlike plain GPT-3, Davinci, and the like, LaMDA is not stateless. Its sensibleness metric (including whether responses contradict anything said earlier) is fine-tuned by pre-conditioning each turn with many of the most recent interactions, on a user-by-user basis. Its grounding mechanism has the potential to add a great deal more state, if the interactions become part of a database it can query to formulate responses, but as far as I know they haven't done that yet. Secondly, that grounding mechanism m…  ( 91 min )
    [D] Attending ICML 2022 Fully Virtual Attendance
    Hi everyone. I wanted to start a discussion to see whether other accepted authors were planning on attending ICML 2022 fully virtually? From my understanding, we pre-record our talk and they are considering a virtual poster session. Are there any in-person obligations we have as authors? For context, my entire PhD has overlapped with COVID, so everything has been virtual. I would rather not travel and have some other personal plans that overlap with the duration of the conference. I would be interested in other people's views and whether I may be missing a lot by not attending in-person. edit: lol sorry for the incoherent title submitted by /u/generic_r [link] [comments]  ( 84 min )
    [D] Any relatively new text2image models with fine tuning?
    I have a relatively small dataset of 256x256 images with text captions, and it's definitely not the best idea to train something from scratch with that, so I wonder what ways I have to fine-tune something on my dataset. I tried to use something from the DALL-E mini repo, but it does not provide exact code for fine-tuning or enough documentation for me, and I failed to write my own. Similar story with the latent diffusion repo: I couldn't use their training code to fine-tune an existing model, and it seems they didn't even provide enough code for training a text2image model, as their config is not working. The only things I could find were the ruDALL-E and ruDOLPH models, but they are relatively old and, most importantly, they work with Russian rather than English text, which is not what I need. I found some methods for fine-tuning a CLIP model, which seems pretty easy, but I don't know what to do next with it, as something like VQGAN+CLIP works pretty badly in comparison with this year's SOTA solutions. So, if anybody knows any, please: guides, repos, colabs etc. for fine-tuning text2image models are welcome. submitted by /u/Chelokot [link] [comments]  ( 84 min )
    [D] In your experience, what's the thing that can boost an ML model's performance the most? Is it the hyperparameter tuning, feature engineering or ensembling? Or is it something else?
    I'm interested to know which part of ML do engineers invest their time in that actually pays off a lot when it comes to getting well-performing models. Just so I know whether it is right to spend more time trying out different X (say, Feature Eng) configurations in favour of Y (say, Ensembling) configurations. submitted by /u/4bedoe [link] [comments]  ( 92 min )
    [P] Using machine learning in the travel industry - CHALLENGE
    Hello everyone! I am from tryp.com, a travel-tech startup that is using AI to create complex travel itineraries on the go, from minimal user constraints. Trips created in <15s for a defined time search range and start location. Currently we are embarking on a new challenge to improve our offering: creating an AI, trained on screen recordings of purchases across 100s of websites, that can purchase travel tickets from any website, in any language. Has anyone worked on a similar challenge? We are looking to form a team to tackle it! submitted by /u/arangel96 [link] [comments]  ( 85 min )
    [P] Colab Themes: A Chrome Extension to Customize the Style of Google Colab
    Changes the page CSS and text editor, and generates Python code to change Matplotlib styles to match the theme the user chooses. Users may import themes or use any of the 50+ provided. Colab Themes enhances the data science experience by transforming the way users view their code and their data! Check it out on Github or install it via the Chrome Webstore submitted by /u/d8aDev [link] [comments]  ( 85 min )
    [R] PowerShap: A power-full Shapley feature selection method.
    This method uses statistical hypothesis testing and power calculations on Shapley values, enabling fast and intuitive wrapper-based feature selection. The complete library and methods are fully compatible with Sklearn, LightGBM, and CatBoost, with more coming in future releases; the library can be found here: https://github.com/predict-idlab/powershap! The library is open-source and usable out-of-the-box, as shown in the video! The paper has already been released on arXiv: https://arxiv.org/abs/2206.08394. Furthermore, the work will be presented at ECML PKDD 2022. How does it work? The complete method is built on the assumption that a random feature, which contains no information, should have a lower impact on the predictions than an informative feature. To test this, PowerS…  ( 86 min )
    [D] Best program (text editor) to use for creating a neural network (GAN) in python?
    I am a master's student writing my dissertation about using GANs to generate classical music. I am studying operations research (applied math) so all my coding experience is with R, except for one Python class I took in 2017 where we used Thonny as an interface. I am comfortable with the mathematical theory behind neural networks and deep learning, and can create them comfortably in R, but my supervisor (as well as an earlier post in this sub) recommends using Python for GANs. I am very familiar with R (and always use Rstudio) but am essentially a rookie when it comes to Python. Thus I am curious about what text editor you think would be best suited for this task (my friends have mentioned Atom but wanted to check here too). I will only be using this editor for creating the generative adversarial network, so if it's intuitive and easy to use that's ideal. I assume that the easiest way to run the code is just through terminal, unless you have any suggestions about that as well? Also, if you generally have any tips for creating NNs in python that simplify the process or pro-tips, that would be much appreciated too! Thank you:) submitted by /u/carl535 [link] [comments]  ( 86 min )
    [D] Reducing bias when forecasting retail sales with boosting model
    I'm forecasting future sales for products in retail stores, using a LightGBM model. My model has a decent forecast accuracy, but the forecasts are biased (the average forecast error is negative, the model is consistently under-forecasting). Do you have any idea or tips on how to avoid bias when forecasting time series with boosting models? Here are some more details: I'm making forecasts at the Day x Product x Store granularity (i.e 1 forecast every day for each product in each store). The forecasting horizon is +7 days. I'm training a single model to forecast all products, stores and time horizons. The main features are lags of sales, calendar info (day of the week, month...), product info (category, price) and store info. Evaluation is made with a time-based cross-validation. Thank you for your help! submitted by /u/ML-ATF [link] [comments]  ( 84 min )
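    One low-effort option is a post-hoc additive correction: estimate the mean residual on a held-out time-based fold and add it back at prediction time. Other options include an asymmetric custom objective that penalizes under-forecasts more, or LightGBM's tweedie objective for skewed sales targets. A minimal sketch of the additive correction with synthetic stand-in data (the real pipeline's features and splits would replace the placeholders):

    import numpy as np
    import lightgbm as lgb

    # Synthetic stand-in for the real time-ordered features and sales
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    y = X[:, 0] * 2 + rng.exponential(1.0, size=5000)      # skewed target
    X_train, y_train = X[:3000], y[:3000]                  # time-based split
    X_val, y_val = X[3000:4000], y[3000:4000]
    X_test = X[4000:]

    model = lgb.LGBMRegressor(objective="regression").fit(X_train, y_train)

    # Estimate the systematic under-forecast on a held-out fold
    # and fold it back into future predictions.
    bias = np.mean(y_val - model.predict(X_val))           # > 0 means under-forecasting
    corrected = model.predict(X_test) + bias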
    [D] Whats the current state of the art in image style transfer?
    Diffusion models like Dall E are producing incredible images. What's the current state of the art for taking one image and combining it with the style from another? Could anyone point me to a handful of references please? submitted by /u/Razcle [link] [comments]  ( 85 min )
    [D] When to post on Arxiv?
    I ask the question with respect to culture rather than practice (i.e. I could obviously post just about anything!) but as I'm new to research in the field I am curious to know if it is used to post working papers or whether it is more typical to prepublish work that has already been sent to a conference/journal? If an Arxiv paper gets traction/interest can it then be sent to a conference or journal later on without self plagiarising? submitted by /u/Swimming-Pool397 [link] [comments]  ( 89 min )
    [D] Any research specific PyTorch based boilerplate code?
    Any research-specific PyTorch-based boilerplate code? I am a PhD student working on deep learning based NLP methods. I am trying to develop boilerplate code of my own. Looking for inspiration or ideas. submitted by /u/Relative_Tip_3647 [link] [comments]  ( 85 min )
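    Established options worth studying include PyTorch Lightning and Hugging Face's Trainer. If rolling your own, the core of most boilerplates is a loop like the following minimal skeleton (hyperparameters and structure are illustrative; logging, checkpointing, and evaluation would be layered on top):

    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    def train(model, dataset, epochs=3, lr=3e-4,
              device="cuda" if torch.cuda.is_available() else "cpu"):
        """Minimal training-loop skeleton; swap in your own model, dataset, and loss."""
        model = model.to(device)
        loader = DataLoader(dataset, batch_size=32, shuffle=True)
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(epochs):
            for xb, yb in loader:
                xb, yb = xb.to(device), yb.to(device)
                opt.zero_grad()
                loss = loss_fn(model(xb), yb)
                loss.backward()
                opt.step()
            print(f"epoch {epoch}: last-batch loss {loss.item():.4f}")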
    [D] Laptops with NVIDIA Mobile GPUs are better option than Apple Silicon for ML/DL Tasks
    It is really disappointing to find out that Apple Silicon based machines do not keep up with even the mobile Nvidia GPUs present in laptops. They marketed the machines as if they were the best, with the unique unified memory architecture, astonishing memory bandwidth, powerful GPU cores, etc. They released the M1 Pro, M1 Max and even M1 Ultra. All of these are just overpriced chips offering no significant value for money. One can easily get any laptop with an NVIDIA 3080 mobile GPU, and it would be 1) cheaper and 2) have much better performance than even the M1 Ultra. Sure, the battery life and the Apple ecosystem are good. However, if it is going to take 30 mins per epoch on an M1 Pro/Max, whereas it takes just 5 mins per epoch on these Nvidia mobile GPUs, I think it's a no-brainer to just go with Nvidia based laptops for ML/DL workflows. Would love to hear the opinions of others on this. If anyone has some more benchmarks, do share them here. You could make use of the unified memory, increase the batch size and then compare how much of a performance improvement it makes. But I still think it might not be able to compete with an Nvidia 3080 Mobile. EDIT: I'm just saying that if you ever have to train something on your laptop in a local environment, just for testing purposes before you actually use cloud resources to train the final model, the process will be slower on Apple Silicon compared to Nvidia mobile GPUs. Cloud-based resources charge you per hour, so it's better to test locally and then do just the final training in the cloud. My complaint is that Apple could definitely up their game, and they still have a long way to go. They have been comparing their chips with dedicated GPUs like NVIDIA's in their presentations and keynotes, and keep showing them as better. In reality it depends on the task, and they are definitely not better for ML/DL tasks. submitted by /u/Rohit901 [link] [comments]  ( 94 min )
    [D] Higher order arity in image-based object detection models? Transfer learning: objects → attributes → relations
    Convolutional neural networks have a well-known track record when it comes to detecting objects in images. A person, a cat, a helicopter; given enough examples pretty much any discrete visible entity is learnable. But from the perspective of human language, this kind of model only produces nouns. Or in terms of arity (aka adicity/degree/valency/rank), one might say these are all nullary functions/clauses. In other words, they're concepts that can be expressed without any contextual variables/arguments. One step up on the arity scale are of course unary functions. Simply put: attributes. "Large", "narrow", "heavy", "soft", "green" etc. are concepts that only make sense in combination with a context argument defining the object described/modified by the attribute. Binary (and any larger arity) functions are what we usually think of as relations. "Larger than", "attached to", "on top of", "behind", "next to" etc. are concepts that need (at least) two context arguments. Anyway, back to machine learning. It seems to me that concepts of higher arity should also be learnable from image examples just fine, provided that context-defining features are included in the input data along with the raw visual data. For example, spatial relations such as "behind"/"in front of" and "below"/"above" should be inferrable when 2 bounding boxes (or polygons, etc.) are included in the input samples. I imagine this pattern to be quite amenable to transfer learning, given that those bounding boxes themselves could be the outputs of a conventional object detection model. Are there popular models out there that can make such relational predictions? Also, is there an established convention on how to encode context-defining features? What words should I Google to read up on relevant literature? (Sorry about the noob(-ish?!) content, but I didn't get any response over at /r/MLQuestions.) submitted by /u/WouldNotLickYourAnus [link] [comments]  ( 86 min )
    [R] Evolution through Large Models
    submitted by /u/hardmaru [link] [comments]  ( 84 min )
  • Open

    Reinventing or Reusing? Home-made vs Third-party Solutions
    Say you need to implement some machine learning system. Should you purchase a product, re-use open-source code, or develop your own algorithms? The decision does not need to be a binary one. I discuss the pluses and minuses of each option. Combining them offers the best of both worlds. I explain with examples how to… Read More »Reinventing or Reusing? Home-made vs Third-party Solutions The post Reinventing or Reusing? Home-made vs Third-party Solutions appeared first on Data Science Central.  ( 21 min )
    What type of Data Does a Sankey Diagram Generally Use?
    Operating in an environment that deals with complex data types can be stressful, especially without the right support. Data visualization breaks complex data down into simple, flexible elements that you can work with easily. However, you need a good data visualization tool that can make… Read More »What type of Data Does a Sankey Diagram Generally Use? The post What type of Data Does a Sankey Diagram Generally Use? appeared first on Data Science Central.  ( 21 min )
  • Open

    Trying to create an observation space, but nothing I do seems to work
    So just to preface, my reset() needs to return 7 integers; 3 of them are either 1 or 0, and the other 4 can be any number from 0-6. Initially, I tried to use the spaces.Dict method of creating the spaces. In the init():

        space = {
            "left_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32),
            "mid_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32),
            "right_line": spaces.Box(low=np.array([0]), high=np.array([1]), dtype=np.int32),
            "left_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32),
            "front_left_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32),
            "front_right_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.int32),
            "right_prox": spaces.Box(low=np.array([0]), high=np.array([6]), dtype=np.…  ( 84 min )
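    Since reset() returns a flat vector of small integers, a simpler fit than a Dict of Boxes may be gym's MultiDiscrete space, which takes the number of possible values per dimension; a minimal sketch:

    import numpy as np
    from gym import spaces

    # 7 integer observations: 3 binary line sensors + 4 proximity sensors in [0, 6].
    # MultiDiscrete takes the count of values per dimension (2 -> {0,1}, 7 -> {0..6}).
    observation_space = spaces.MultiDiscrete([2, 2, 2, 7, 7, 7, 7])

    def reset():
        # reset() must return something observation_space.contains(...) accepts
        obs = np.array([0, 1, 0, 3, 6, 2, 0], dtype=np.int64)
        assert observation_space.contains(obs)
        return obs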
    Double Q-learning in SB3's SAC implementation?
    Hello, According to this change, SAC and TD3 in the SB3 implementation can take an arbitrary number of critics. Indeed, if we check the source code for e.g. SAC's train function, we find:

        next_q_values = th.cat(self.critic_target(replay_data.next_observations, next_actions), dim=1)
        next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True)
        # ...
        q_values_pi = th.cat(self.critic(replay_data.observations, actions_pi), dim=1)
        min_qf_pi, _ = th.min(q_values_pi, dim=1, keepdim=True)

    There, the minimum over the n=2 critic networks is taken for each sample in the batch to calculate both the actor and critic loss. I looked everywhere, but I found no particular documentation of why this is being done. I assume this is simply the double Q-learning trick being applied. Can someone confirm or refute this? Further, is it best practice to simply slap double Q-learning onto any value-based RL method? Does anyone have experience with more than `n_critics=2`, i.e. does n-fold Q-learning stabilize training significantly beyond just double Q-learning? Just some thoughts that I had nobody else to share with... submitted by /u/IAmMiddy [link] [comments]  ( 83 min )
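    For reference, the quoted lines implement what TD3 (Fujimoto et al., 2018) calls clipped double Q-learning, a variant of the classic double Q-learning idea: the bootstrap target takes the elementwise minimum over the critic ensemble so it under- rather than over-estimates. A sketch of the target computation in that style (function and argument names are illustrative, not SB3's):

    import torch as th

    def clipped_double_q_target(critic_targets, next_obs, next_actions,
                                rewards, dones, gamma=0.99):
        """Clipped double-Q trick (TD3/SAC): elementwise min over n target critics.
        Each critic maps (obs, action) -> Q-values of shape (batch, 1)."""
        q_next = th.cat([q(next_obs, next_actions) for q in critic_targets], dim=1)
        q_next, _ = th.min(q_next, dim=1, keepdim=True)   # min over critics, per sample
        return rewards + (1.0 - dones) * gamma * q_next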
    [QUESTION] Number of possible joint policies in a Dec-POMDP and the time required to evaluate each one.
    Hi everyone, I was reading a book about Dec-POMDPs and came across this curious result where the author specifies the number of possible joint policies to evaluate and the time needed to evaluate a single joint policy, but I can't understand how he got to these results. Can anyone please explain the logic used here? https://preview.redd.it/etef8zsmks691.png?width=900&format=png&auto=webp&s=89f1864b24798bda08cbfbbe76e5c5b03c5f3937 submitted by /u/souhaielbensalem [link] [comments]  ( 84 min )
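    A likely source of such numbers is the standard counting argument (as in Oliehoek & Amato's Dec-POMDP text): a deterministic policy maps every observation history to an action, and the number of histories of lengths 0..h-1 is a geometric series. Sketched in LaTeX, with the hedge that the book's exact constants may differ; n agents, at most |A*| actions and |O*| observations each, horizon h:

    % Histories per agent (geometric series):
    %   \sum_{t=0}^{h-1} |O_*|^t = \frac{|O_*|^h - 1}{|O_*| - 1}
    \[
      \#\text{joint policies} = O\!\left( |A_*|^{\, n \frac{|O_*|^{h}-1}{|O_*|-1}} \right),
      \qquad
      \text{evaluating one policy} = O\!\left( |S| \cdot |\vec{O}|^{\, h-1} \right),
    \]
    % where |\vec{O}| = \prod_i |O_i| is the number of joint observations:
    % evaluation sums expected reward over all reachable
    % (state, joint observation history) pairs.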
    'numpy.random._generator.Generator' object has no attribute 'randint'
    So I heard that this error was a bug in the stable_baselines3 module. How do I fix this? submitted by /u/ableflyer [link] [comments]  ( 82 min )
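    Independently of the stable_baselines3 specifics, the error itself means some code is calling .randint() on NumPy's newer Generator class, which only has .integers(); the legacy RandomState keeps .randint(). Pinning versions so the library receives the class it expects, or patching the offending call, both work:

    import numpy as np

    rng = np.random.default_rng(42)      # returns a Generator, not a RandomState
    # rng.randint(0, 10)                 # AttributeError: Generator has no 'randint'
    x = rng.integers(0, 10)              # Generator's equivalent method

    legacy = np.random.RandomState(42)   # the old API, which does have randint
    y = legacy.randint(0, 10)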
    V-MPO - what do you think
    V-MPO seems to be the state of the art used by DeepMind nowadays. It has been 3 years since the paper was published, yet there are very few public implementations online. I was wondering why, and whether anybody has ever managed to reproduce its results? I couldn't with the version I partially recoded from the internet, but this may come from misunderstandings on my side. submitted by /u/Jogima-cyber [link] [comments]  ( 82 min )
    POV: You’re an Animo watching your entire island burn in our reinforcement learning game🤖🔥🏝️
    submitted by /u/AnimoIsland [link] [comments]  ( 82 min )
    Why do DQN learning-based methods dominate the leaderboards for Atari Games?
    https://preview.redd.it/efbfdsbhmo691.png?width=964&format=png&auto=webp&s=4593b345d28e393447c4cf66af2abdbca72309c9 Everywhere that I have read, policy-based methods are supposed to be more robust and converge faster than value-based methods. Why does this table contradict that? Edit: Link to image: Atari games Benchmark (Atari Games) | Papers With Code submitted by /u/atomicburn125 [link] [comments]  ( 87 min )
  • Open

    Build an appointment scheduler interface integrated with Meta using Amazon Lex and Amazon Connect
    This blog post is co-written with Nick Vargas and Anna Schreiber from Accenture. Scheduling customer appointments is often a manual and labor-intensive process. You can utilize advances in self-service technology to automate appointment scheduling. In this blog post, we show you how to build a self-service appointment scheduling solution with Amazon Lex and Amazon […]  ( 10 min )
  • Open

    The King’s Swedish: AI Rewrites the Book in Scandinavia
    If the King of Sweden wants help drafting his annual Christmas speech this year, he could ask the same AI model that’s available to his 10 million subjects. As a test, researchers prompted the model, called GPT-SW3, to draft one of the royal messages, and it did a pretty good job, according to Magnus Sahlgren, Read article > The post The King’s Swedish: AI Rewrites the Book in Scandinavia appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Achieving Fairness at No Utility Cost via Data Reweighing with Influence. (arXiv:2202.00787v2 [cs.LG] UPDATED)
    With the fast development of algorithmic governance, fairness has become a compulsory property for machine learning models to suppress unintentional discrimination. In this paper, we focus on the pre-processing aspect for achieving fairness, and propose a data reweighing approach that only adjusts the weight for samples in the training phase. Different from most previous reweighing methods which usually assign a uniform weight for each (sub)group, we granularly model the influence of each training sample with regard to fairness-related quantity and predictive utility, and compute individual weights based on influence under the constraints from both fairness and utility. Experimental results reveal that previous methods achieve fairness at a non-negligible cost of utility, while as a significant advantage, our approach can empirically release the tradeoff and obtain cost-free fairness for equal opportunity. We demonstrate the cost-free fairness through vanilla classifiers and standard training processes, compared to baseline methods on multiple real-world tabular datasets. Code available at https://github.com/brandeis-machine-learning/influence-fairness.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v1 [cs.LG])
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision works layer-wise, i.e., it uses different bit-widths for the weights and activations tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.  ( 2 min )
    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v1 [stat.ML])
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.
    What do navigation agents learn about their environment?. (arXiv:2206.08500v1 [cs.CV])
    Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which often involve reasoning about the structure of the environment, target characteristics and the outcome of one's actions. In this paper, we introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents. We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment. We demonstrate interesting insights about navigation agents using iSEE, including the ability to encode reachable locations (to avoid obstacles), visibility of the target, progress from the initial spawn location as well as the dramatic effect on the behaviors of agents when we mask out critical individual neurons. The code is available at: https://github.com/allenai/iSEE  ( 2 min )
    Detecting Adversarial Examples in Batches -- a geometrical approach. (arXiv:2206.08738v1 [cs.LG])
    Many deep learning methods have successfully solved complex tasks in computer vision and speech recognition applications. Nonetheless, the robustness of these models has been found to be vulnerable to perturbed inputs or adversarial examples, which are imperceptible to the human eye, but lead the model to erroneous output decisions. In this study, we adapt and introduce two geometric metrics, density and coverage, and evaluate their use in detecting adversarial samples in batches of unseen data. We empirically study these metrics using MNIST and two real-world biomedical datasets from MedMNIST, subjected to two different adversarial attacks. Our experiments show promising results for both metrics to detect adversarial examples. We believe that this work can lay the ground for further study on these metrics' use in deployed machine learning systems to monitor for possible attacks by adversarial examples or related pathologies such as dataset shift.
    SafeRL-Kit: Evaluating Efficient Reinforcement Learning Methods for Safe Autonomous Driving. (arXiv:2206.08528v1 [cs.LG])
    Safe reinforcement learning (RL) has achieved significant success on risk-sensitive tasks and shown promise in autonomous driving (AD) as well. Considering the distinctiveness of this community, efficient and reproducible baselines are still lacking for safe AD. In this paper, we release SafeRL-Kit to benchmark safe RL methods for AD-oriented tasks. Concretely, SafeRL-Kit contains several latest algorithms specific to zero-constraint-violation tasks, including Safety Layer, Recovery RL, off-policy Lagrangian method, and Feasible Actor-Critic. In addition to existing approaches, we propose a novel first-order method named Exact Penalty Optimization (EPO) and sufficiently demonstrate its capability in safe AD. All algorithms in SafeRL-Kit are implemented (i) under the off-policy setting, which improves sample efficiency and can better leverage past logs; (ii) with a unified learning framework, providing off-the-shelf interfaces for researchers to incorporate their domain-specific knowledge into fundamental safe RL methods. Conclusively, we conduct a comparative evaluation of the above algorithms in SafeRL-Kit and shed light on their efficacy for safe autonomous driving. The source code is available at \href{ https://github.com/zlr20/saferl_kit}{this https URL}.
    On Integrating Prior Knowledge into Gaussian Processes for Prognostic Health Monitoring. (arXiv:2206.08600v1 [stat.ML])
    Gaussian process regression is a powerful method for predicting states based on given data. It has been successfully applied for probabilistic predictions of structural systems to quantify, for example, the crack growth in mechanical structures. Typically, predefined mean and covariance functions are employed to construct the Gaussian process model. Then, the model is updated using current data during operation while prior information based on previous data is ignored. However, predefined mean and covariance functions without prior information reduce the potential of Gaussian processes. This paper proposes a method to improve the predictive capabilities of Gaussian processes. We integrate prior knowledge by deriving the mean and covariance functions from previous data. More specifically, we first approximate previous data by a weighted sum of basis functions and then derive the mean and covariance functions directly from the estimated weight coefficients. Basis functions may be either estimated or derived from problem-specific governing equations to incorporate physical information. The applicability and effectiveness of this approach are demonstrated for fatigue crack growth, laser degradation, and milling machine wear data. We show that well-chosen mean and covariance functions, like those based on previous data, significantly increase look-ahead time and accuracy. Using physical basis functions further improves accuracy. In addition, computation effort for training is significantly reduced.
    All Mistakes Are Not Equal: Comprehensive Hierarchy Aware Multi-label Predictions (CHAMP). (arXiv:2206.08653v1 [cs.LG])
    This paper considers the problem of Hierarchical Multi-Label Classification (HMC), where (i) several labels can be present for each example, and (ii) labels are related via a domain-specific hierarchy tree. Guided by the intuition that all mistakes are not equal, we present Comprehensive Hierarchy Aware Multi-label Predictions (CHAMP), a framework that penalizes a misprediction depending on its severity as per the hierarchy tree. While there have been works that apply such an idea to single-label classification, to the best of our knowledge, there are limited such works for multilabel classification focusing on the severity of mistakes. The key reason is that there is no clear way of quantifying the severity of a misprediction a priori in the multilabel setting. In this work, we propose a simple but effective metric to quantify the severity of a mistake in HMC, naturally leading to CHAMP. Extensive experiments on six public HMC datasets across modalities (image, audio, and text) demonstrate that incorporating hierarchical information leads to substantial gains, as CHAMP improves both AUPRC (2.6% median percentage improvement) and hierarchical metrics (2.85% median percentage improvement) over stand-alone hierarchical or multilabel classification methods. Compared to standard multilabel baselines, CHAMP provides improved AUPRC in both robustness (8.87% mean percentage improvement) and low-data regimes. Further, our method provides a framework to enhance existing multilabel classification algorithms with better mistakes (18.1% mean percentage increment).
    Strategic Representation. (arXiv:2206.08542v1 [cs.LG])
    Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused -- strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al. 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal `nothing but the truth' but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.
    Reconstructing vehicles from orthographic drawings using deep neural networks. (arXiv:2206.08789v1 [cs.CV])
    This paper explores the current state-of-the-art of object reconstruction from multiple orthographic drawings using deep neural networks. It proposes two algorithms to extract multiple views from a single image. The paper proposes a system based on pixel-aligned implicit functions (PIFu) and develops an advanced sampling strategy to generate signed distance samples. It also compares this approach to depth map regression from multiple views. Additionally, the paper uses a novel dataset for vehicle reconstruction from the racing game Assetto Corsa, which features higher quality models than the commonly used ShapeNET dataset. The trained neural network generalizes well to real-world inputs and creates plausible and detailed reconstructions.  ( 2 min )
    Accelerating Shapley Explanation via Contributive Cooperator Selection. (arXiv:2206.08529v1 [cs.LG])
    Even though Shapley value provides an effective explanation for a DNN model prediction, the computation relies on the enumeration of all possible input feature coalitions, which leads to exponentially growing complexity. To address this problem, we propose a novel method SHEAR to significantly accelerate the Shapley explanation for DNN models, where only a few coalitions of input features are involved in the computation. The selection of the feature coalitions follows our proposed Shapley chain rule to minimize the absolute error from the ground-truth Shapley values, such that the computation can be both efficient and accurate. To demonstrate the effectiveness, we comprehensively evaluate SHEAR across multiple metrics including the absolute error from the ground-truth Shapley value, the faithfulness of the explanations, and running speed. The experimental results indicate SHEAR consistently outperforms state-of-the-art baseline methods across different evaluation metrics, which demonstrates its potential in real-world applications where computational resources are limited.
    Plotly-Resampler: Effective Visual Analytics for Large Time Series. (arXiv:2206.08703v1 [cs.HC])
    Visual analytics is arguably the most important step in getting acquainted with your data. This is especially the case for time series, as this data type is hard to describe and cannot be fully understood when using for example summary statistics. To realize effective time series visualization, four requirements have to be met; a tool should be (1) interactive, (2) scalable to millions of data points, (3) integrable in conventional data science environments, and (4) highly configurable. We observe that open source Python visualization toolkits empower data scientists in most visual analytics tasks, but lack the combination of scalability and interactivity to realize effective time series visualization. As a means to facilitate these requirements, we created Plotly-Resampler, an open source Python library. Plotly-Resampler is an add-on for Plotly's Python bindings, enhancing line chart scalability on top of an interactive toolkit by aggregating the underlying data depending on the current graph view. Plotly-Resampler is built to be snappy, as the reactivity of a tool qualitatively affects how analysts visually explore and analyze data. A benchmark task highlights how our toolkit scales better than alternatives in terms of number of samples and time series. Additionally, Plotly-Resampler's flexible data aggregation functionality paves the path towards researching novel aggregation techniques. Plotly-Resampler's integrability, together with its configurability, convenience, and high scalability, allows to effectively analyze high-frequency data in your day-to-day Python environment.
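    A minimal usage sketch, following the pattern shown in the project's README at the time of writing (the exact API may have evolved; see https://github.com/predict-idlab/plotly-resampler):

    import numpy as np
    import plotly.graph_objects as go
    from plotly_resampler import FigureResampler

    # A few million points: far more than a browser can render responsively
    x = np.arange(2_000_000)
    y = np.sin(x / 5_000) + np.random.randn(len(x)) * 0.1

    fig = FigureResampler(go.Figure())                      # wraps a regular figure
    fig.add_trace(go.Scattergl(name="noisy sine"), hf_x=x, hf_y=y)
    fig.show_dash(mode="inline")  # interactive view that re-aggregates on zoom/pan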
    The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysis. (arXiv:2206.08917v1 [cond-mat.mtrl-sci])
    Computational catalysis and machine learning communities have made considerable progress in developing machine learning models for catalyst discovery and design. Yet, a general machine learning potential that spans the chemical space of catalysis is still out of reach. A significant hurdle is obtaining access to training data across a wide range of materials. One important class of materials where data is lacking are oxides, which inhibits models from studying the Oxygen Evolution Reaction and oxide electrocatalysis more generally. To address this we developed the Open Catalyst 2022(OC22) dataset, consisting of 62,521 Density Functional Theory (DFT) relaxations (~9,884,504 single point calculations) across a range of oxide materials, coverages, and adsorbates (*H, *O, *N, *C, *OOH, *OH, *OH2, *O2, *CO). We define generalized tasks to predict the total system energy that are applicable across catalysis, develop baseline performance of several graph neural networks (SchNet, DimeNet++, ForceNet, SpinConv, PaiNN, GemNet-dT, GemNet-OC), and provide pre-defined dataset splits to establish clear benchmarks for future efforts. For all tasks, we study whether combining datasets leads to better results, even if they contain different materials or adsorbates. Specifically, we jointly train models on Open Catalyst 2020 (OC20) Dataset and OC22, or fine-tune pretrained OC20 models on OC22. In the most general task, GemNet-OC sees a ~32% improvement in energy predictions through fine-tuning and a ~9% improvement in force predictions via joint training. Surprisingly, joint training on both the OC20 and much smaller OC22 datasets also improves total energy predictions on OC20 by ~19%. The dataset and baseline models are open sourced, and a public leaderboard will follow to encourage continued community developments on the total energy tasks and data.
    Orthonormal Expansions for Translation-Invariant Kernels. (arXiv:2206.08648v1 [math.CA])
    We present a general Fourier analytic technique for constructing orthonormal basis expansions of translation-invariant kernels from orthonormal bases of $\mathscr{L}_2(\mathbb{R})$. This allows us to derive explicit expansions on the real line for (i) Mat\'ern kernels of all half-integer orders in terms of associated Laguerre functions, (ii) the Cauchy kernel in terms of rational functions, and (iii) the Gaussian kernel in terms of Hermite functions.
    Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning. (arXiv:2206.08686v1 [cs.RO])
    Achieving human-level dexterity is an important open problem in robotics. However, tasks of dexterous hand manipulation, even at the baby level, are challenging to solve through reinforcement learning (RL). The difficulty lies in the high degrees of freedom and the required cooperation among heterogeneous agents (e.g., joints of fingers). In this study, we propose the Bimanual Dexterous Hands Benchmark (Bi-DexHands), a simulator that involves two dexterous hands with tens of bimanual manipulation tasks and thousands of target objects. Specifically, tasks in Bi-DexHands are designed to match different levels of human motor skills according to cognitive science literature. We built Bi-DexHands in Isaac Gym; this enables highly efficient RL training, reaching 30,000+ FPS on only a single NVIDIA RTX 3090. We provide a comprehensive benchmark for popular RL algorithms under different settings; this includes Single-agent/Multi-agent RL, Offline RL, Multi-task RL, and Meta RL. Our results show that the PPO type of on-policy algorithms can master simple manipulation tasks that are equivalent to those of up-to-48-month-old human babies (e.g., catching a flying object, opening a bottle), while multi-agent RL can further help to master manipulations that require skilled bimanual cooperation (e.g., lifting a pot, stacking blocks). Despite the success on each single task, when it comes to acquiring multiple manipulation skills, existing RL algorithms fail to work in most of the multi-task and few-shot learning settings, which calls for more substantial development from the RL community. Our project is open sourced at https://github.com/PKU-MARL/DexterousHands.
    TLETA: Deep Transfer Learning and Integrated Cellular Knowledge for Estimated Time of Arrival Prediction. (arXiv:2206.08513v1 [cs.LG])
    Vehicle arrival time prediction has been studied widely. With the emergence of IoT devices and deep learning techniques, estimated time of arrival (ETA) has become a critical component in intelligent transportation systems. Though many tools exist for ETA, ETA for special vehicles, such as ambulances, fire engines, etc., is still challenging due to the limited amount of traffic data for special vehicles. Existing works use one model for all types of vehicles, which can lead to low accuracy. To tackle this, as the first in the field, we propose a deep transfer learning framework TLETA for driving time prediction. TLETA constructs cellular spatial-temporal knowledge grids for extracting driving patterns, combined with the road network structure embedding to build a deep neural network for ETA. TLETA contains transferable layers to support knowledge transfer between different categories of vehicles. Importantly, our transfer models only train the last layers to map the transferred knowledge, which reduces the training time significantly. The experimental studies show that our model predicts travel time with high accuracy and outperforms many state-of-the-art approaches.
    Learning Fair Representation via Distributional Contrastive Disentanglement. (arXiv:2206.08743v1 [cs.LG])
    Learning fair representation is crucial for achieving fairness or debiasing sensitive information. Most existing works rely on adversarial representation learning to inject some invariance into representation. However, adversarial learning methods are known to suffer from relatively unstable training, and this might harm the balance between fairness and predictiveness of representation. We propose a new approach, learning FAir Representation via distributional CONtrastive Variational AutoEncoder (FarconVAE), which induces the latent space to be disentangled into sensitive and nonsensitive parts. We first construct the pair of observations with different sensitive attributes but with the same labels. Then, FarconVAE enforces each non-sensitive latent to be closer, while sensitive latents to be far from each other and also far from the non-sensitive latent by contrasting their distributions. We provide a new type of contrastive loss motivated by Gaussian and Student-t kernels for distributional contrastive learning with theoretical analysis. Besides, we adopt a new swap-reconstruction loss to boost the disentanglement further. FarconVAE shows superior performance on fairness, pretrained model debiasing, and domain generalization tasks from various modalities, including tabular, image, and text.
    Digital Twin Data Modelling by Randomized Orthogonal Decomposition and Deep Learning. (arXiv:2206.08659v1 [math.NA])
    A digital twin is a surrogate model whose main feature is to mirror the original process behavior. Associating the dynamical process with a digital twin model of reduced complexity has the significant advantage of mapping the dynamics with high accuracy and reduced costs in CPU time and hardware, on timescales over which the process undergoes significant changes and is thus difficult to explore. This paper introduces a new framework for creating efficient digital twin models of fluid flows. We introduce a novel algorithm that combines the advantages of Krylov based dynamic mode decomposition with proper orthogonal decomposition and outperforms the selection of the most influential modes. We prove that the randomized orthogonal decomposition algorithm provides several advantages over SVD empirical orthogonal decomposition methods and mitigates the projection error by formulating a multiobjective optimization problem. We involve the state-of-the-art artificial intelligence Deep Learning (DL) to perform a real-time adaptive calibration of the digital twin model, with increasing fidelity. The output is a high-fidelity DIGITAL TWIN DATA MODEL of the fluid flow dynamics, with the advantage of a reduced complexity. The new modelling tools are investigated in the numerical simulation of three wave phenomena with increasing complexity. We show that the outputs are consistent with the original source data. We perform a thorough assessment of the performance of the new digital twin data models, in terms of numerical accuracy and computational efficiency, including a time simulation response feature study.  ( 2 min )
    Prediction of Solar Radiation Based on Spatial and Temporal Embeddings for Solar Generation Forecast. (arXiv:2206.08832v1 [cs.LG])
    A novel method is proposed for real-time solar generation forecasting using weather data, exploiting both spatial and temporal structural dependencies. The network observed over time is projected to a lower-dimensional representation where a variety of weather measurements are used to train a structured regression model, while the weather forecast is used at the inference stage. Experiments were conducted at 288 locations in the San Antonio, TX area on data obtained from the National Solar Radiation Database. The model predicts solar irradiance with good accuracy (R² = 0.91 for summer, 0.85 for winter, and 0.89 for the global model). The best accuracy was obtained by the Random Forest Regressor. Multiple experiments were conducted to characterize the influence of missing data and different time horizons, providing evidence that the new algorithm is robust for data missing not only completely at random but also when the missingness mechanism is spatial or temporal.
    Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v1 [cs.CV])
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a cross-modal encoder, or feed the last-layer uni-modal features directly into the top cross-modal encoder, ignoring the semantic information at the different levels in the deep uni-modal encoders. Both approaches possibly restrict vision-language representation learning and limit model performance. In this paper, we introduce multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables comprehensive bottom-up interactions between visual and textual representations at different semantic levels, resulting in more effective cross-modal alignment and fusion. Our proposed Bridge-Tower, pre-trained with only $4$M images, achieves state-of-the-art performance on various downstream vision-language tasks. On the VQAv2 test-std set, Bridge-Tower achieves an accuracy of $78.73\%$, outperforming the previous state-of-the-art METER model by $1.09\%$ with the same pre-training data and almost no additional parameters and computational cost. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of $81.15\%$, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code is available at https://github.com/microsoft/BridgeTower.
    Sparse Double Descent: Where Network Pruning Aggravates Overfitting. (arXiv:2206.08684v1 [cs.LG])
    People usually believe that network pruning not only reduces the computational cost of deep networks, but also prevents overfitting by decreasing model capacity. However, our work surprisingly discovers that network pruning sometimes even aggravates overfitting. We report an unexpected sparse double descent phenomenon that, as we increase model sparsity via network pruning, test performance first gets worse (due to overfitting), then gets better (due to relieved overfitting), and gets worse at last (due to forgetting useful information). While recent studies focused on the deep double descent with respect to model overparameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we have three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation that the curve of $\ell_{2}$ learning distance of sparse models (from initialized parameters to final parameters) may correlate with the sparse double descent curve well and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.
    Machine Learning-Driven Process of Alumina Ceramics Laser Machining. (arXiv:2206.08747v1 [cs.CE])
    Laser machining is a highly flexible non-contact manufacturing technique that has been employed widely across academia and industry. Due to nonlinear interactions between light and matter, simulation methods are extremely crucial, as they help enhance the machining quality by offering comprehension of the inter-relationships between the laser processing parameters. On the other hand, experimental processing parameter optimization recommends a systematic, and consequently time-consuming, investigation over the available processing parameter space. An intelligent strategy is to employ machine learning (ML) techniques to capture the relationship between picosecond laser machining parameters for finding proper parameter combinations to create the desired cuts on industrial-grade alumina ceramic with deep, smooth and defect-free patterns. Laser parameters such as beam amplitude and frequency, scanner passing speed and the number of passes over the surface, as well as the vertical distance of the scanner from the sample surface, are used for predicting the depth, top width, and bottom width of the engraved channels using ML models. Owing to the complex correlation between laser parameters, it is shown that Neural Networks (NN) are the most efficient in predicting the outputs. Equipped with an ML model that captures the interconnection between laser parameters and the engraved channel dimensions, one can predict the required input parameters to achieve a target channel geometry. This strategy significantly reduces the cost and effort of experimental laser machining during the development phase, without compromising accuracy or performance. The developed techniques can be applied to a wide range of ceramic laser machining processes.
    Fast Lossless Neural Compression with Integer-Only Discrete Flows. (arXiv:2206.08869v1 [cs.LG])
    By applying entropy codecs with learned data distributions, neural compressors have significantly outperformed traditional codecs in terms of compression ratio. However, the high inference latency of neural networks hinders the deployment of neural compressors in practical applications. In this work, we propose Integer-only Discrete Flows (IODF), an efficient neural compressor with integer-only arithmetic. Our work is built upon integer discrete flows, which consists of invertible transformations between discrete random variables. We propose efficient invertible transformations with integer-only arithmetic based on 8-bit quantization. Our invertible transformation is equipped with learnable binary gates to remove redundant filters during inference. We deploy IODF with TensorRT on GPUs, achieving 10x inference speedup compared to the fastest existing neural compressors, while retaining the high compression rates on ImageNet32 and ImageNet64.
    DFG-NAS: Deep and Flexible Graph Neural Architecture Search. (arXiv:2206.08582v1 [cs.LG])
    Graph neural networks (GNNs) have been intensively applied to various graph-based applications. Despite their success, manually designing well-behaved GNNs requires immense human expertise, making it inefficient to discover the potentially optimal data-specific GNN architecture. This paper proposes DFG-NAS, a new neural architecture search (NAS) method that enables the automatic search of very deep and flexible GNN architectures. Unlike most existing methods that focus on micro-architectures, DFG-NAS highlights another level of design: the search for macro-architectures governing how atomic propagation (P) and transformation (T) operations are integrated and organized into a GNN. To this end, DFG-NAS proposes a novel search space for P-T permutations and combinations based on message-passing disaggregation, defines four custom-designed macro-architecture mutations, and employs an evolutionary algorithm to conduct an efficient and effective search. Empirical studies on four node classification tasks demonstrate that DFG-NAS outperforms state-of-the-art manual designs and NAS methods for GNNs.
    Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection. (arXiv:2206.08726v1 [cs.SE])
    Code clones are pairs of code snippets that implement similar functionality. Clone detection is a fundamental branch of automatic source code comprehension with many applications in refactoring recommendation, plagiarism detection, and code summarization. A particularly interesting case of clone detection is the detection of semantic clones, i.e., code snippets that have the same functionality but differ significantly in implementation. A promising approach to detecting semantic clones is contrastive learning (CL), a machine learning paradigm popular in computer vision but not yet commonly adopted for code processing. Our work aims to evaluate the most popular CL algorithms combined with three source code representations on two tasks. The first task is code clone detection, which we evaluate on the POJ-104 dataset containing implementations of 104 algorithms. The second task is plagiarism detection. To evaluate the models on this task, we introduce CodeTransformator, a tool for transforming source code. We use it to create a dataset that mimics plagiarised code based on competitive programming solutions. We trained nine models for both tasks and compared them with six existing approaches, including traditional tools and modern pre-trained neural models. The results of our evaluation show that the proposed models perform diversely across tasks; however, the performance of the graph-based models is generally above the others. Among CL algorithms, SimCLR and SwAV lead to better results, while Moco is the most robust approach. Our code and trained models are available at https://doi.org/10.5281/zenodo.6360627, https://doi.org/10.5281/zenodo.5596345.
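    For readers unfamiliar with the CL objectives compared here, the following is a minimal sketch of the SimCLR-style NT-Xent loss over a batch of paired views; the random tensors stand in for embeddings of two augmented views of the same code snippets.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """SimCLR contrastive loss over 2N paired embeddings."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / temperature                        # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # view i is positive with view i+n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)        # placeholder views
print(nt_xent(z1, z2))
```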
    Federated learning with incremental clustering for heterogeneous data. (arXiv:2206.08752v1 [cs.LG])
    Federated learning enables different parties to collaboratively build a global model under the orchestration of a server while keeping the training data on clients' devices. However, performance is affected when clients have heterogeneous data. To cope with this problem, we assume that despite data heterogeneity, there are groups of clients who have similar data distributions that can be clustered. In previous approaches, in order to cluster clients, the server requires clients to send their parameters simultaneously. However, this can be problematic in a context where there is a significant number of participants that may have limited availability. To prevent such a bottleneck, we propose FLIC (Federated Learning with Incremental Clustering), in which the server exploits the updates sent by clients during federated training instead of asking them to send their parameters simultaneously. Hence no additional communications between the server and the clients are necessary beyond what classical federated learning requires. We empirically demonstrate for various non-IID cases that our approach successfully splits clients into groups following the same data distributions. We also identify the limitations of FLIC by studying its capability to efficiently partition clients at the early stages of the federated learning process. We further address attacks on models as a form of data heterogeneity and empirically show that FLIC is a robust defense against poisoning attacks even when the proportion of malicious clients is higher than 50%.
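    The clustering step can be pictured with a small sketch: normalize the update vectors the clients already send, then group them by direction. This illustrates the idea on synthetic updates only; it is not the FLIC algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# two synthetic groups of clients whose updates point in similar directions
updates = np.vstack([rng.normal(1.0, 0.1, size=(5, 20)),
                     rng.normal(-1.0, 0.1, size=(5, 20))])

# cluster direction-normalized updates (approximate cosine clustering)
labels = KMeans(n_clusters=2, n_init=10,
                random_state=0).fit_predict(normalize(updates))
print(labels)
```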
    Fast Population-Based Reinforcement Learning on a Single Machine. (arXiv:2206.08888v1 [cs.LG])
    Training populations of agents has demonstrated great promise in Reinforcement Learning for stabilizing training, improving exploration and asymptotic performance, and generating a diverse set of solutions. However, population-based training is often not considered by practitioners as it is perceived to be either prohibitively slow (when implemented sequentially), or computationally expensive (if agents are trained in parallel on independent accelerators). In this work, we compare implementations and revisit previous studies to show that the judicious use of compilation and vectorization allows population-based training to be performed on a single machine with one accelerator with minimal overhead compared to training a single agent. We also show that, when provided with a few accelerators, our protocols extend to large population sizes for applications such as hyperparameter tuning. We hope that this work and the public release of our code will encourage practitioners to use population-based learning more frequently for their research and applications.
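    The core trick, compiling and vectorizing over the population dimension, can be sketched in a few lines of JAX. The toy quadratic objective below is a placeholder for an agent's training loss; jax.vmap steps all agents in lockstep on one accelerator.

```python
import jax
import jax.numpy as jnp

def loss(params, x):
    # placeholder per-agent objective, standing in for an RL loss
    return jnp.sum((x @ params - 1.0) ** 2)

@jax.jit
def population_step(pop_params, x, lr=0.005):
    grads = jax.vmap(jax.grad(loss), in_axes=(0, None))(pop_params, x)
    return pop_params - lr * grads

pop = jax.random.normal(jax.random.PRNGKey(0), (16, 8))    # 16 agents
x = jax.random.normal(jax.random.PRNGKey(1), (32, 8))
for _ in range(200):
    pop = population_step(pop, x)
print(jax.vmap(loss, in_axes=(0, None))(pop, x))           # per-agent losses
```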
    Fast Finite Width Neural Tangent Kernel. (arXiv:2206.08720v1 [cs.LG])
    The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.
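    Concretely, the definition above is just a product of parameter Jacobians. A naive (and deliberately inefficient) JAX sketch for a tiny two-layer network is shown below; this is the baseline computation the paper's algorithms accelerate, not their method.

```python
import jax
import jax.numpy as jnp

def f(params, x):
    W1, b1, W2, b2 = params
    h = jnp.tanh(x @ W1 + b1)
    return h @ W2 + b2                       # output of dimension 2

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
params = (jax.random.normal(k1, (4, 16)), jnp.zeros(16),
          jax.random.normal(k2, (16, 2)), jnp.zeros(2))

def flat_jacobian(x):
    jac = jax.jacobian(f)(params, x)         # Jacobian w.r.t. each parameter
    leaves = jax.tree_util.tree_leaves(jac)
    return jnp.concatenate([l.reshape(2, -1) for l in leaves], axis=1)

x1, x2 = jnp.ones(4), 2.0 * jnp.ones(4)
ntk_block = flat_jacobian(x1) @ flat_jacobian(x2).T   # Theta(x1, x2)
print(ntk_block)
```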
    TUSK: Task-Agnostic Unsupervised Keypoints. (arXiv:2206.08460v1 [cs.CV])
    Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method, an issue that is never discussed or evaluated. We thus propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances.
    TKIL: Tangent Kernel Approach for Class Balanced Incremental Learning. (arXiv:2206.08492v1 [cs.LG])
    When learning new tasks in a sequential manner, deep neural networks tend to forget tasks that they previously learned, a phenomenon called catastrophic forgetting. Class incremental learning methods aim to address this problem by keeping a memory of a few exemplars from previously learned tasks and distilling knowledge from them. However, existing methods struggle to balance performance across classes, since they typically overfit the model to the latest task. In our work, we propose to address these challenges with a novel methodology, Tangent Kernel for Incremental Learning (TKIL), that achieves class-balanced performance. The approach preserves the representations across classes and balances the accuracy for each class, and as such achieves better overall accuracy with lower variance. TKIL is based on the Neural Tangent Kernel (NTK), which describes the convergence behavior of neural networks as a kernel function in the limit of infinite width. In TKIL, the gradients between feature layers are treated as the distance between the representations of these layers, giving a Gradients Tangent Kernel loss (GTK loss) that is minimized while the weights are averaged. This allows TKIL to automatically identify the task and to quickly adapt to it during inference. Experiments on CIFAR-100 and ImageNet datasets with various incremental learning settings show that these strategies allow TKIL to outperform existing state-of-the-art methods.
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v2 [stat.ML] UPDATED)
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.
    Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks. (arXiv:2206.08465v1 [stat.ML])
    Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results by the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM on the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density to go to zero as long as the average expected degrees of rows and columns go to infinity.
    Reframed GES with a Neural Conditional Dependence Measure. (arXiv:2206.08531v1 [stat.ML])
    In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v1 [stat.ML])
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting. FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets. Parameter efficient FiLM layers are used to modulate the backbone, shaping the representation for the downstream task. The network is trained via an episodic fine-tuning protocol. The approach is parameter efficient, which is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm in the low-shot regime and on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency of FiT in distributed low-shot applications including model personalization and federated learning, where model update size is an important performance metric.
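    A FiLM layer itself is tiny, which is where the parameter efficiency comes from. Below is a generic PyTorch sketch of per-channel FiLM modulation; it illustrates the mechanism only, not FiT's exact configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Per-channel scale-and-shift; the only trainable task parameters."""
    def __init__(self, num_channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_channels))   # scale
        self.beta = nn.Parameter(torch.zeros(num_channels))   # shift

    def forward(self, x):                    # x: (batch, channels, H, W)
        return (self.gamma.view(1, -1, 1, 1) * x
                + self.beta.view(1, -1, 1, 1))

film = FiLM(64)
print(film(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```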
    A Parametric Class of Approximate Gradient Updates for Policy Optimization. (arXiv:2206.08499v1 [cs.LG])
    Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g. value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
    Holistic Transformer: A Joint Neural Network for Trajectory Prediction and Decision-Making of Autonomous Vehicles. (arXiv:2206.08809v1 [cs.LG])
    Trajectory prediction and behavioral decision-making are two important tasks for autonomous vehicles that require a good understanding of the environmental context; behavioral decisions are better made by referring to the outputs of trajectory predictions. However, most current solutions perform these two tasks separately. Therefore, a joint neural network that combines multiple cues, named the holistic transformer, is proposed to predict trajectories and make behavioral decisions simultaneously. To better explore the intrinsic relationships between cues, the network uses existing knowledge and adopts three kinds of attention mechanisms: the sparse multi-head type for reducing noise impact, the feature selection sparse type for optimally using partial prior knowledge, and the multi-head type with sigmoid activation for optimally using posterior knowledge. Compared with other trajectory prediction models, the proposed model has better comprehensive performance and good interpretability. Perceptual noise robustness experiments demonstrate that the proposed model has good noise robustness. Thus, simultaneous trajectory prediction and behavioral decision-making combining multiple cues can reduce computational costs and enhance semantic relationships between scenes and agents.
    A Theoretical Analysis on Independence-driven Importance Weighting for Covariate-shift Generalization. (arXiv:2111.02355v2 [cs.LG] UPDATED)
    Covariate-shift generalization, a typical case in out-of-distribution (OOD) generalization, requires a good performance on the unknown test distribution, which varies from the accessible training distribution in the form of covariate shift. Recently, independence-driven importance weighting algorithms in stable learning literature have shown empirical effectiveness to deal with covariate-shift generalization on several learning models, including regression algorithms and deep neural networks, while their theoretical analyses are missing. In this paper, we theoretically prove the effectiveness of such algorithms by explaining them as feature selection processes. We first specify a set of variables, named minimal stable variable set, that is the minimal and optimal set of variables to deal with covariate-shift generalization for common loss functions, such as the mean squared loss and binary cross-entropy loss. Afterward, we prove that under ideal conditions, independence-driven importance weighting algorithms could identify the variables in this set. Analysis of asymptotic properties is also provided. These theories are further validated in several synthetic experiments.
    Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity. (arXiv:2006.04429v3 [math.OC] UPDATED)
    We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.
    Online Algorithms with Multiple Predictions. (arXiv:2205.03921v2 [cs.LG] UPDATED)
    This paper studies online algorithms augmented with multiple machine-learned predictions. While online algorithms augmented with a single prediction have been extensively studied in recent years, the literature for the multiple predictions setting is sparse. In this paper, we give a generic algorithmic framework for online covering problems with multiple predictions that obtains an online solution that is competitive against the performance of the best predictor. Our algorithm incorporates the use of predictions in the classic potential-based analysis of online algorithms. We apply our algorithmic framework to solve classical problems such as online set cover, (weighted) caching, and online facility location in the multiple predictions setting. Our algorithm can also be robustified, i.e., the algorithm can be simultaneously made competitive against the best prediction and the performance of the best online algorithm (without prediction).
    Near-Optimal No-Regret Learning for General Convex Games. (arXiv:2206.08742v1 [cs.GT])
    A recent line of work has established uncoupled learning dynamics such that, when employed by all players in a game, each player's regret after $T$ repetitions grows polylogarithmically in $T$, an exponential improvement over the traditional guarantees within the no-regret framework. However, so far these results have been limited to certain classes of games with structured strategy spaces -- such as normal-form and extensive-form games. Whether $O(\text{polylog}\, T)$ regret bounds can be obtained for general convex and compact strategy sets -- which occur in many fundamental models in economics and multiagent systems -- while retaining efficient strategy updates is an important open question. In this paper, we answer this in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general convex games, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets. Our learning dynamics are based on an instantiation of optimistic follow-the-regularized-leader over an appropriately lifted space using a self-concordant regularizer that is, peculiarly, not a barrier for the feasible region. Further, our learning dynamics are efficiently implementable given access to a proximal oracle for the convex strategy set, leading to $O(\log\log T)$ per-iteration complexity; we also give extensions when access to only a linear optimization oracle is assumed. Finally, we adapt our dynamics to guarantee $O(\sqrt{T})$ regret in the adversarial regime. Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret bounds either in terms of the dependence on the number of iterations or on the dimension of the strategy sets.
    Random projections and Kernelised Leave One Cluster Out Cross-Validation: Universal baselines and evaluation tools for supervised machine learning for materials properties. (arXiv:2206.08841v1 [cs.LG])
    With machine learning a popular topic in the current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance -- and the performance of the algorithms that they are used with -- is non-trivial. With many materials datasets containing bias and skew caused by the research process, leave-one-cluster-out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on the LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability of chemical datasets in all 10 datasets tested, and we provide a framework for the application of this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.
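    The recommended paradigm composes pieces that are all standard in scikit-learn: cluster the data, approximate the kernel feature map, and hold out one cluster at a time. Here is a hedged sketch on synthetic stand-ins for composition features; the featurization and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))               # stand-in compound features
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
model = make_pipeline(RBFSampler(gamma=0.1, random_state=0), Ridge())
scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut())
print(scores)                                # one score per held-out cluster
```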
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v3 [cs.LG] UPDATED)
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v2 [stat.ML] UPDATED)
    The problem of learning functions over spaces of probabilities -- or distribution regression -- is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lift kernel-induced similarity on the input domain to the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities that may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such a representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and comparing it with MMD-based estimators.
    Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering. (arXiv:2204.09634v2 [cs.SD] UPDATED)
    Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question to generate a desirable natural language answer. In this paper, we introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset. For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk. The questions and answers are produced by different annotators. Out of the six questions for each audio file, two each are designed to have 'yes' and 'no' as answers, while the remaining two questions have other single-word answers. For each question, we collect answers from three different annotators. We also present two baseline experiments to describe the usage of our dataset for the AQA task -- an LSTM-based multimodal binary classifier for 'yes' or 'no' type answers and an LSTM-based multimodal multi-class classifier for 828 single-word answers. The binary classifier achieved an accuracy of 62.7% and the multi-class classifier achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA dataset is freely available online at https://zenodo.org/record/6473207.
    MET: Masked Encoding for Tabular Data. (arXiv:2206.08564v1 [cs.LG])
    We consider the task of self-supervised representation learning (SSL) for tabular data: tabular-SSL. Typical contrastive learning based SSL methods require instance-wise data augmentations, which are difficult to design for unstructured tabular data. Existing tabular-SSL methods design such augmentations in a relatively ad-hoc fashion and can fail to capture the underlying data manifold. Instead of augmentation-based approaches for tabular-SSL, we propose a new reconstruction-based method, called Masked Encoding for Tabular Data (MET), that does not require augmentations. MET is based on the popular MAE approach for vision-SSL [He et al., 2021] and uses two key ideas: (i) since each coordinate in a tabular dataset has a distinct meaning, separate representations are needed for all coordinates, and (ii) an adversarial reconstruction loss is used in addition to the standard one. Empirical results on five diverse tabular datasets show that MET achieves a new state of the art (SOTA) on all of these datasets and improves by up to 9% over current SOTA methods. We shed more light on the workings of MET via experiments on carefully designed simple datasets.
    How robust are pre-trained models to distribution shift?. (arXiv:2206.08871v1 [cs.LG])
    The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight into how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based (AE) models. In this work, we shed light on this by evaluating the performance of these models on both real-world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.
    On Testability of the Front-Door Model via Verma Constraints. (arXiv:2203.00161v2 [stat.ME] UPDATED)
    The front-door criterion can be used to identify and compute causal effects despite the existence of unmeasured confounders between a treatment and outcome. However, the key assumptions -- (i) the existence of a variable (or set of variables) that fully mediates the effect of the treatment on the outcome, and (ii) which simultaneously does not suffer from similar issues of confounding as the treatment-outcome pair -- are often deemed implausible. This paper explores the testability of these assumptions. We show that under mild conditions involving an auxiliary variable, the assumptions encoded in the front-door model (and simple extensions of it) may be tested via generalized equality constraints a.k.a Verma constraints. We propose two goodness-of-fit tests based on this observation, and evaluate the efficacy of our proposal on real and synthetic data. We also provide theoretical and empirical comparisons to instrumental variable approaches to handling unmeasured confounding.
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v2 [cs.IR] UPDATED)
    Analysis of short text, such as social media posts, is extremely difficult because of their inherent brevity. In addition to classifying topics of such posts, a common downstream task is grouping the authors of these documents for subsequent analyses. We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as, or better than, traditional approaches, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology. We also develop a novel measure of echo chambers among these politicians by characterizing the insularity of topics discussed by groups of Senators, and we provide uncertainty quantification.
    Unsolved Problems in ML Safety. (arXiv:2109.13916v5 [cs.LG] UPDATED)
    Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), reducing inherent model hazards ("Alignment"), and reducing systemic hazards ("Systemic Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.
    Out-of-Distribution Detection with Deep Nearest Neighbors. (arXiv:2204.06507v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection is a critical task for deploying machine learning models in the open world. Distance-based methods have demonstrated promise, where testing samples are detected as OOD if they are relatively far away from in-distribution (ID) data. However, prior methods impose a strong distributional assumption of the underlying feature space, which may not always hold. In this paper, we explore the efficacy of non-parametric nearest-neighbor distance for OOD detection, which has been largely overlooked in the literature. Unlike prior works, our method does not impose any distributional assumption, hence providing stronger flexibility and generality. We demonstrate the effectiveness of nearest-neighbor-based OOD detection on several benchmarks and establish superior performance. Under the same model trained on ImageNet-1k, our method substantially reduces the false positive rate (FPR@TPR95) by 24.77% compared to a strong baseline SSD+, which uses a parametric approach Mahalanobis distance in detection. Code is available: https://github.com/deeplearning-wisc/knn-ood.
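    The detector itself can be written in a few lines; the sketch below scores a test point by its distance to the k-th nearest in-distribution feature (the paper additionally normalizes features from a trained network, which is omitted here).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
id_features = rng.normal(0, 1, size=(1000, 32))   # in-distribution bank
knn = NearestNeighbors(n_neighbors=10).fit(id_features)

def ood_score(x):
    """k-th nearest-neighbor distance; larger means more likely OOD."""
    dists, _ = knn.kneighbors(x.reshape(1, -1))
    return dists[0, -1]

print(ood_score(rng.normal(0, 1, 32)))   # near the ID data: small score
print(ood_score(rng.normal(5, 1, 32)))   # far from the ID data: large score
```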
    Importance Sampling Placement in Off-Policy Temporal-Difference Methods. (arXiv:2203.10172v2 [cs.LG] UPDATED)
    A central challenge in applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being executed. To account for the difference, importance sampling ratios are often used, but these can increase variance in the algorithms and reduce the rate of learning. Several variations of importance sampling have been proposed to reduce variance, with per-decision importance sampling being the most popular. However, the update rules for most off-policy algorithms in the literature depart from per-decision importance sampling in a subtle way: they correct the entire TD error instead of just the TD target. In this work, we show how this slight change can be interpreted as a control variate for the TD target, reducing variance and improving performance. Experiments over a wide range of algorithms show this subtle modification results in improved performance.
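    The distinction is easiest to see side by side. Below is a toy tabular TD(0) sketch contrasting the common update, which scales the whole TD error by the importance ratio, with the per-decision form that corrects only the TD target; the function names are ours, not the paper's.

```python
def td_update_full_error(v, s, r, s_next, rho, alpha=0.1, gamma=0.99):
    """Common form: the importance ratio scales the entire TD error."""
    td_error = r + gamma * v[s_next] - v[s]
    v[s] += alpha * rho * td_error

def td_update_target_only(v, s, r, s_next, rho, alpha=0.1, gamma=0.99):
    """Per-decision form: the ratio corrects only the TD target."""
    corrected_target = rho * (r + gamma * v[s_next])
    v[s] += alpha * (corrected_target - v[s])

v = [0.0] * 4                     # tabular value estimates
td_update_full_error(v, s=0, r=1.0, s_next=1, rho=0.8)
td_update_target_only(v, s=2, r=1.0, s_next=3, rho=0.8)
print(v)
```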
    Modelling Evolutionary and Stationary User Preferences for Temporal Sets Prediction. (arXiv:2204.05490v4 [cs.LG] UPDATED)
    Given a sequence of sets, where each set is associated with a timestamp and contains an arbitrary number of elements, the task of temporal sets prediction aims to predict the elements in the subsequent set. Previous studies for temporal sets prediction mainly capture each user's evolutionary preference by learning from his/her own sequence. Although insightful, we argue that: 1) the collaborative signals latent in different users' sequences are essential but have not been exploited; 2) users also tend to show stationary preferences, which existing methods fail to consider. To this end, we propose an integrated learning framework to model both the evolutionary and the stationary preferences of users for temporal sets prediction, which first constructs a universal sequence by chronologically arranging all the user-set interactions, and then learns on each user-set interaction. In particular, for each user-set interaction, we first design an evolutionary user preference modelling component to track the user's time-evolving preference and exploit the latent collaborative signals among different users. This component maintains a memory bank to store memories of the related user and elements, and continuously updates their memories based on the currently encoded messages and the past memories. Then, we devise a stationary user preference modelling module to discover each user's personalized characteristics according to the historical sequence, which adaptively aggregates the previously interacted elements from dual perspectives with the guidance of the user's and elements' embeddings. Finally, we develop a set-batch algorithm to improve the model efficiency, which can create time-consistent batches in advance and achieve 3.5x training speedups on average. Experiments on real-world datasets demonstrate the effectiveness and good interpretability of our approach.
    Representational Multiplicity Should Be Exposed, Not Eliminated. (arXiv:2206.08890v1 [cs.LG])
    It is prevalent and well-observed, but poorly understood, that two machine learning models with similar performance during training can have very different real-world performance characteristics. This implies elusive differences in the internals of the models, manifesting as representational multiplicity (RM). We introduce a conceptual and experimental setup for analyzing RM and show that certain training methods systematically result in greater RM than others, measured by activation similarity via singular vector canonical correlation analysis (SVCCA). We further correlate it with predictive multiplicity measured by the variance in i.i.d. and out-of-distribution test set predictions, in four common image data sets. We call for systematic measurement and maximal exposure, not elimination, of RM in models. Qualitative tools such as our confabulator analysis can facilitate understanding and communication of RM effects to stakeholders.
    Scaling multi-species occupancy models to large citizen science datasets. (arXiv:2206.08894v1 [stat.AP])
    Citizen science datasets can be very large and promise to improve species distribution modelling, but detection is imperfect, risking bias when fitting models. In particular, observers may not detect species that are actually present. Occupancy models can estimate and correct for this observation process, and multi-species occupancy models exploit similarities in the observation process, which can improve estimates for rare species. However, the computational methods currently used to fit these models do not scale to large datasets. We develop approximate Bayesian inference methods and use graphics processing units (GPUs) to scale multi-species occupancy models to very large citizen science data. We fit multi-species occupancy models to one month of data from the eBird project consisting of 186,811 checklist records comprising 430 bird species. We evaluate the predictions on a spatially separated test set of 59,338 records, comparing two different inference methods -- Markov chain Monte Carlo (MCMC) and variational inference (VI) -- to occupancy models fitted to each species separately using maximum likelihood. We fitted models to the entire dataset using VI, and up to 32,000 records with MCMC. VI fitted to the entire dataset performed best, outperforming single-species models on both AUC (90.4% compared to 88.7%) and on log likelihood (-0.080 compared to -0.085). We also evaluate how well range maps predicted by the model agree with expert maps. We find that modelling the detection process greatly improves agreement and that the resulting maps agree as closely with expert maps as ones estimated using high quality survey data. Our results demonstrate that multi-species occupancy models are a compelling approach to model large citizen science datasets, and that, once the observation process is taken into account, they can model species distributions accurately.
    Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark. (arXiv:2202.06767v3 [cs.CV] UPDATED)
    Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate the VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques into VLP such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and a benchmarking of different downstream tasks including a new largest human-verified image-text test dataset are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. For the zero-shot image classification task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. For the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than WenLan 2.0. Our Wukong models are also benchmarked on downstream tasks against other variants on multiple datasets, e.g., Flickr8K-CN, Flickr-30K-CN, COCO-CN, etc. More information is available at: https://wukong-dataset.github.io/wukong-dataset/.
    A Sparsity-promoting Dictionary Model for Variational Autoencoders. (arXiv:2203.15758v2 [cs.LG] UPDATED)
    Structuring the latent space in probabilistic deep generative models, e.g., variational autoencoders (VAEs), is important to yield more expressive models and interpretable representations, and to avoid overfitting. One way to achieve this objective is to impose a sparsity constraint on the latent variables, e.g., via a Laplace prior. However, such approaches usually complicate the training phase, and they sacrifice the reconstruction quality to promote sparsity. In this paper, we propose a simple yet effective methodology to structure the latent space via a sparsity-promoting dictionary model, which assumes that each latent code can be written as a sparse linear combination of a dictionary's columns. In particular, we leverage a computationally efficient and tuning-free method, which relies on a zero-mean Gaussian latent prior with learnable variances. We derive a variational inference scheme to train the model. Experiments on speech generative modeling demonstrate the advantage of the proposed approach over competing techniques, since it promotes sparsity while not deteriorating the output speech quality.
    Sketching Algorithms and Lower Bounds for Ridge Regression. (arXiv:2204.06653v2 [cs.DS] UPDATED)
    We give a sketching-based iterative algorithm that computes a $1+\varepsilon$ approximate solution for the ridge regression problem $\min_x \|Ax-b\|_2^2 +\lambda\|x\|_2^2$ where $A \in \mathbb{R}^{n \times d}$ with $d \ge n$. Our algorithm, for a constant number of iterations (requiring a constant number of passes over the input), improves upon earlier work (Chowdhury et al.) by requiring that the sketching matrix only has a weaker Approximate Matrix Multiplication (AMM) guarantee that depends on $\varepsilon$, along with a constant subspace embedding guarantee. The earlier work instead requires that the sketching matrix has a subspace embedding guarantee that depends on $\varepsilon$. For example, to produce a $1+\varepsilon$ approximate solution in $1$ iteration, which requires $2$ passes over the input, our algorithm requires the OSNAP embedding to have $m= O(n\sigma^2/\lambda\varepsilon)$ rows with a sparsity parameter $s = O(\log(n))$, whereas the earlier algorithm of Chowdhury et al. with the same number of rows of OSNAP requires a sparsity $s = O(\sqrt{\sigma^2/\lambda\varepsilon} \cdot \log(n))$, where $\sigma = \|A\|_2$ is the spectral norm of the matrix $A$. We also show that this algorithm can be used to give faster algorithms for kernel ridge regression. Finally, we show that the sketch size required for our algorithm is essentially optimal for a natural framework of algorithms for ridge regression by proving lower bounds on oblivious sketching matrices for AMM. The sketch size lower bounds for AMM may be of independent interest.
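    For intuition on sketch-and-solve ridge regression in the $d \ge n$ regime, here is a simplified NumPy sketch that restricts the solution to the row space of a dense Gaussian sketch; the paper's algorithm uses sparse OSNAP embeddings and iterative refinement, neither of which is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 50, 500, 200, 1.0          # d >= n; m is the sketch size
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

S = rng.normal(size=(m, d)) / np.sqrt(m)  # dense Gaussian sketching matrix
AS = A @ S.T                              # sketched design, (n, m)
# solve the m-dimensional ridge problem, then map back through S
z = np.linalg.solve(AS.T @ AS + lam * np.eye(m), AS.T @ b)
x_approx = S.T @ z
print("residual norm:", np.linalg.norm(A @ x_approx - b))
```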
    EGRU: Event-based GRU for activity-sparse inference and learning. (arXiv:2206.06178v1 [cs.LG] CROSS LISTED)
    The scalability of recurrent neural networks (RNNs) is hindered by the sequential dependence of each time step's computation on the previous time step's output. Therefore, one way to speed up and scale RNNs is to reduce the computation required at each time step independent of model size and task. In this paper, we propose a model that reformulates Gated Recurrent Units (GRU) as an event-based activity-sparse model that we call the Event-based GRU (EGRU), where units compute updates only on receipt of input events (event-based) from other units. When combined with having only a small fraction of the units active at a time (activity-sparse), this model has the potential to be vastly more compute efficient than current RNNs. Notably, activity-sparsity in our model also translates into sparse parameter updates during gradient descent, extending this compute efficiency to the training phase. We show that the EGRU demonstrates competitive performance compared to state-of-the-art recurrent network models in real-world tasks, including language modeling while maintaining high activity sparsity naturally during inference and training. This sets the stage for the next generation of recurrent networks that are scalable and more suitable for novel neuromorphic hardware.
    How You Start Matters for Generalization. (arXiv:2206.08558v1 [cs.LG])
    Characterizing the remarkable generalization properties of over-parameterized neural networks remains an open problem. In this paper, we promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization. Through a Fourier lens, we derive a general result for the spectral bias of neural networks and show that the generalization of neural networks is heavily tied to their initialization. Further, we empirically solidify the developed theoretical insights using practical, deep networks. Finally, we make a case against the controversial flat-minima conjecture and show that Fourier analysis grants a more reliable framework for understanding the generalization of neural networks.
    Grounded Language-Image Pre-training. (arXiv:2112.03857v2 [cs.CV] UPDATED)
    This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals a fully-supervised Dynamic Head. Code is released at https://github.com/microsoft/GLIP.
    BED: A Real-Time Object Detection System for Edge Devices. (arXiv:2202.07503v2 [cs.CV] UPDATED)
    Deploying deep neural networks (DNNs) on edge devices provides efficient and effective solutions for real-world tasks. Edge devices have been used for collecting large volumes of data efficiently in different domains, and DNNs have been an effective tool for data processing and analysis. However, designing DNNs for edge devices is challenging due to the limited computational resources and memory. To tackle this challenge, we demonstrate Object Detection System for Edge Devices (BED) on the MAX78000 DNN accelerator. It integrates on-device DNN inference with a camera and an LCD display for image acquisition and detection exhibition, respectively. BED is a concise, effective and detailed solution, including model training, quantization, synthesis and deployment. Experiment results indicate that BED can produce accurate detection with a 300-KB tiny DNN model, which takes only 91.9 ms of inference time and 1.845 mJ of energy.
    Stochastic Perturbations of Tabular Features for Non-Deterministic Inference with Automunge. (arXiv:2202.09248v2 [cs.LG] UPDATED)
    Injecting Gaussian noise into training features is well known to have regularization properties. This paper considers noise injections to numeric or categoric tabular features as passed to inference, which makes inference non-deterministic and may have relevance to fairness considerations, adversarial example protection, or other use cases benefiting from non-determinism. We offer the Automunge library for tabular preprocessing as a resource for the practice, which includes options to integrate random sampling or entropy seeding with the support of quantum circuits, representing a new way to channel quantum algorithms into classical learning.
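    The mechanism can be pictured with a minimal NumPy sketch: perturb a sampled subset of rows of the numeric inference features with Gaussian noise. The function and its parameters are illustrative, not Automunge's API.

```python
import numpy as np

def perturb_numeric(X, scale=0.03, ratio=0.5, rng=None):
    """Add Gaussian noise to a random subset of rows at inference time."""
    rng = rng or np.random.default_rng()
    X = X.astype(float).copy()
    mask = rng.random(X.shape[0]) < ratio      # rows selected for noise
    X[mask] += rng.normal(0.0, scale, size=X[mask].shape)
    return X

X_inference = np.array([[0.2, 1.5], [0.8, -0.3]])
print(perturb_numeric(X_inference, rng=np.random.default_rng(0)))
```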
    Deep Networks on Toroids: Removing Symmetries Reveals the Structure of Flat Regions in the Landscape Geometry. (arXiv:2202.03038v2 [cs.LG] UPDATED)
    We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness we study the mode connectivity and relative distances. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths composed of two straight lines in parameter space, i.e. polygonal chains with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.
    MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. (arXiv:2204.08582v2 [cs.CL] UPDATED)
    We present the MASSIVE dataset -- a Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation. MASSIVE contains 1M realistic, parallel, labeled virtual assistant utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. MASSIVE was created by tasking professional translators to localize the English-only SLURP dataset into 50 typologically diverse languages from 29 genera. We also present modeling results on XLM-R and mT5, including exact match accuracy, intent classification accuracy, and slot-filling F1 score. We have released our dataset, modeling code, and models publicly.
    Toward Compositional Generalization in Object-Oriented World Modeling. (arXiv:2204.13661v2 [cs.LG] UPDATED)
    Compositional generalization is a critical ability in learning and decision-making. We focus on the setting of reinforcement learning in object-oriented environments to study compositional generalization in world modeling. We (1) formalize the compositional generalization problem with an algebraic approach and (2) study how a world model can achieve that. We introduce a conceptual environment, Object Library, and two instances, and deploy a principled pipeline to measure the generalization ability. Motivated by the formulation, we analyze several methods with exact or no compositional generalization ability using our framework, and design a differentiable approach, Homomorphic Object-oriented World Model (HOWM), that achieves soft but more efficient compositional generalization.
    Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction. (arXiv:2205.06672v2 [cs.CV] UPDATED)
    Classical multiple instance learning (MIL) methods are often based on the assumption that instances are independent and identically distributed, hence neglecting the potentially rich contextual information beyond individual entities. On the other hand, Transformers with global self-attention modules have been proposed to model the interdependencies among all instances. However, in this paper we ask: is global relation modeling using self-attention necessary, or can we appropriately restrict self-attention calculations to local regimes in large-scale whole slide images (WSIs)? We propose a general-purpose local attention graph-based Transformer for MIL (LA-MIL), introducing an inductive bias by explicitly contextualizing instances in adaptive local regimes of arbitrary size. Additionally, an efficiently adapted loss function enables our approach to learn expressive WSI embeddings for the joint analysis of multiple biomarkers. We demonstrate that LA-MIL achieves state-of-the-art results in mutation prediction for gastrointestinal cancer, outperforming existing models on important biomarkers such as microsatellite instability for colorectal cancer. Our findings suggest that local self-attention sufficiently models dependencies on par with global modules. Our LA-MIL implementation is available at https://github.com/agentdr1/LA_MIL.
Reinforcement Learning in Macroeconomic Policy Design: A New Frontier? (arXiv:2206.08781v1 [cs.LG])
Agent-based computational macroeconomics is a field with a rich academic history, yet one which has struggled to enter mainstream policy design toolboxes, plagued by the challenges associated with representing a complex and dynamic reality. The field of Reinforcement Learning (RL), too, has a rich history, and has recently been at the centre of several exponential developments. Modern RL implementations have been able to achieve unprecedented levels of sophistication, handling previously-unthinkable degrees of complexity. This review surveys the historical barriers of classical agent-based techniques in macroeconomic modelling, and contemplates whether recent developments in RL can overcome any of them.
    SaDe: Learning Models that Provably Satisfy Domain Constraints. (arXiv:2112.00552v3 [cs.LG] UPDATED)
    In many real world applications of machine learning, models have to meet certain domain-based requirements that can be expressed as constraints (e.g., safety-critical constraints in autonomous driving systems). Such constraints are often handled by including them in a regularization term, while learning a model. This approach, however, does not guarantee 100% satisfaction of the constraints: it only reduces violations of the constraints on the training set rather than ensuring that the predictions by the model will always adhere to them. In this paper, we present a framework for learning models that provably fulfil the constraints under all circumstances (i.e., also on unseen data). To achieve this, we cast learning as a maximum satisfiability problem, and solve it using a novel SaDe algorithm that combines constraint satisfaction with gradient descent. We compare our method against regularization based baselines on linear models and show that our method is capable of enforcing different types of domain constraints effectively on unseen data, without sacrificing predictive performance.
    Scalable Deep Reinforcement Learning Algorithms for Mean Field Games. (arXiv:2203.11973v2 [cs.LG] UPDATED)
Mean Field Games (MFGs) have been introduced to efficiently approximate games with very large populations of strategic agents. Recently, the question of learning equilibria in MFGs has gained momentum, particularly using model-free reinforcement learning (RL) methods. One limiting factor to further scale up using RL is that existing algorithms to solve MFGs require the mixing of approximated quantities such as strategies or $q$-values. This is far from trivial in the case of non-linear function approximators that enjoy good generalization properties, e.g., neural networks. We propose two methods to address this shortcoming. The first one learns a mixed strategy from distillation of historical data into a neural network and is applied to the Fictitious Play algorithm. The second one is an online mixing method based on regularization that does not require memorizing historical data or previous estimates. It is used to extend Online Mirror Descent. We demonstrate numerically that these methods efficiently enable the use of Deep RL algorithms to solve various MFGs. In addition, we show that these methods outperform SotA baselines from the literature.
    Diffusion-GAN: Training GANs with Diffusion. (arXiv:2206.02262v2 [cs.LG] UPDATED)
For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered a theoretically sound solution which, however, has not yet delivered on its promise in practice. This paper introduces Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, diffused from an observed or generated data point, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets shows that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvements over strong GAN baselines for synthesizing photo-realistic images.
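The noise-injection mechanism is easy to picture in code. Below is a minimal sketch of diffusion-based instance noise, assuming a DDPM-style forward chain; the schedule values, the horizon T, and the helper name `diffuse` are illustrative assumptions, not the paper's exact settings.

```python
import torch

T = 32                                  # current max diffusion length (adaptive in the paper)
betas = torch.linspace(1e-4, 0.02, T)   # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse(x):
    """Draw a random step t and return the noisy sample fed to the discriminator."""
    t = torch.randint(0, T, (x.shape[0],))
    a = alphas_bar[t].view(-1, *([1] * (x.dim() - 1)))
    noise = torch.randn_like(x)
    return a.sqrt() * x + (1.0 - a).sqrt() * noise, t

# Both real and generated batches pass through `diffuse` before the
# discriminator; generator gradients flow back through this chain.
```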
    A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks. (arXiv:2206.08514v1 [cs.LG])
Textual backdoor attacks are a kind of practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary could control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) The differences between real-world scenarios (e.g., releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, and thus requires specific evaluation protocols; (2) The evaluation metrics only consider whether the attacks can flip the models' predictions on poisoned samples and retain performance on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, then discuss their unique evaluation methodologies. On metrics, to completely evaluate poisoned samples, we use grammar error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks could serve as the cornerstones for future model development and evaluations.
    Adversarial Estimators. (arXiv:2204.10495v3 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.
    PDE-READ: Human-readable Partial Differential Equation Discovery using Deep Learning. (arXiv:2111.00998v5 [cs.LG] UPDATED)
PDE discovery shows promise for uncovering predictive models of complex physical systems but has difficulty when measurements are sparse and noisy. We introduce a new approach for PDE discovery that uses two Rational Neural Networks and a principled sparse regression algorithm to identify the hidden dynamics that govern a system's response. The first network learns the system response function, while the second learns a hidden PDE describing the system's evolution. We then use a parameter-free sparse regression algorithm to extract a human-readable form of the hidden PDE from the second network. We implement our approach in an open-source library called PDE-READ. Our approach successfully identifies the governing PDE in six benchmark examples. We demonstrate that our approach is robust to both sparsity and noise and therefore holds promise for application to real-world observational data.
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v2 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.
    abess: A Fast Best Subset Selection Library in Python and R. (arXiv:2110.09697v2 [stat.ML] UPDATED)
We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. Particularly, abess certifiably obtains the optimal solution in polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best group subset selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for conveniently integrating with scikit-learn, and it can be installed from the Python Package Index. In addition, a user-friendly R library is available at the Comprehensive R Archive Network. The source code is available at: https://github.com/abess-team/abess.
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v3 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained by interpolating a collectively trained global model with a local $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach in the case of binary classification, and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.
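The interpolation step at the heart of this personalization is compact enough to sketch. The following assumes a global model exposing `embed` and `predict_proba` interfaces (both hypothetical names); the number of neighbors `k` and the mixing weight `lam` are illustrative.

```python
import numpy as np

def personalized_predict(x, global_model, local_emb, local_labels,
                         num_classes, k=5, lam=0.5):
    """Interpolate the global model with a local kNN in embedding space."""
    z = global_model.embed(x)                    # shared representation (assumed API)
    p_global = global_model.predict_proba(x)     # global model probabilities (assumed API)
    dists = np.linalg.norm(local_emb - z, axis=1)
    nn = np.argsort(dists)[:k]                   # k nearest local examples
    p_knn = np.bincount(local_labels[nn], minlength=num_classes) / k
    return lam * p_global + (1.0 - lam) * p_knn  # interpolated prediction
```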
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v3 [stat.ML] UPDATED)
Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error-prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a. Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
    NeuralEF: Deconstructing Kernels by Deep Neural Networks. (arXiv:2205.00165v3 [cs.LG] UPDATED)
Learning the principal eigenfunctions of an integral operator defined by a kernel and a data distribution is at the core of many machine learning problems. Traditional nonparametric solutions based on the Nyström formula suffer from scalability issues. Recent work has resorted to a parametric approach, i.e., training neural networks to approximate the eigenfunctions. However, the existing method relies on an expensive orthogonalization step and is difficult to implement. We show that these problems can be fixed by using a new series of objective functions that generalizes the EigenGame (Gemp et al., 2020) to function space. We test our method on a variety of supervised and unsupervised learning problems and show it provides accurate approximations to the eigenfunctions of polynomial, radial basis, neural network Gaussian process, and neural tangent kernels. Finally, we demonstrate our method can scale up linearised Laplace approximation of deep neural networks to modern image classification datasets through approximating the Gauss-Newton matrix. Code is available at https://github.com/thudzj/neuraleigenfunction.
    Conditional GANs with Auxiliary Discriminative Classifier. (arXiv:2107.10060v5 [cs.LG] UPDATED)
    Conditional generative models aim to learn the underlying joint distribution of data and labels to achieve conditional data generation. Among them, the auxiliary classifier generative adversarial network (AC-GAN) has been widely used, but suffers from the problem of low intra-class diversity of the generated samples. The fundamental reason pointed out in this paper is that the classifier of AC-GAN is generator-agnostic, which therefore cannot provide informative guidance for the generator to approach the joint distribution, resulting in a minimization of the conditional entropy that decreases the intra-class diversity. Motivated by this understanding, we propose a novel conditional GAN with an auxiliary discriminative classifier (ADC-GAN) to resolve the above problem. Specifically, the proposed auxiliary discriminative classifier becomes generator-aware by recognizing the class-labels of the real data and the generated data discriminatively. Our theoretical analysis reveals that the generator can faithfully learn the joint distribution even without the original discriminator, making the proposed ADC-GAN robust to the value of the coefficient hyperparameter and the selection of the GAN loss, and stable during training. Extensive experimental results on synthetic and real-world datasets demonstrate the superiority of ADC-GAN in conditional generative modeling compared to state-of-the-art classifier-based and projection-based conditional GANs.
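A rough sketch of the generator-aware classifier loss may help make this concrete. Treating generated samples of class c as label K + c in a single 2K-way classifier is my reading of "recognizing the class-labels of the real data and the generated data discriminatively"; the function shape below is an assumption, and the generator's own objective is not shown.

```python
import torch
import torch.nn.functional as F

def adc_classifier_loss(logits_real, logits_fake, y_real, y_fake, K):
    """2K-way discriminative classification: real classes in [0, K),
    fake classes in [K, 2K), so the classifier must separate real from
    fake per class rather than ignore the generator (as in AC-GAN)."""
    loss_real = F.cross_entropy(logits_real, y_real)      # labels in [0, K)
    loss_fake = F.cross_entropy(logits_fake, y_fake + K)  # labels in [K, 2K)
    return loss_real + loss_fake
```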
    Generative Coarse-Graining of Molecular Conformations. (arXiv:2201.12176v2 [cs.LG] UPDATED)
Coarse-graining (CG) of molecular simulations simplifies the particle representation by grouping selected atoms into pseudo-beads and drastically accelerates simulation. However, such a CG procedure induces information losses, which makes accurate backmapping, i.e., restoring fine-grained (FG) coordinates from CG coordinates, a long-standing challenge. Inspired by the recent progress in generative models and equivariant networks, we propose a novel model that rigorously embeds the vital probabilistic nature and geometric consistency requirements of the backmapping transformation. Our model encodes the FG uncertainties into an invariant latent space and decodes them back to FG geometries via equivariant convolutions. To standardize the evaluation of this domain, we provide three comprehensive benchmarks based on molecular dynamics trajectories. Experiments show that our approach consistently recovers more realistic structures and outperforms existing data-driven methods by a significant margin.
    ROCK: Causal Inference Principles for Reasoning about Commonsense Causality. (arXiv:2202.00436v2 [cs.CL] UPDATED)
Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies wholeheartedly on deep language models and is potentially susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural languages to adapt CCR to the potential-outcomes framework, which is the first such attempt for commonsense tasks. We propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision and balances confounding effects using temporal propensities that are analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capabilities.
    Structure-preserving GANs. (arXiv:2202.01129v2 [cs.LG] UPDATED)
Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fréchet Inception Distance -- especially in the small data regime.
    High-Speed Accurate Robot Control using Learned Forward Kinodynamics and Non-linear Least Squares Optimization. (arXiv:2206.08487v1 [cs.RO])
Accurate control of robots in the real world requires a control system that is capable of taking into account the kinodynamic interactions of the robot with its environment. At high speeds, the dependence of the movement of the robot on these kinodynamic interactions becomes more pronounced, making high-speed, accurate robot control a challenging problem. Previous work has shown that learning the inverse kinodynamics (IKD) of the robot can be helpful for high-speed robot control. However, a learned inverse kinodynamic model can only be applied to a limited class of control problems, and different control problems require the learning of a new IKD model. In this work we present a new formulation for accurate, high-speed robot control that makes use of a learned forward kinodynamic (FKD) model and non-linear least squares optimization. By nature of the formulation, this approach is extensible to a wide array of control problems without requiring the retraining of a new model. We demonstrate the ability of this approach to accurately control a one-tenth-scale robot car at high speeds, and show improved results over baselines.
    Tight query complexity bounds for learning graph partitions. (arXiv:2112.07897v2 [cs.LG] UPDATED)
    Given a partition of a graph into connected components, the membership oracle asserts whether any two vertices of the graph lie in the same component or not. We prove that for $n\ge k\ge 2$, learning the components of an $n$-vertex hidden graph with $k$ components requires at least $(k-1)n-\binom k2$ membership queries. Our result improves on the best known information-theoretic bound of $\Omega(n\log k)$ queries, and exactly matches the query complexity of the algorithm introduced by [Reyzin and Srivastava, 2007] for this problem. Additionally, we introduce an oracle, with access to which one can learn the number of components of $G$ in asymptotically fewer queries than learning the full partition, thus answering another question posed by the same authors. Lastly, we introduce a more applicable version of this oracle, and prove asymptotically tight bounds of $\widetilde\Theta(m)$ queries for both learning and verifying an $m$-edge hidden graph $G$ using it.
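The matching upper bound can be pictured with a simple representative-based procedure in the spirit of the Reyzin-Srivastava algorithm the paper refers to; `same(u, v)` below stands in for one call to the membership oracle, and the sketch is mine, not the paper's pseudocode.

```python
def learn_components(vertices, same):
    """Learn the partition using at most ~k queries per vertex."""
    reps = []                      # one representative per discovered component
    comp = {}                      # vertex -> component index
    for v in vertices:
        for i, r in enumerate(reps):
            if same(v, r):         # one membership query
                comp[v] = i
                break
        else:
            comp[v] = len(reps)    # no representative matched: new component
            reps.append(v)
    return comp

# Each vertex is queried against at most k representatives, giving O(kn)
# queries overall, of the same order as the (k-1)n - C(k,2) lower bound.
```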
    Deep learning, stochastic gradient descent and diffusion maps. (arXiv:2204.01365v3 [stat.ML] UPDATED)
Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss functions on the loss landscape of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper, we pursue a truly data-driven approach to gaining a potentially deeper understanding of the high-dimensional parameter surface, and in particular of the landscape traced out by SGD, by analyzing the data generated through SGD (or any other optimizer, for that matter) in order to discover possible (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration, we use diffusion maps introduced by R. Coifman and coauthors.
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v7 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.
    Learning to Hash Robustly, Guaranteed. (arXiv:2108.05433v4 [cs.DS] UPDATED)
The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Locality Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic approaches exist to "learn" the best indexing method in order to speed-up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets, but, almost always, come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries, or apply to datasets with an assumed extra structure/model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching those of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments that show that our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets.
    A Modern Self-Referential Weight Matrix That Learns to Modify Itself. (arXiv:2202.05780v2 [cs.LG] UPDATED)
    The weight matrix (WM) of a neural network (NN) is its program. The programs of many traditional NNs are learned through gradient descent in some error function, then remain fixed. The WM of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of recursive self-improvement. While NN architectures potentially capable of implementing such behaviour have been proposed since the '90s, there have been few if any practical studies. Here we revisit such NNs, building upon recent successes of fast weight programmers and closely related linear Transformers. We propose a scalable self-referential WM (SRWM) that learns to use outer products and the delta update rule to modify itself. We evaluate our SRWM in supervised few-shot learning and in multi-task reinforcement learning with procedurally generated game environments. Our experiments demonstrate both practical applicability and competitive performance of the proposed SRWM. Our code is public.
    Variational Nested Dropout. (arXiv:2101.11353v2 [cs.LG] UPDATED)
Nested dropout is a variant of the dropout operation that is able to order network parameters or features based on pre-defined importance during training. It has been explored for: I. Constructing nested nets: nested nets are neural networks whose architectures can be adjusted instantly during testing time, e.g., based on computational constraints. The nested dropout implicitly ranks the network parameters, generating a set of sub-networks such that any smaller sub-network forms the basis of a larger one. II. Learning ordered representations: the nested dropout applied to the latent representation of a generative model (e.g., an auto-encoder) ranks the features, enforcing an explicit order of the dense representation over dimensions. However, the dropout rate is fixed as a hyper-parameter during the whole training process. For nested nets, when network parameters are removed, the performance decays in a human-specified trajectory rather than in a trajectory learned from data. For generative models, the importance of features is specified as a constant vector, restraining the flexibility of representation learning. To address the problem, we focus on the probabilistic counterpart of the nested dropout. We propose a variational nested dropout (VND) operation that draws samples of multi-dimensional ordered masks at a low cost, providing useful gradients to the parameters of nested dropout. Based on this approach, we design a Bayesian nested neural network that learns the order knowledge of the parameter distributions. We further exploit the VND under different generative models for learning ordered latent distributions. In experiments, we show that the proposed approach outperforms the nested network in terms of accuracy, calibration, and out-of-domain detection in classification tasks. It also outperforms the related generative models on data generation tasks.
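The ordered-mask idea behind (non-variational) nested dropout is short enough to sketch: sample a truncation index b and keep only the first b units, so earlier units are trained more often and end up more important. The geometric prior below is the classic choice and an assumption here; VND's contribution is to replace this fixed distribution with a learnable, multi-dimensional one.

```python
import torch

def nested_dropout_mask(num_units, rho=0.05):
    """Sample an ordered mask: units 0..b-1 kept, the rest zeroed."""
    b = torch.distributions.Geometric(probs=rho).sample().long() + 1
    b = torch.clamp(b, max=num_units)
    mask = torch.zeros(num_units)
    mask[:b] = 1.0                 # prefix structure induces the ordering
    return mask
```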
    Fairness in Credit Scoring: Assessment, Implementation and Profit Implications. (arXiv:2103.01907v4 [stat.ML] UPDATED)
    The rise of algorithmic decision-making has spawned much research on fair machine learning (ML). Financial institutions use ML for building risk scorecards that support a range of credit-related decisions. Yet, the literature on fair ML in credit scoring is scarce. The paper makes three contributions. First, we revisit statistical fairness criteria and examine their adequacy for credit scoring. Second, we catalog algorithmic options for incorporating fairness goals in the ML model development pipeline. Last, we empirically compare different fairness processors in a profit-oriented credit scoring context using real-world data. The empirical results substantiate the evaluation of fairness measures, identify suitable options to implement fair credit scoring, and clarify the profit-fairness trade-off in lending decisions. We find that multiple fairness criteria can be approximately satisfied at once and recommend separation as a proper criterion for measuring the fairness of a scorecard. We also find fair in-processors to deliver a good balance between profit and fairness and show that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost. The codes corresponding to the paper are available on GitHub.
    Label-Descriptive Patterns and Their Application to Characterizing Classification Errors. (arXiv:2110.09599v3 [cs.LG] UPDATED)
State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless. Characterizing these errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors, but also gives a way to act and improve the classifier. We propose to discover those feature-value combinations (i.e., patterns) that strongly correlate with correct and erroneous predictions, respectively, to obtain a global and interpretable description for arbitrary classifiers. We show this is an instance of the more general label description problem, which we formulate in terms of the Minimum Description Length principle. To discover a good pattern set, we develop the efficient Premise algorithm. Through an extensive set of experiments we show it performs very well in practice on both synthetic and real-world data. Unlike existing solutions, it ably recovers ground truth patterns, even on highly imbalanced data over many features. Through two case studies on Visual Question Answering and Named Entity Recognition, we confirm that Premise gives clear and actionable insight into the systematic errors made by modern NLP classifiers.
    Anti-Money Laundering Alert Optimization Using Machine Learning with Graphs. (arXiv:2112.07508v3 [cs.LG] UPDATED)
Money laundering is a global problem that concerns the legitimization of proceeds from serious felonies (1.7-4 trillion euros annually), such as drug dealing, human trafficking, or corruption. The anti-money laundering systems deployed by financial institutions typically comprise rules aligned with regulatory frameworks. Human investigators review the alerts and report suspicious cases. Such systems suffer from high false-positive rates, undermining their effectiveness and resulting in high operational costs. We propose a machine learning triage model, which complements the rule-based system and learns to predict the risk of an alert accurately. Our model uses both entity-centric engineered features and attributes characterizing inter-entity relations in the form of graph-based features. We leverage time windows to construct the dynamic graph, optimizing for time and space efficiency. We validate our model on a real-world banking dataset and show how the triage model can reduce the number of false positives by 80% while detecting over 90% of true positives. In this way, our model can significantly improve anti-money laundering operations.
    Spectral CUSUM for Online Network Structure Change Detection. (arXiv:1910.09083v3 [math.ST] UPDATED)
    Detecting abrupt changes in the community structure of a network from noisy observations is a fundamental problem in statistics and machine learning. This paper presents an online change detection algorithm called Spectral-CUSUM to detect unknown network structure changes through a generalized likelihood ratio statistic. We characterize the average run length (ARL) and the expected detection delay (EDD) of the Spectral-CUSUM procedure and prove its asymptotic optimality. Finally, we demonstrate the good performance of the Spectral-CUSUM procedure and compare it with several baseline methods using simulations and real data examples on seismic event detection using sensor network data.
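The CUSUM recursion at the heart of such detectors is compact. The sketch below uses a scalar log-likelihood ratio for a known mean shift, purely as an illustration; Spectral-CUSUM replaces it with a spectral generalized likelihood ratio statistic over network observations.

```python
import numpy as np

def cusum(observations, llr, threshold):
    """Generic CUSUM: accumulate log-likelihood-ratio evidence, reset at zero."""
    s = 0.0
    for t, y in enumerate(observations):
        s = max(0.0, s + llr(y))
        if s > threshold:
            return t               # declare a change at time t
    return None

# Toy example: detect a shift from N(0,1) to N(1,1); llr(y) = y - 1/2.
llr_gauss = lambda y: y - 0.5
data = np.concatenate([np.random.randn(200), np.random.randn(100) + 1.0])
print(cusum(data, llr_gauss, threshold=10.0))
```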
    NAFS: A Simple yet Tough-to-beat Baseline for Graph Representation Learning. (arXiv:2206.08583v1 [cs.LG])
Recently, graph neural networks (GNNs) have shown prominent performance in graph representation learning by leveraging knowledge from both graph structure and node features. However, most of them have two major limitations. First, GNNs can learn higher-order structural information by stacking more layers but cannot deal with large depth due to the over-smoothing issue. Second, it is not easy to apply these methods on large graphs due to the expensive computation cost and high memory usage. In this paper, we present node-adaptive feature smoothing (NAFS), a simple non-parametric method that constructs node representations without parameter learning. NAFS first extracts the features of each node with its neighbors of different hops by feature smoothing, and then adaptively combines the smoothed features. Besides, the constructed node representations can further be enhanced by the ensemble of smoothed features extracted via different smoothing strategies. We conduct experiments on four benchmark datasets on two different application scenarios: node clustering and link prediction. Remarkably, NAFS with feature ensemble outperforms the state-of-the-art GNNs on these tasks and mitigates the aforementioned two limitations of most learning-based GNN counterparts.
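The parameter-free smoothing step can be sketched directly. The uniform hop averaging below is a deliberate simplification: NAFS combines the hops with node-adaptive weights, which are not reproduced here.

```python
import numpy as np

def smooth_features(adj, X, num_hops=4):
    """Multi-hop feature smoothing over a symmetrically normalized adjacency."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    A_hat = d_inv_sqrt @ adj @ d_inv_sqrt    # D^{-1/2} A D^{-1/2}
    hops, H = [X], X
    for _ in range(num_hops):
        H = A_hat @ H                        # one more hop of smoothing
        hops.append(H)
    return np.mean(hops, axis=0)             # NAFS uses adaptive per-node weights
```

No parameters are learned anywhere in this pipeline, which is what makes the method cheap on large graphs.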
SYMBA: Symbolic Computation of Squared Amplitudes in High Energy Physics with Machine Learning. (arXiv:2206.08901v1 [hep-ph])
The cross section is one of the most important physical quantities in high-energy physics and the most time-consuming to compute. While machine learning has proven to be highly successful in numerical calculations in high-energy physics, analytical calculations using machine learning are still in their infancy. In this work, we use a sequence-to-sequence transformer model to compute a key element of the cross section calculation, namely, the squared amplitude of an interaction. We show that a transformer model is able to correctly predict 89.0% and 99.4% of squared amplitudes of QCD and QED processes, respectively. We discuss the performance of the current model, its limitations and possible future directions for this work.
    Distinguishing rule- and exemplar-based generalization in learning systems. (arXiv:2110.04328v2 [cs.LG] UPDATED)
    Machine learning systems often do not share the same inductive biases as humans and, as a result, extrapolate or generalize in ways that are inconsistent with our expectations. The trade-off between exemplar- and rule-based generalization has been studied extensively in cognitive psychology; in this work, we present a protocol inspired by these experimental approaches to probe the inductive biases that control this tradeoff in category-learning systems. We isolate two such inductive biases: feature-level bias (differences in which features are more readily learned) and exemplar or rule bias (differences in how these learned features are used for generalization). We find that standard neural network models are feature-biased and exemplar-based, and discuss the implications of these findings for machine learning research on systematic generalization, fairness, and data augmentation.
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v6 [cs.LG] UPDATED)
Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiability, showing that the proposed model learned from observations recovers the true one up to a certain degree. Experiments are conducted on various datasets, including synthetic data and the real-world benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operations" on the causal factors.
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. (arXiv:2101.03961v3 [cs.LG] UPDATED)
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
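The simplified routing can be sketched in a few lines: each token is sent to a single expert chosen by a learned router, so parameter count grows with the number of experts while per-token compute stays constant. The sketch below is a bare-bones reading of top-1 ("switch") routing; expert capacity limits and the load-balancing auxiliary loss from the paper are omitted.

```python
import torch
import torch.nn as nn

class SwitchLayer(nn.Module):
    def __init__(self, d_model, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts))

    def forward(self, x):                      # x: (tokens, d_model)
        gates = torch.softmax(self.router(x), dim=-1)
        prob, idx = gates.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                # scale by the gate probability so the router gets gradients
                out[sel] = prob[sel].unsqueeze(1) * expert(x[sel])
        return out
```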
    Adversarial Attack and Defense for Non-Parametric Two-Sample Tests. (arXiv:2202.03077v2 [cs.LG] UPDATED)
    Non-parametric two-sample tests (TSTs) that judge whether two sets of samples are drawn from the same distribution, have been widely used in the analysis of critical data. People tend to employ TSTs as trusted basic tools and rarely have any doubt about their reliability. This paper systematically uncovers the failure mode of non-parametric TSTs through adversarial attacks and then proposes corresponding defense strategies. First, we theoretically show that an adversary can upper-bound the distributional shift which guarantees the attack's invisibility. Furthermore, we theoretically find that the adversary can also degrade the lower bound of a TST's test power, which enables us to iteratively minimize the test criterion in order to search for adversarial pairs. To enable TST-agnostic attacks, we propose an ensemble attack (EA) framework that jointly minimizes the different types of test criteria. Second, to robustify TSTs, we propose a max-min optimization that iteratively generates adversarial pairs to train the deep kernels. Extensive experiments on both simulated and real-world datasets validate the adversarial vulnerabilities of non-parametric TSTs and the effectiveness of our proposed defense. Source code is available at https://github.com/GodXuxilie/Robust-TST.git.
    MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. (arXiv:2206.08853v1 [cs.LG])
    Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
    Smoothing Policies and Safe Policy Gradients. (arXiv:1905.03231v2 [cs.LG] UPDATED)
    Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.
    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. (arXiv:2110.06256v2 [cs.LG] UPDATED)
    This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
    Boosting Factorization Machines via Saliency-Guided Mixup. (arXiv:2206.08661v1 [cs.IR])
Factorization machines (FMs) are widely used in recommender systems due to their adaptability and ability to learn from sparse data. However, for the ubiquitous non-interactive features in sparse data, existing FMs can only estimate the parameters corresponding to these features via the inner product of their embeddings. Undeniably, they cannot learn the direct interactions of these features, which limits the model's expressive power. To this end, we first present MixFM, inspired by Mixup, to generate auxiliary training data to boost FMs. Unlike existing augmentation strategies that require labor and expertise to collect additional information such as positions and fields, the extra data generated by MixFM come only from convex combinations of the raw samples, without any expert knowledge. More importantly, if the parent samples to be mixed have non-interactive features, MixFM will establish their direct interactions. Second, considering that MixFM may generate redundant or even detrimental instances, we further put forward a novel Factorization Machine powered by Saliency-guided Mixup (denoted as SMFM). Guided by the customized saliency, SMFM can generate more informative neighbor data. Through theoretical analysis, we prove that the proposed methods minimize the upper bound of the generalization error, which has a beneficial effect on enhancing FMs. Significantly, we give the first generalization bound for FMs, implying that generalization requires more data and a smaller embedding size, given sufficient representation capability. Finally, extensive experiments on five datasets confirm that our approaches are superior to baselines. Besides, the results show that "poisoning" mixed data is likewise beneficial to the FM variants.
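The Mixup-style augmentation behind MixFM is a one-liner in spirit: new training pairs are convex combinations of raw feature vectors and labels, so mixing two parents whose non-zero features differ creates direct interactions between them. A minimal sketch, with the Beta mixing parameter `alpha` as an illustrative choice:

```python
import numpy as np

def mixfm_augment(X, y, alpha=0.2, rng=np.random.default_rng()):
    """Mixup over raw (sparse) feature vectors and labels."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))
    X_mix = lam * X + (1.0 - lam) * X[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return X_mix, y_mix
# SMFM additionally guides the choice of mixing with a saliency measure (not shown).
```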
    MetaFed: Federated Learning among Federations with Cyclic Knowledge Distillation for Personalized Healthcare. (arXiv:2206.08516v1 [cs.LG])
Federated learning has attracted increasing attention to building models without accessing the raw user data, especially in healthcare. In real applications, different federations can seldom work together due to possible reasons such as data heterogeneity and distrust of or the absence of a central server. In this paper, we propose a novel framework called MetaFed to facilitate trustworthy FL between different federations. MetaFed obtains a personalized model for each federation without a central server via the proposed Cyclic Knowledge Distillation. Specifically, MetaFed treats each federation as a meta distribution and aggregates knowledge of each federation in a cyclic manner. The training is split into two parts: common knowledge accumulation and personalization. Comprehensive experiments on three benchmarks demonstrate that MetaFed without a server achieves better accuracy compared to state-of-the-art methods (e.g., 10%+ accuracy improvement compared to the baseline for PAMAP2) with lower communication costs.
    Boosting Graph Structure Learning with Dummy Nodes. (arXiv:2206.08561v1 [cs.LG])
With the development of graph kernels and graph representation learning, many superior methods have been proposed to handle scalability and oversmoothing issues in graph structure learning. However, most of those strategies are designed based on practical experience rather than theoretical analysis. In this paper, we use a particular dummy node connected to all existing vertices without affecting original vertex and edge properties. We further prove that such a dummy node can help build an efficient monomorphic edge-to-vertex transform and an epimorphic inverse to recover the original graph. It also indicates that adding dummy nodes can preserve local and global structures for better graph representation learning. We extend graph kernels and graph neural networks with dummy nodes and conduct experiments on graph classification and subgraph isomorphism matching tasks. Empirical results demonstrate that taking graphs with dummy nodes as input significantly boosts graph structure learning, and using their edge-to-vertex graphs can also achieve similar results. We also discuss the gain in expressive power from the dummy node in neural networks.
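The augmentation itself is a small adjacency-matrix operation: append one extra vertex connected to every existing vertex, leaving original edges untouched. A minimal sketch:

```python
import numpy as np

def add_dummy_node(adj):
    """Return the adjacency of the graph with one dummy node appended."""
    n = adj.shape[0]
    out = np.zeros((n + 1, n + 1), dtype=adj.dtype)
    out[:n, :n] = adj              # original graph preserved
    out[n, :n] = 1                 # dummy connects to all existing vertices
    out[:n, n] = 1
    return out
```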
    Neural Ensemble Search via Bayesian Sampling. (arXiv:2109.02533v2 [cs.LG] UPDATED)
Recently, neural architecture search (NAS) has been applied to automate the design of neural networks in real-world applications. A large number of algorithms have been developed to improve the search cost or the performance of the final selected architectures in NAS. Unfortunately, these NAS algorithms aim to select only one single well-performing architecture from their search spaces and thus have overlooked the capability of neural network ensemble (i.e., an ensemble of neural networks with diverse architectures) in achieving improved performance over a single final selected architecture. To this end, we introduce a novel neural ensemble search algorithm, called neural ensemble search via Bayesian sampling (NESBS), to effectively and efficiently select well-performing neural network ensembles from a NAS search space. In our extensive experiments, the NESBS algorithm achieves improved performance over state-of-the-art NAS algorithms while incurring a comparable search cost, indicating the superior performance of our NESBS algorithm over these NAS algorithms in practice.
    Popular decision tree algorithms are provably noise tolerant. (arXiv:2206.08899v1 [cs.LG])
    Using the framework of boosting, we prove that all impurity-based decision tree learning algorithms, including the classic ID3, C4.5, and CART, are highly noise tolerant. Our guarantees hold under the strongest noise model of nasty noise, and we provide near-matching upper and lower bounds on the allowable noise rate. We further show that these algorithms, which are simple and have long been central to everyday machine learning, enjoy provable guarantees in the noisy setting that are unmatched by existing algorithms in the theoretical literature on decision tree learning. Taken together, our results add to an ongoing line of research that seeks to place the empirical success of these practical decision tree algorithms on firm theoretical footing.
    Feature and Parameter Selection in Stochastic Linear Bandits. (arXiv:2106.05378v3 [cs.LG] UPDATED)
We study two model selection settings in stochastic linear bandits (LB). In the first setting, which we refer to as feature selection, the expected reward of the LB problem is in the linear span of at least one of $M$ feature maps (models). In the second setting, the reward parameter of the LB problem is arbitrarily selected from $M$ models represented as (possibly) overlapping balls in $\mathbb R^d$. However, the agent only has access to misspecified models, i.e., estimates of the centers and radii of the balls. We refer to this setting as parameter selection. For each setting, we develop and analyze a computationally efficient algorithm that is based on a reduction from bandits to full-information problems. This allows us to obtain regret bounds that are not worse (up to a $\sqrt{\log M}$ factor) than the case where the true model is known. This is the best-reported dependence on the number of models $M$ in these settings. Finally, we empirically show the effectiveness of our algorithms using synthetic and real-world experiments.
    Learngene: From Open-World to Your Learning Task. (arXiv:2106.06788v3 [cs.LG] UPDATED)
Although deep learning has made significant progress on fixed large-scale datasets, it typically struggles to detect unknown/unseen classes in the open-world scenario, is over-parameterized, and overfits small samples. Since biological systems overcome these difficulties very well, individuals inherit an innate gene from collective creatures that have evolved over hundreds of millions of years and then learn new skills from a few examples. Inspired by this, we propose a practical collective-individual paradigm where an evolution (expandable) network is trained on sequential tasks and then recognizes unknown classes in the real world. Moreover, the learngene, i.e., the gene for learning initialization rules of the target model, is proposed to inherit the meta-knowledge from the collective model and reconstruct a lightweight individual model on the target task. Particularly, a novel criterion is proposed to discover the learngene in the collective model, according to the gradient information. Finally, the individual model is trained with only a few samples on the target learning tasks. We demonstrate the effectiveness of our approach in an extensive empirical study and theoretical analysis.
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v1 [stat.ML])
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: density estimation on the sphere, variational inference or hyperspherical auto-encoders.
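The sliced-Wasserstein recipe the paper builds on is easy to sketch in the Euclidean case: project samples onto random one-dimensional directions, where the Wasserstein distance has a closed form via sorting. The spherical variant replaces these projections with great circles via a new spherical Radon transform (not shown here); the code below is the standard Euclidean baseline, given as orientation only.

```python
import numpy as np

def sliced_wasserstein(X, Y, num_projections=100, rng=np.random.default_rng()):
    """Monte Carlo SW-2 between two equal-size samples in R^d."""
    d = X.shape[1]
    total = 0.0
    for _ in range(num_projections):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)           # random direction on the sphere
        xs, ys = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((xs - ys) ** 2)         # closed-form 1D W2^2 via sorting
    return np.sqrt(total / num_projections)
```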
    Open-Sampling: Exploring Out-of-Distribution data for Re-balancing Long-tailed datasets. (arXiv:2206.08802v1 [cs.LG])
    Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner would harm the generalization performance. In this work, we theoretically show that out-of-distribution data can still be leveraged to augment the minority classes from a Bayesian perspective. Based on this motivation, we propose a novel method called Open-sampling, which utilizes open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from our pre-defined distribution that is complementary to the distribution of original class priors. We empirically show that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.
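The complementary sampling idea can be sketched as follows; the exact pre-defined distribution is the paper's design choice, and the simple `1 - prior` form below is my assumption for illustration.

```python
import numpy as np

def complementary_label_dist(class_counts):
    """A label distribution that down-weights classes already frequent
    in the training set, so open-set samples re-balance the priors."""
    priors = class_counts / class_counts.sum()
    comp = 1.0 - priors
    return comp / comp.sum()

counts = np.array([5000, 500, 50])               # long-tailed toy example
dist = complementary_label_dist(counts)
open_set_labels = np.random.choice(len(counts), size=10, p=dist)
```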
    Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay. (arXiv:2206.08756v1 [math.ST])
We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also give the first rigorous evidence for the statistical-computational gap in scalar-on-tensor regression under the low-degree polynomials framework. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
    You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism. (arXiv:2110.14802v2 [cs.LG] UPDATED)
    I consider a setting where reviewers offer very noisy scores for several items for the selection of high-quality ones (e.g., peer review of large conference proceedings), whereas the owner of these items knows the true underlying scores but prefers not to provide this information. To address this withholding of information, in this paper, I introduce the Isotonic Mechanism, a simple and efficient approach to improving imprecise raw scores by leveraging certain information that the owner is incentivized to provide. This mechanism takes the ranking of the items from best to worst provided by the owner as input, in addition to the raw scores provided by the reviewers. It reports the adjusted scores for the items by solving a convex optimization problem. Under certain conditions, I show that the owner's optimal strategy is to honestly report the true ranking of the items to her best knowledge in order to maximize the expected utility. Moreover, I prove that the adjusted scores provided by this owner-assisted mechanism are significantly more accurate than the raw scores provided by the reviewers. This paper concludes with several extensions of the Isotonic Mechanism and some refinements of the mechanism for practical consideration.
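Since the adjusted scores are obtained by a convex projection of the raw scores onto the order reported by the owner, the mechanism can be sketched as an isotonic regression in the owner's ranking; the sketch below uses scikit-learn's IsotonicRegression as a stand-in solver under that reading.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_adjusted_scores(raw_scores, owner_ranking):
    """Project noisy review scores onto the owner-reported order.

    owner_ranking lists item indices from best to worst; the adjusted scores
    are the least-squares fit to the raw scores under the constraint that
    they are non-increasing in that order (a convex problem solved by
    standard isotonic regression).
    """
    raw = np.asarray(raw_scores, dtype=float)
    ordered = raw[owner_ranking]                       # best-to-worst order
    iso = IsotonicRegression(increasing=False)
    fitted = iso.fit_transform(np.arange(len(ordered)), ordered)
    adjusted = np.empty_like(raw)
    adjusted[owner_ranking] = fitted
    return adjusted

# Example: item 2 is ranked best by the owner but scored low by reviewers;
# items 1 and 2 get pooled to 4.5, item 0 stays at 3.
print(isotonic_adjusted_scores([3.0, 5.0, 4.0], owner_ranking=[2, 1, 0]))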
    DISCO: Comprehensive and Explainable Disinformation Detection. (arXiv:2203.04928v2 [cs.LG] UPDATED)
Disinformation refers to false information deliberately spread to influence the general public, and its negative impact on society can be observed in numerous issues, such as political agendas and the manipulation of financial markets. In this paper, we identify prevalent challenges and advances related to automated disinformation detection from multiple aspects and propose a comprehensive and explainable disinformation detection framework called DISCO. It leverages the heterogeneity of disinformation and addresses the opaqueness of prediction. We then provide a demonstration of DISCO on a real-world fake news detection task with satisfactory detection accuracy and explanation. The demo video and source code of DISCO are now publicly available. We expect that our demo could pave the way for addressing the limitations of identification, comprehension, and explainability as a whole.
    Explainability's Gain is Optimality's Loss? -- How Explanations Bias Decision-making. (arXiv:2206.08705v1 [cs.HC])
Decisions in organizations are about evaluating alternatives and choosing the one that would best serve organizational goals. To the extent that the evaluation of alternatives can be formulated as a predictive task with appropriate metrics, machine learning algorithms are increasingly being used to improve the efficiency of the process. Explanations help to facilitate communication between the algorithm and the human decision-maker, making it easier for the latter to interpret and make decisions on the basis of predictions by the former. However, because feature-based explanations carry the semantics of causal models, they induce leakage from the decision-maker's prior beliefs. Our findings from a field experiment demonstrate empirically how this leads to confirmation bias and a disparate impact on the decision-maker's confidence in the predictions. Such differences can lead to sub-optimal and biased decision outcomes.
    Mirror Descent with Relative Smoothness in Measure Spaces, with application to Sinkhorn and EM. (arXiv:2206.08873v1 [math.OC])
    Many problems in machine learning can be formulated as optimizing a convex functional over a space of measures. This paper studies the convergence of the mirror descent algorithm in this infinite-dimensional setting. Defining Bregman divergences through directional derivatives, we derive the convergence of the scheme for relatively smooth and strongly convex pairs of functionals. Applying our result to joint distributions and the Kullback--Leibler (KL) divergence, we show that Sinkhorn's primal iterations for entropic optimal transport in the continuous setting correspond to a mirror descent, and we obtain a new proof of its (sub)linear convergence. We also show that Expectation Maximization (EM) can always formally be written as a mirror descent, and, when optimizing on the latent distribution while fixing the mixtures, we derive sublinear rates of convergence.
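A finite-dimensional analogue may help build intuition: with the negative-entropy mirror map on the probability simplex, the Bregman divergence is the KL divergence and mirror descent reduces to a multiplicative update. This is a minimal sketch of that special case, not the measure-space scheme itself.

```python
import numpy as np

def entropic_mirror_descent(grad, x0, eta=0.1, steps=500):
    """Mirror descent on the probability simplex with the entropy mirror map.

    Each step is a multiplicative update followed by renormalization; this
    is the finite-dimensional special case of the measure-space scheme.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x * np.exp(-eta * grad(x))
        x = x / x.sum()
    return x

# Minimize a linear objective <c, x> over the simplex: the mass concentrates
# on the coordinate with the smallest cost.
c = np.array([0.3, 0.1, 0.8])
x_star = entropic_mirror_descent(lambda x: c, np.full(3, 1 / 3))
```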
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v1 [stat.ML])
    We describe a novel lossy compression approach called DiffC which is based on unconditional diffusion generative models. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as initial results for general distributions. Furthermore, we show that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.
    GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns. (arXiv:2104.03958v2 [cs.CL] UPDATED)
    Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns from textual data. The library integrates a first public implementation of the existing GrASP algorithm. It allows users to extract patterns using a number of general-purpose built-in linguistic attributes (such as hypernyms, part-of-speech tags, and syntactic dependency tags), as envisaged for the original algorithm, as well as domain-specific custom attributes which can be incorporated into the library by implementing two functions. The library is equipped with a web-based interface empowering human users to conveniently explore data via the extracted patterns, using complementary pattern-centric and example-centric views: the former includes a reading in natural language and statistics of each extracted pattern; the latter shows applications of each extracted pattern to training examples. We demonstrate the usefulness of the library in classification (spam detection and argument mining), model analysis (machine translation), and artifact discovery in datasets (SNLI and 20Newsgroups).
    Decentralized adaptive clustering of deep nets is beneficial for client collaboration. (arXiv:2206.08839v1 [cs.LG])
    We study the problem of training personalized deep learning models in a decentralized peer-to-peer setting, focusing on the setting where data distributions differ between the clients and where different clients have different local learning tasks. We study both covariate and label shift, and our contribution is an algorithm which for each client finds beneficial collaborations based on a similarity estimate for the local task. Our method does not rely on hyperparameters which are hard to estimate, such as the number of client clusters, but rather continuously adapts to the network topology using soft cluster assignment based on a novel adaptive gossip algorithm. We test the proposed method in various settings where data is not independent and identically distributed among the clients. The experimental evaluation shows that the proposed method performs better than previous state-of-the-art algorithms for this problem setting, and handles situations well where previous methods fail.
    Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features. (arXiv:2111.03740v2 [cs.LG] UPDATED)
Machine learning has demonstrated remarkable prediction accuracy over i.i.d. data, but this accuracy often drops when the model is tested with data from another distribution. In this paper, we offer another view of this problem, under the assumption that the accuracy drop stems from the models' reliance on features that are not well aligned with what a data annotator would consider similar across the two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in the robust machine learning literature. We also compare the empirical strength of these methods and demonstrate the performance when these previous techniques are combined, with an implementation available at https://github.com/OoDBag/WR
    Incorporating intratumoral heterogeneity into weakly-supervised deep learning models via variance pooling. (arXiv:2206.08885v1 [eess.IV])
    Supervised learning tasks such as cancer survival prediction from gigapixel whole slide images (WSIs) are a critical challenge in computational pathology that requires modeling complex features of the tumor microenvironment. These learning tasks are often solved with deep multi-instance learning (MIL) models that do not explicitly capture intratumoral heterogeneity. We develop a novel variance pooling architecture that enables a MIL model to incorporate intratumoral heterogeneity into its predictions. Two interpretability tools based on representative patches are illustrated to probe the biological signals captured by these models. An empirical study with 4,479 gigapixel WSIs from the Cancer Genome Atlas shows that adding variance pooling onto MIL frameworks improves survival prediction performance for five cancer types.
    Maximum Class Separation as Inductive Bias in One Matrix. (arXiv:2206.08704v1 [cs.LG])
    Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs with class vectors and separating class vectors angularly. This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations. The main observation behind our approach is that separation does not require optimization but can be solved in closed-form prior to training and plugged into a network. We outline a recursive approach to obtain the matrix consisting of maximally separable vectors for any number of classes, which can be added with negligible engineering effort and computational overhead. Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet. We find empirically that maximum separation works best as a fixed bias; making the matrix learnable adds nothing to the performance. The closed-form implementation and code to reproduce the experiments are on github.
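The fixed matrix can be built in closed form before training. Below is a sketch of a recursive construction of maximally separated unit vectors (a regular simplex, with pairwise inner products -1/(k-1) for k classes), consistent with the approach described; the helper name and usage line are illustrative.

```python
import numpy as np

def max_separation_matrix(num_classes):
    """Closed-form (num_classes-1) x num_classes matrix of class vectors.

    Columns are unit vectors forming a regular simplex, so every pair has
    inner product -1/(num_classes-1), the maximum possible separation.
    Built recursively, before training, with no optimization.
    """
    if num_classes == 2:
        return np.array([[1.0, -1.0]])
    k = num_classes - 1
    prev = max_separation_matrix(k)
    top = np.concatenate(([1.0], np.full(k, -1.0 / k)))
    bottom = np.concatenate(
        (np.zeros((k - 1, 1)), np.sqrt(1.0 - 1.0 / k**2) * prev), axis=1
    )
    return np.vstack((top, bottom))

# Fixed (non-learnable) matrix: apply it to the penultimate features and
# feed the result to the softmax, e.g. logits = features @ M for 10 classes.
M = max_separation_matrix(10)        # shape (9, 10)
```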
    Beyond Ridge Regression for Distribution-Free Data. (arXiv:2206.08757v1 [cs.LG])
In supervised batch learning, the predictive normalized maximum likelihood (pNML) has been proposed as the min-max regret solution for the distribution-free setting, where no distributional assumptions are made on the data. However, the pNML is not defined for a large-capacity hypothesis class such as over-parameterized linear regression. For a large class, a common approach is to use regularization or a model prior. In the context of online prediction, where the min-max solution is the Normalized Maximum Likelihood (NML), it has been suggested to use NML with "luckiness": a prior-like function is applied to the hypothesis class, which reduces its effective size. Motivated by the luckiness concept, for linear regression we incorporate a luckiness function that penalizes the hypothesis proportionally to its l2 norm. This leads to the ridge regression solution. The associated pNML with luckiness (LpNML) prediction deviates from the ridge regression empirical risk minimizer (Ridge ERM): when the test data reside in the subspace corresponding to the small eigenvalues of the empirical correlation matrix of the training data, the prediction is shifted toward 0. Our LpNML reduces the Ridge ERM error by up to 20% for the PMLB sets, and is up to 4.9% more robust in the presence of distribution shift compared to recent leading methods for the UCI sets.
    AutoML Two-Sample Test. (arXiv:2206.08843v1 [cs.LG])
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
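A hand-rolled sketch of the witness-function statistic follows, with a generic gradient-boosting regressor standing in for the AutoML fit and the permutation-based significance calibration omitted; function and variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def witness_mean_discrepancy(X, Y, seed=0):
    """Two-sample statistic from a learned witness function.

    Fit a regressor toward +1 on sample X and -1 on sample Y (squared loss)
    on one half of the data, then report the witness's mean discrepancy on
    the held-out halves. Significance would be calibrated by permutation.
    """
    nx, ny = len(X) // 2, len(Y) // 2
    X_tr, X_te = X[:nx], X[nx:]
    Y_tr, Y_te = Y[:ny], Y[ny:]
    Z = np.vstack([X_tr, Y_tr])
    t = np.concatenate([np.ones(len(X_tr)), -np.ones(len(Y_tr))])
    witness = GradientBoostingRegressor(random_state=seed).fit(Z, t)
    return witness.predict(X_te).mean() - witness.predict(Y_te).mean()
```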
    BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers. (arXiv:2206.08680v1 [cs.CL])
    Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.
    Nudge: Accelerating Overdue Pull Requests Towards Completion. (arXiv:2011.12468v5 [cs.SE] UPDATED)
    Pull requests are a key part of the collaborative software development and code review process today. However, pull requests can also slow down the software development process when the reviewer(s) or the author do not actively engage with the pull request. In this work, we design an end-to-end service, Nudge, for accelerating overdue pull requests towards completion by reminding the author or the reviewer(s) to engage with their overdue pull requests. First, we use models based on effort estimation and machine learning to predict the completion time for a given pull request. Second, we use activity detection to filter out pull requests that may be overdue, but for which sufficient action is taking place nonetheless. Lastly, we use actor identification to understand who the blocker of the pull request is and nudge the appropriate actor (author or reviewer(s)). The key novelty of Nudge is that it succeeds in reducing pull request resolution time, while ensuring that developers perceive the notifications sent as useful, at the scale of thousands of repositories. In a randomized trial on 147 repositories in use at Microsoft, Nudge was able to reduce pull request resolution time by 60% for 8,500 pull requests, when compared to overdue pull requests for which Nudge did not send a notification. Furthermore, developers receiving Nudge notifications resolved 73% of these notifications as positive. We observed similar results when scaling up the deployment of Nudge to 8,000 repositories at Microsoft, for which Nudge sent 210,000 notifications during a full year. This demonstrates Nudge's ability to scale to thousands of repositories. Lastly, our qualitative analysis of a selection of Nudge notifications indicates areas for future research, such as taking dependencies among pull requests and developer availability into account.
    The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks. (arXiv:2206.08615v1 [cs.LG])
    Many feedforward neural networks generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is an affine function. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL mappings. Although the precise determination of this quantity is often out of reach, bounds have been proposed for specific architectures, including the well-known ReLU and Maxout networks. In this work, we propose a more general perspective and provide precise bounds on the maximal number of linear regions of CPWL networks based on three sources of expressiveness: depth, width, and activation complexity. Our estimates rely on the combinatorial structure of convex partitions and highlight the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL network architecture. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This yields an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.
    Understanding Decision-Time vs. Background Planning in Model-Based Reinforcement Learning. (arXiv:2206.08442v1 [cs.LG])
    In model-based reinforcement learning, an agent can leverage a learned model to improve its way of behaving in different ways. Two prevalent approaches are decision-time planning and background planning. In this study, we are interested in understanding under what conditions and in which settings one of these two planning styles will perform better than the other in domains that require fast responses. After viewing them through the lens of dynamic programming, we first consider the classical instantiations of these planning styles and provide theoretical results and hypotheses on which one will perform better in the pure planning, planning & learning, and transfer learning settings. We then consider the modern instantiations of these planning styles and provide hypotheses on which one will perform better in the last two of the considered settings. Lastly, we perform several illustrative experiments to empirically validate both our theoretical results and hypotheses. Overall, our findings suggest that even though decision-time planning does not perform as well as background planning in their classical instantiations, in their modern instantiations, it can perform on par or better than background planning in both the planning & learning and transfer learning settings.
    A Deep Learning Approach for the Segmentation of Electroencephalography Data in Eye Tracking Applications. (arXiv:2206.08672v1 [cs.LG])
The collection of eye gaze information provides a window into many critical aspects of human cognition, health, and behaviour. Additionally, many neuroscientific studies complement the behavioural information gained from eye tracking with the high temporal resolution and neurophysiological markers provided by electroencephalography (EEG). One of the essential eye-tracking software processing steps is the segmentation of the continuous data stream into events relevant to eye-tracking applications, such as saccades, fixations, and blinks. Here, we introduce DETRtime, a novel framework for time-series segmentation that creates ocular event detectors that do not require an additionally recorded eye-tracking modality and rely solely on EEG data. Our end-to-end deep learning-based framework brings recent advances in Computer Vision to the forefront of time-series segmentation of EEG data. DETRtime achieves state-of-the-art performance in ocular event detection across diverse eye-tracking experiment paradigms. In addition, we provide evidence that our model generalizes well in the task of EEG sleep stage segmentation.
    ReViSe: Remote Vital Signs Measurement Using Smartphone Camera. (arXiv:2206.08748v1 [cs.CV])
Remote Photoplethysmography (rPPG) is a fast, effective, inexpensive, and convenient method for collecting biometric data, as it enables vital signs estimation using face videos. Remote contactless medical service provisioning has proven to be a dire necessity during the COVID-19 pandemic. We propose an end-to-end framework to measure people's vital signs, including Heart Rate (HR), Heart Rate Variability (HRV), Oxygen Saturation (SpO2), and Blood Pressure (BP), based on the rPPG methodology, from the video of a user's face captured with a smartphone camera. We extract face landmarks with a deep learning-based neural network model in real-time. Multiple face patches, also called Regions of Interest (RoIs), are extracted using the predicted face landmarks. Several filters are applied to reduce noise in the cardiac signal, called the Blood Volume Pulse (BVP) signal, extracted from the RoIs. We trained and validated machine learning models using two public rPPG datasets, namely the TokyoTech rPPG and the Pulse Rate Detection (PURE) datasets, on which our models achieved the following Mean Absolute Errors (MAE): a) for HR, 1.73 and 3.95 Beats-Per-Minute (bpm) respectively, b) for HRV, 18.55 and 25.03 ms respectively, and c) for SpO2, a MAE of 1.64 on the PURE dataset. We validated our end-to-end rPPG framework, ReViSe, in a real-life environment, and thereby created the Video-HR dataset. Our HR estimation model achieved a MAE of 2.49 bpm on this dataset. Since no publicly available rPPG datasets existed for BP measurement with face videos, we used a dataset with signals from a fingertip sensor to train our model and also created our own video dataset, Video-BP. On our Video-BP dataset, our BP estimation model achieved a MAE of 6.7 mmHg for Systolic Blood Pressure (SBP), and a MAE of 9.6 mmHg for Diastolic Blood Pressure (DBP).
    Active Sampling for Min-Max Fairness. (arXiv:2006.06879v3 [stat.ML] UPDATED)
    We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
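A minimal sketch of the worst-group sampling loop for logistic regression follows, assuming binary labels in {0, 1} and using scikit-learn's SGDClassifier as the loss-minimizing learner; the exact update schedule and reweighting in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def minmax_fair_fit(X, y, groups, steps=2000, seed=0):
    """Active-sampling sketch: always update on the worst-off group."""
    rng = np.random.default_rng(seed)
    clf = SGDClassifier(loss="log_loss", random_state=seed)
    clf.partial_fit(X[:16], y[:16], classes=np.array([0, 1]))  # warm start
    group_ids = np.unique(groups)
    for _ in range(steps):
        # Mean log-loss per group under the current model.
        losses = []
        for g in group_ids:
            m = groups == g
            p = np.clip(clf.predict_proba(X[m])[:, 1], 1e-9, 1 - 1e-9)
            losses.append(-np.mean(y[m] * np.log(p) + (1 - y[m]) * np.log(1 - p)))
        worst = group_ids[int(np.argmax(losses))]
        i = rng.choice(np.flatnonzero(groups == worst))   # a point from it
        clf.partial_fit(X[i:i + 1], y[i:i + 1])
    return clf
```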
    Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control. (arXiv:2206.08520v1 [cs.LG])
    Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control, TSAC, that attains $\tilde O(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that the TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of TSAC in several adaptive control tasks.
    Understanding Robust Overfitting of Adversarial Training and Beyond. (arXiv:2206.08675v1 [cs.LG])
Robust overfitting widely exists in adversarial training of deep networks. The exact underlying reasons for this are still not completely understood. Here, we explore the causes of robust overfitting by comparing the data distributions of non-overfit (weak adversary) and overfitted (strong adversary) adversarial training, and observe that the adversarial data generated by the weak adversary mainly consist of small-loss data, whereas the adversarial data generated by the strong adversary are more diversely distributed over both large-loss and small-loss data. Given these observations, we further design data-ablation adversarial training and identify that some small-loss data which are not worthy of the adversary strength cause robust overfitting in the strong-adversary mode. To relieve this issue, we propose minimum loss constrained adversarial training (MLCAT): in a minibatch, we learn large-loss data as usual, and adopt additional measures to increase the loss of the small-loss data. Technically, MLCAT hinders data fitting when data become easy to learn, preventing robust overfitting; philosophically, MLCAT reflects the spirit of turning waste into treasure and making the best use of each adversarial datum; algorithmically, we design two realizations of MLCAT, and extensive experiments demonstrate that MLCAT can eliminate robust overfitting and further boost adversarial robustness.
    Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs. (arXiv:2206.08709v1 [cs.CL])
    Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to $20$ points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
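Assuming label embeddings have already been computed (e.g., with a sentence-embedding model), one matching step could look like the following cosine-similarity assignment; the paper's exact matching algorithm is not spelled out here, so this is an illustrative stand-in.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_entity_labels(emb_src, emb_tgt):
    """One-to-one cross-lingual label matching from sentence embeddings.

    emb_src and emb_tgt are (n, d) embedding matrices of the labels in two
    languages; the assignment maximizes total cosine similarity.
    """
    a = emb_src / np.linalg.norm(emb_src, axis=1, keepdims=True)
    b = emb_tgt / np.linalg.norm(emb_tgt, axis=1, keepdims=True)
    similarity = a @ b.T
    rows, cols = linear_sum_assignment(-similarity)   # maximize similarity
    return list(zip(rows.tolist(), cols.tolist()))
```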
    Improving Generalization of Metric Learning via Listwise Self-distillation. (arXiv:2206.08880v1 [cs.CV])
    Most deep metric learning (DML) methods employ a strategy that forces all positive samples to be close in the embedding space while keeping them away from negative ones. However, such a strategy ignores the internal relationships of positive (negative) samples and often leads to overfitting, especially in the presence of hard samples and mislabeled samples. In this work, we propose a simple yet effective regularization, namely Listwise Self-Distillation (LSD), which progressively distills a model's own knowledge to adaptively assign a more appropriate distance target to each sample pair in a batch. LSD encourages smoother embeddings and information mining within positive (negative) samples as a way to mitigate overfitting and thus improve generalization. Our LSD can be directly integrated into general DML frameworks. Extensive experiments show that LSD consistently boosts the performance of various metric learning methods on multiple datasets.
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v2 [cs.LG] UPDATED)
Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to repetitive server-client synchronization and the lack of adaptivity of SGD-based model updates. Although various methods have been proposed for reducing the communication cost via gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, the current federated learning framework still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.
    XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence. (arXiv:2206.08474v1 [cs.SE])
Recent advances in machine learning have significantly improved the understanding of source code data and achieved good performance on a number of downstream tasks. Open-source repositories like GitHub enable this process with rich unlabeled code data. However, the lack of high-quality labeled data has largely hindered the progress of several code-related tasks, such as program translation, summarization, synthesis, and code search. This paper introduces XLCoST, the Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data from 8 languages (7 commonly used programming languages and English), and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code both in terms of size and the number of languages. We also provide the performance of several state-of-the-art baseline models for each task. We believe this new dataset can be a valuable asset for the research community and facilitate the development and validation of new methods for cross-lingual code intelligence.
    Local overlap reduction procedure for dynamic ensemble selection. (arXiv:2206.08455v1 [cs.LG])
Class imbalance is a characteristic known for making learning more challenging for classification models, as they may end up biased towards the majority class. A promising approach among the ensemble-based methods in the context of imbalanced learning is Dynamic Selection (DS). DS techniques single out a subset of the classifiers in the ensemble to label each given unknown sample according to their estimated competence in the area surrounding the query. Because only a small region is taken into account in the selection scheme, the global class disproportion may have less impact on the system's performance. However, the presence of local class overlap may severely hinder the performance of DS techniques over imbalanced distributions, as it not only exacerbates the effects of under-representation but also introduces ambiguous and possibly unreliable samples to the competence estimation process. Thus, in this work, we propose a DS technique which attempts to minimize the effects of local class overlap during the classifier selection procedure. The proposed method iteratively removes from the target region the instance perceived as the hardest to classify until a classifier is deemed competent to label the query sample. The known samples are characterized using instance hardness measures that quantify the local class overlap. Experimental results show that the proposed technique can significantly outperform the baseline as well as several other DS techniques, suggesting its suitability for dealing with class under-representation and overlap. Furthermore, the proposed technique still yielded competitive results when using an under-sampled, less overlapped version of the labelled sets, especially over the problems with a high proportion of minority class samples in overlap areas. Code available at https://github.com/marianaasouza/lords.
    Debugging using Orthogonal Gradient Descent. (arXiv:2206.08489v1 [cs.LG])
In this report we consider the following problem: given a trained model that is partially faulty, can we correct its behaviour without having to train the model from scratch? In other words, can we "debug" neural networks similar to how we address bugs in our mathematical models and standard computer code? We base our approach on the hypothesis that debugging can be treated as a two-task continual learning problem. In particular, we employ a modified version of a continual learning algorithm called Orthogonal Gradient Descent (OGD) to demonstrate, via two simple experiments on the MNIST dataset, that we can in fact unlearn the undesirable behaviour while retaining the general performance of the model, and that we can additionally relearn the appropriate behaviour, both without having to train the model from scratch.
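The core of OGD is to project each new gradient away from directions that would change behaviour learned earlier. A minimal sketch of that projection step, assuming an orthonormal basis of stored gradients:

```python
import numpy as np

def ogd_project(grad, basis):
    """Project a new-task gradient orthogonally to stored gradient directions.

    basis is a list of orthonormal vectors spanning gradients of the task
    whose behaviour must be preserved; removing those components lets the
    update change the faulty behaviour while (to first order) leaving the
    retained behaviour untouched.
    """
    g = grad.astype(float).copy()
    for b in basis:
        g -= (g @ b) * b
    return g
```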
    SATBench: Benchmarking the speed-accuracy tradeoff in object recognition by humans and dynamic neural networks. (arXiv:2206.08427v1 [cs.CV])
The core of everyday tasks like reading and driving is active object recognition. Attempts to model such tasks are currently stymied by the inability to incorporate time. People show a flexible tradeoff between speed and accuracy, and this tradeoff is a crucial human skill. Deep neural networks have emerged as promising candidates for predicting peak human object recognition performance and neural activity. However, modeling the temporal dimension, i.e., the speed-accuracy tradeoff (SAT), is essential for them to serve as useful computational models of how humans recognize objects. To this end, we here present the first large-scale (148 observers, 4 neural networks, 8 tasks) dataset of the speed-accuracy tradeoff (SAT) in recognizing ImageNet images. In each human trial, a beep, indicating the desired reaction time, sounds at a fixed delay after the image is presented, and the observer's response counts only if it occurs near the time of the beep. In a series of blocks, we test many beep latencies, i.e., reaction times. We observe that human accuracy increases with reaction time and proceed to compare its characteristics with the behavior of several dynamic neural networks that are capable of inference-time adaptive computation. Using FLOPs as an analog for reaction time, we compare networks with humans on curve-fit error, category-wise correlation, and curve steepness, and conclude that cascaded dynamic neural networks are a promising model of human reaction time in object recognition tasks.
    Quantifying Feature Contributions to Overall Disparity Using Information Theory. (arXiv:2206.08454v1 [cs.LG])
When a machine-learning algorithm makes biased decisions, it can be helpful to understand the sources of disparity to explain why the bias exists. Towards this, we examine the problem of quantifying the contribution of each individual feature to the observed disparity. If we have access to the decision-making model, one potential approach (inspired by intervention-based approaches in the explainability literature) is to vary each individual feature (while keeping the others fixed) and use the resulting change in disparity to quantify its contribution. However, we may not have access to the model or be able to test/audit its outputs for individually varying features. Furthermore, the decision may not always be a deterministic function of the input features (e.g., with human-in-the-loop). For these situations, we might need to explain contributions using purely distributional (i.e., observational) techniques, rather than interventional. We ask the question: what is the "potential" contribution of each individual feature to the observed disparity in the decisions when the exact decision-making mechanism is not accessible? We first provide canonical examples (thought experiments) that help illustrate the difference between distributional and interventional approaches to explaining contributions, and when either is better suited. When unable to intervene on the inputs, we quantify the "redundant" statistical dependency about the protected attribute that is present in both the final decision and an individual feature, by leveraging a body of work in information theory called Partial Information Decomposition. We also perform a simple case study to show how this technique could be applied to quantify contributions.
    Towards a multi-stakeholder value-based assessment framework for algorithmic systems. (arXiv:2205.04525v2 [cs.LG] UPDATED)
In an effort to regulate Machine Learning-driven (ML) systems, current auditing processes mostly focus on detecting harmful algorithmic biases. While these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ML-driven systems are still underrepresented in auditing processes. Such unaddressed values mainly deal with contextual factors that cannot be easily quantified. In this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. Our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. In order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. However, some of these value-specific criteria are mutually exclusive and require negotiation. As opposed to some other auditing frameworks that merely rely on ML researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. To that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. We, therefore, contribute to current ML auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
    The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention. (arXiv:2202.05798v2 [cs.LG] UPDATED)
Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the 1960s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of them growing linearly with the number of training patterns which may get very large. However, this dual formulation offers a possibility of directly visualising how an NN makes use of training patterns at test time, by examining the corresponding attention weights. We conduct experiments on small scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as language modelling, and discuss potentials and limits of this view for better understanding and interpreting how NNs exploit training patterns. Our code is public.
    Local Augmentation for Graph Neural Networks. (arXiv:2109.03856v3 [cs.LG] UPDATED)
Graph Neural Networks (GNNs) have achieved remarkable performance on graph-based tasks. The key idea for GNNs is to obtain informative representation through aggregating information from local neighborhoods. However, it remains an open question whether the neighborhood information is adequately aggregated for learning representations of nodes with few neighbors. To address this, we propose a simple and efficient data augmentation strategy, local augmentation, to learn the distribution of the node representations of the neighbors conditioned on the central node's representation and enhance GNN's expressive power with generated features. Local augmentation is a general framework that can be applied to any GNN model in a plug-and-play manner. It samples feature vectors associated with each node from the learned conditional distribution as additional input for the backbone model at each training iteration. Extensive experiments and analyses show that local augmentation consistently yields performance improvement when applied to various GNN architectures across a diverse set of benchmarks. For example, experiments show that plugging in local augmentation to GCN and GAT improves by an average of 3.4% and 1.6% in terms of test accuracy on Cora, Citeseer, and Pubmed. Besides, our experimental results on large graphs (OGB) show that our model consistently improves performance over backbones. Code is available at https://github.com/SongtaoLiu0823/LAGNN.
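The paper learns a conditional generative model of neighbor features; as a much simpler stand-in for that generator, the sketch below samples one extra feature vector per node from a Gaussian fit to its neighborhood.

```python
import numpy as np

def local_augment(features, adj_list, seed=0):
    """Sample one extra feature vector per node from its local neighborhood.

    features: (n, d) node feature matrix; adj_list[v] lists v's neighbors.
    A diagonal Gaussian fit to the neighbors stands in for the learned
    conditional generator; the sample is fed to the GNN as extra input.
    """
    rng = np.random.default_rng(seed)
    augmented = np.empty_like(features, dtype=float)
    for v, neighbors in enumerate(adj_list):
        nb = features[neighbors] if len(neighbors) > 0 else features[v:v + 1]
        mu, sigma = nb.mean(axis=0), nb.std(axis=0) + 1e-6
        augmented[v] = rng.normal(mu, sigma)
    return augmented
```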
    Human Interpretation of Saliency-based Explanation Over Text. (arXiv:2201.11569v2 [cs.CL] UPDATED)
    While a lot of research in explainable AI focuses on producing effective explanations, less work is devoted to the question of how people understand and interpret the explanation. In this work, we focus on this question through a study of saliency-based explanations over textual data. Feature-attribution explanations of text models aim to communicate which parts of the input text were more influential than others towards the model decision. Many current explanation methods, such as gradient-based or Shapley value-based methods, provide measures of importance which are well-understood mathematically. But how does a person receiving the explanation (the explainee) comprehend it? And does their understanding match what the explanation attempted to communicate? We empirically investigate the effect of various factors of the input, the feature-attribution explanation, and visualization procedure, on laypeople's interpretation of the explanation. We query crowdworkers for their interpretation on tasks in English and German, and fit a GAMM model to their responses considering the factors of interest. We find that people often mis-interpret the explanations: superficial and unrelated factors, such as word length, influence the explainees' importance assignment despite the explanation communicating importance directly. We then show that some of this distortion can be attenuated: we propose a method to adjust saliencies based on model estimates of over- and under-perception, and explore bar charts as an alternative to heatmap saliency visualization. We find that both approaches can attenuate the distorting effect of specific factors, leading to better-calibrated understanding of the explanation.
    Large-Margin Representation Learning for Texture Classification. (arXiv:2206.08537v1 [cs.CV])
This paper presents a novel approach combining convolutional layers (CLs) and large-margin metric learning for training supervised models on small datasets for texture classification. The core of such an approach is a loss function that computes the distances between instances of interest and support vectors. The objective is to update the weights of CLs iteratively to learn a representation with a large margin between classes. Each iteration results in a large-margin discriminant model represented by support vectors based on such a representation. The advantage of the proposed approach w.r.t. convolutional neural networks (CNNs) is two-fold. First, it allows representation learning with a small amount of data due to the reduced number of parameters compared to an equivalent CNN. Second, it has a low training cost since the backpropagation considers only support vectors. The experimental results on texture and histopathologic image datasets have shown that the proposed approach achieves competitive accuracy with lower computational cost and faster convergence when compared to equivalent CNNs.
    A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features. (arXiv:2206.08473v1 [cs.LG])
Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data. However, the numerical node features utilized by GNNs are commonly extracted from raw data which is of text or tabular (numeric/categorical) type in most real-world applications. The best models for such data types in most standard supervised learning settings with IID (non-graph) data are not simple neural network layers and thus are not easily incorporated into a GNN. Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data, which are ensembled and stacked in multiple layers. Our layer-wise framework leverages bagging and stacking strategies to enjoy strong generalization, in a manner which effectively mitigates label leakage and overfitting. Across a variety of graph datasets with tabular/text node features, our method achieves comparable or superior performance relative to both tabular/text and graph neural network models, as well as existing state-of-the-art hybrid strategies that combine the two.
    Geometrically Guided Integrated Gradients. (arXiv:2206.05903v2 [cs.CV] UPDATED)
Interpretability methods for deep neural networks mainly focus on the sensitivity of the class score with respect to the original or perturbed input, usually measured using actual or modified gradients. Some methods also use a model-agnostic approach to understanding the rationale behind every prediction. In this paper, we argue and demonstrate that local geometry of the model parameter space relative to the input can also be beneficial for improved post-hoc explanations. To achieve this goal, we introduce an interpretability method called "geometrically-guided integrated gradients" that builds on top of the gradient calculation along a linear path as traditionally used in integrated gradient methods. However, instead of integrating gradient information, our method explores the model's dynamic behavior from multiple scaled versions of the input and captures the best possible attribution for each input. We demonstrate through extensive experiments that the proposed approach outperforms vanilla and integrated gradients in subjective and quantitative assessment. We also propose a "model perturbation" sanity check to complement the traditionally used "model randomization" test.
    Sanity Simulations for Saliency Methods. (arXiv:2105.06506v3 [cs.LG] UPDATED)
    Saliency methods are a popular class of feature attribution explanation methods that aim to capture a model's predictive reasoning by identifying "important" pixels in an input image. However, the development and adoption of these methods are hindered by the lack of access to ground-truth model reasoning, which prevents accurate evaluation. In this work, we design a synthetic benchmarking framework, SMERF, that allows us to perform ground-truth-based evaluation while controlling the complexity of the model's reasoning. Experimentally, SMERF reveals significant limitations in existing saliency methods and, as a result, represents a useful tool for the development of new saliency methods.
    OpenSRH: optimizing brain tumor surgery using intraoperative stimulated Raman histology. (arXiv:2206.08439v1 [eess.IV])
Accurate intraoperative diagnosis is essential for providing safe and effective care during brain tumor surgery. Our standard-of-care diagnostic methods are time, resource, and labor intensive, which restricts access to optimal surgical treatments. To address these limitations, we propose an alternative workflow that combines stimulated Raman histology (SRH), a rapid optical imaging method, with deep learning-based automated interpretation of SRH images for intraoperative brain tumor diagnosis and real-time surgical decision support. Here, we present OpenSRH, the first public dataset of clinical SRH images from 300+ brain tumor patients and 1300+ unique whole slide optical images. OpenSRH contains data from the most common brain tumor diagnoses, full pathologic annotations, whole slide tumor segmentations, and raw and processed optical imaging data for end-to-end model development and validation. We provide a framework for patch-based whole slide SRH classification and inference using weak (i.e., patient-level) diagnostic labels. Finally, we benchmark two computer vision tasks: multiclass histologic brain tumor classification and patch-based contrastive representation learning. We hope OpenSRH will facilitate the clinical translation of rapid optical imaging and real-time ML-based surgical decision support in order to improve the access, safety, and efficacy of cancer surgery in the era of precision medicine. Dataset access, code, and benchmarks are available at opensrh.mlins.org.
    Empirical Bayesian Approaches for Robust Constraint-based Causal Discovery under Insufficient Data. (arXiv:2206.08448v1 [cs.LG])
Causal discovery aims to learn cause-effect relationships among variables given observational data and is important for many applications. Existing causal discovery methods assume data sufficiency, which may not be the case in many real-world datasets. As a result, many existing causal discovery methods can fail under limited data. In this work, we propose Bayesian-augmented frequentist independence tests to improve the performance of constraint-based causal discovery methods under insufficient data: 1) we first introduce a Bayesian method to estimate mutual information (MI), based on which we propose a robust MI-based independence test; 2) we then consider the Bayesian estimation of hypothesis likelihood and incorporate it into a well-defined statistical test, resulting in a robust statistical-testing-based independence test. We apply the proposed independence tests to constraint-based causal discovery methods and evaluate the performance on benchmark datasets with insufficient samples. Experiments show significant performance improvement in terms of both accuracy and efficiency over SOTA methods.
    Learning over All Stabilizing Nonlinear Controllers for a Partially-Observed Linear System. (arXiv:2112.04219v3 [eess.SY] UPDATED)
This paper proposes a nonlinear policy architecture for control of partially-observed linear dynamical systems providing built-in closed-loop stability guarantees. The policy is based on a nonlinear version of the Youla parameterization, and augments a known stabilizing linear controller with a nonlinear operator from a recently developed class of dynamic neural network models called the recurrent equilibrium network (REN). We prove that RENs are universal approximators of contracting and Lipschitz nonlinear systems, and subsequently show that the proposed Youla-REN architecture is a universal approximator of stabilizing nonlinear controllers. The REN architecture simplifies learning since unconstrained optimization can be applied, and we consider both a model-based case where exact gradients are available and reinforcement learning using random search with zeroth-order oracles. In simulation examples, our method converges faster to better controllers and is more scalable than existing methods, while guaranteeing stability during learning transients.
    Learning a Single Neuron with Adversarial Label Noise via Gradient Descent. (arXiv:2206.08918v1 [cs.LG])
We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\, \epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant-factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
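The algorithm itself is plain gradient descent on the (regularized) squared loss. A minimal sketch for the ReLU activation, with the step size and iteration count as illustrative defaults:

```python
import numpy as np

def learn_single_neuron(X, y, lr=1e-2, steps=5000, lam=0.0):
    """Gradient descent on the (regularized) squared loss for a ReLU neuron.

    Minimizes (1/n) * sum_i (relu(w . x_i) - y_i)^2 + lam * ||w||^2.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        z = X @ w
        residual = (np.maximum(z, 0.0) - y) * (z > 0)   # chain rule for ReLU
        w -= lr * (2.0 * X.T @ residual / n + 2.0 * lam * w)
    return w
```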
    Learning to Teach Fairness-aware Deep Multi-task Learning. (arXiv:2206.08403v1 [cs.LG])
Fairness-aware learning mainly focuses on single task learning (STL). The fairness implications of multi-task learning (MTL) have only recently been considered, and a seminal approach has been proposed that considers the fairness-accuracy trade-off for each task and the performance trade-off among different tasks. Instead of a rigid fairness-accuracy trade-off formulation, we propose a flexible approach that learns how to be fair in an MTL setting by selecting which objective (accuracy or fairness) to optimize at each step. We introduce L2T-FMT, a teacher-student network trained collaboratively: the student learns to solve the fair MTL problem while the teacher instructs the student to learn from either accuracy or fairness, depending on which is harder to learn for each task. Moreover, this dynamic selection of which objective to use at each step for each task reduces the number of trade-off weights from 2T to T, where T is the number of tasks. Our experiments on three real datasets show that L2T-FMT improves on both fairness (12-19%) and accuracy (up to 2%) over state-of-the-art approaches.  ( 2 min )
    Resolution Limits of Non-Adaptive 20 Questions Search for a Moving Target. (arXiv:2206.08884v1 [cs.IT])
    Using the 20 questions estimation framework with query-dependent noise, we study non-adaptive search strategies for a moving target over the unit cube with unknown initial location and velocities under a piecewise constant velocity model. In this search problem, there is an oracle who knows the instantaneous location of the target at any time. Our task is to query the oracle as few times as possible to accurately estimate the location of the target at any specified time. We first study the case where the oracle's answer to each query is corrupted by discrete noise and then generalize our results to the case of additive white Gaussian noise. In our formulation, the performance criterion is the resolution, which is defined as the maximal $L_\infty$ distance between the true locations and estimated locations. We characterize the minimal resolution of an optimal non-adaptive query procedure with a finite number of queries by deriving non-asymptotic and asymptotic bounds. Our bounds are tight in the first-order asymptotic sense when the number of queries satisfies a certain condition and our bounds are tight in the stronger second-order asymptotic sense when the target moves with a constant velocity. To prove our results, we relate the current problem to channel coding, borrow ideas from finite blocklength information theory and construct bounds on the number of possible quantized target trajectories.
    Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms. (arXiv:2206.08776v1 [cs.LG])
We generalize the multiple-play multi-armed bandits (MP-MAB) problem to a shareable-arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it equals the "per-load" reward multiplied by either the number of plays pulling the arm, or by its reward capacity when the number of plays exceeds the capacity limit. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards, as well as a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound matches that of the lower bound, and its second and third terms clearly correspond to terms in the lower bound as well. Extensive experiments validate our algorithm's performance and its gain in 5G & 4G base station selection.
    A Convergence Theory for SVGD in the Population Limit under Talagrand's Inequality T1. (arXiv:2106.03076v2 [cs.LG] UPDATED)
Stein Variational Gradient Descent (SVGD) is an algorithm for sampling from a target density which is known up to a multiplicative constant. Although SVGD is a popular algorithm in practice, its theoretical study is limited to a few recent works. We study the convergence of SVGD in the population limit (i.e., with an infinite number of particles) to sample from a non-logconcave target distribution satisfying Talagrand's inequality T1. We first establish the convergence of the algorithm. Then, we establish a dimension-dependent complexity bound in terms of the Kernelized Stein Discrepancy (KSD). Unlike existing works, we do not assume that the KSD is bounded along the trajectory of the algorithm. Our approach relies on interpreting SVGD as a gradient descent over a space of probability measures.
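For intuition, a minimal finite-particle SVGD sketch (the paper's analysis is in the population limit) targeting a standard 1-D Gaussian, where $\nabla \log p(x) = -x$; the RBF bandwidth, step size, and particle count are assumptions:

```python
import numpy as np

# Finite-particle SVGD for a standard 1-D Gaussian target.
def rbf_and_grad(x, h=1.0):
    diff = x[:, None] - x[None, :]             # diff[i, j] = x_i - x_j
    k = np.exp(-diff ** 2 / (2 * h ** 2))      # kernel matrix k(x_i, x_j)
    grad_k = diff / h ** 2 * k                 # entry [i, j] = grad_{x_j} k(x_j, x_i)
    return k, grad_k

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=50)              # initial particles
for _ in range(1000):
    k, grad_k = rbf_and_grad(x)
    score = -x                                 # grad log p for N(0, 1)
    phi = (k @ score + grad_k.sum(axis=1)) / len(x)  # attraction + repulsion terms
    x += 0.1 * phi

print(x.mean(), x.std())                       # should approach 0 and 1
```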
    SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments. (arXiv:2206.08851v1 [cs.LG])
Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high-fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms to facilitate comparisons in follow-up research.
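Since the abstract emphasizes standard APIs for benchmarking, here is a hypothetical usage sketch assuming an OpenAI-Gym-style reset/step interface; the stand-in environment class below is an illustrative assumption, not the library's actual API (see the paper and repo for the real one):

```python
import numpy as np

# Hypothetical stand-in with a Gym-like interface; the real library's
# import path and environment signatures may differ.
class ReactorEnv:
    def reset(self):
        return np.zeros(3)                      # initial plant observation
    def step(self, action):
        obs, reward, done, info = np.zeros(3), 0.0, True, {}
        return obs, reward, done, info

env = ReactorEnv()
obs = env.reset()
done = False
while not done:
    action = np.zeros(1)                        # a fixed control input; an RL agent goes here
    obs, reward, done, info = env.step(action)
```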
    Classification of datasets with imputed missing values: does imputation quality matter?. (arXiv:2206.08478v1 [cs.LG])
Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now-complete, imputed samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.
    Backdoor Attacks on Vision Transformers. (arXiv:2206.08477v1 [cs.CV])
    Vision Transformers (ViT) have recently demonstrated exemplary performance on a variety of vision tasks and are being used as an alternative to CNNs. Their design is based on a self-attention mechanism that processes images as a sequence of patches, which is quite different compared to CNNs. Hence it is interesting to study if ViTs are vulnerable to backdoor attacks. Backdoor attacks happen when an attacker poisons a small part of the training data for malicious purposes. The model performance is good on clean test images, but the attacker can manipulate the decision of the model by showing the trigger at test time. To the best of our knowledge, we are the first to show that ViTs are vulnerable to backdoor attacks. We also find an intriguing difference between ViTs and CNNs - interpretation algorithms effectively highlight the trigger on test images for ViTs but not for CNNs. Based on this observation, we propose a test-time image blocking defense for ViTs which reduces the attack success rate by a large margin. Code is available here: https://github.com/UCDvision/backdoor_transformer.git
    RECAPP: Crafting a More Efficient Catalyst for Convex Optimization. (arXiv:2206.08627v1 [math.OC])
    The accelerated proximal point algorithm (APPA), also known as "Catalyst", is a well-established reduction from convex optimization to approximate proximal point computation (i.e., regularized minimization). This reduction is conceptually elegant and yields strong convergence rate guarantees. However, these rates feature an extraneous logarithmic term arising from the need to compute each proximal point to high accuracy. In this work, we propose a novel Relaxed Error Criterion for Accelerated Proximal Point (RECAPP) that eliminates the need for high accuracy subproblem solutions. We apply RECAPP to two canonical problems: finite-sum and max-structured minimization. For finite-sum problems, we match the best known complexity, previously obtained by carefully-designed problem-specific algorithms. For minimizing $\max_y f(x,y)$ where $f$ is convex in $x$ and strongly-concave in $y$, we improve on the best known (Catalyst-based) bound by a logarithmic factor.
    Active Fairness Auditing. (arXiv:2206.08450v1 [cs.LG])
    The fast spreading adoption of machine learning (ML) by companies across industries poses significant regulatory challenges. One such challenge is scalability: how can regulatory bodies efficiently audit these ML models, ensuring that they are fair? In this paper, we initiate the study of query-based auditing algorithms that can estimate the demographic parity of ML models in a query-efficient manner. We propose an optimal deterministic algorithm, as well as a practical randomized, oracle-efficient algorithm with comparable guarantees. Furthermore, we make inroads into understanding the optimal query complexity of randomized active fairness estimation algorithms. Our first exploration of active fairness estimation aims to put AI governance on firmer theoretical foundations.
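For concreteness, a minimal sketch of the quantity being audited, demographic parity, computed exactly from model outputs and group labels (the paper's contribution is estimating this gap with few queries; the toy data below is an assumption):

```python
import numpy as np

# Demographic parity difference: the gap in positive-prediction rates
# between two demographic groups.
def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    rate_a = y_pred[group == 0].mean()   # P(y_hat = 1 | group 0)
    rate_b = y_pred[group == 1].mean()   # P(y_hat = 1 | group 1)
    return abs(rate_a - rate_b)

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
y_pred = (rng.random(1000) < 0.4 + 0.1 * group).astype(int)  # biased toy model
print(demographic_parity_gap(y_pred, group))                 # roughly 0.1
```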
    SOS: Score-based Oversampling for Tabular Data. (arXiv:2206.08555v1 [cs.LG])
Score-based generative models (SGMs) are a recent breakthrough in generating fake images. SGMs are known to surpass other generative models, e.g., generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by their success, in this work we fully customize them for generating fake tabular data. In particular, we are interested in oversampling minor classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based tabular data oversampling method. Firstly, we re-design our own score network since we have to process tabular data. Secondly, we propose two options for our generation method: the former is equivalent to a style transfer for tabular data and the latter uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method, which further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms other oversampling methods in all cases.
    PRANC: Pseudo RAndom Networks for Compacting deep models. (arXiv:2206.08464v1 [cs.LG])
Communication becomes a bottleneck in various distributed Machine Learning settings. Here, we propose a novel training framework that leads to highly efficient communication of models between agents. In short, we train our network to be a linear combination of many pseudo-randomly generated frozen models. For communication, the source agent transmits only the 'seed' scalar used to generate the pseudo-random 'basis' networks along with the learned linear mixture coefficients. Our method, denoted as PRANC, learns almost $100\times$ fewer parameters than a deep model and still performs well on several datasets and architectures. PRANC enables 1) efficient communication of models between agents, 2) efficient model storage, and 3) accelerated inference by generating layer-wise weights on the fly. We test PRANC on CIFAR-10, CIFAR-100, tinyImageNet, and ImageNet-100 with various architectures like AlexNet, LeNet, ResNet18, ResNet20, and ResNet56 and demonstrate a massive reduction in the number of parameters while providing satisfactory performance on these benchmark datasets. The code is available at https://github.com/UCDvision/PRANC
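A minimal sketch of the communication idea: only seeds and mixture coefficients are transmitted, and the receiver regenerates the frozen pseudo-random bases locally. Vector sizes and the Gaussian basis choice are illustrative assumptions, not PRANC's exact construction:

```python
import numpy as np

# The model's flat weight vector is a learned linear combination of frozen
# pseudo-random basis vectors, each fully determined by a scalar seed.
n_params, n_basis = 10_000, 100
seeds = np.arange(n_basis)                                # transmitted: 100 ints
alpha = np.random.default_rng(42).normal(size=n_basis)    # transmitted: 100 floats

def reconstruct(seeds, alpha, n_params):
    w = np.zeros(n_params)
    for s, a in zip(seeds, alpha):
        # Bases are regenerated on the fly from their seeds, never stored.
        basis = np.random.default_rng(int(s)).normal(size=n_params)
        w += a * basis
    return w

w = reconstruct(seeds, alpha, n_params)   # reshape into layer weights as needed
```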
    Sheaf Neural Networks with Connection Laplacians. (arXiv:2206.08702v1 [cs.LG])
    A Sheaf Neural Network (SNN) is a type of Graph Neural Network (GNN) that operates on a sheaf, an object that equips a graph with vector spaces over its nodes and edges and linear maps between these spaces. SNNs have been shown to have useful theoretical properties that help tackle issues arising from heterophily and over-smoothing. One complication intrinsic to these models is finding a good sheaf for the task to be solved. Previous works proposed two diametrically opposed approaches: manually constructing the sheaf based on domain knowledge and learning the sheaf end-to-end using gradient-based methods. However, domain knowledge is often insufficient, while learning a sheaf could lead to overfitting and significant computational overhead. In this work, we propose a novel way of computing sheaves drawing inspiration from Riemannian geometry: we leverage the manifold assumption to compute manifold-and-graph-aware orthogonal maps, which optimally align the tangent spaces of neighbouring data points. We show that this approach achieves promising results with less computational overhead when compared to previous SNN models. Overall, this work provides an interesting connection between algebraic topology and differential geometry, and we hope that it will spark future research in this direction.
    Recursive Neural Programs: Variational Learning of Image Grammars and Part-Whole Hierarchies. (arXiv:2206.08462v1 [cs.CV])
Human vision involves parsing and representing objects and scenes using structured representations based on part-whole hierarchies. Computer vision and machine learning researchers have recently sought to emulate this capability using capsule networks, reference frames and active predictive coding, but a generative model formulation has been lacking. We introduce Recursive Neural Programs (RNPs), which, to our knowledge, is the first neural generative model to address the part-whole hierarchy learning problem. RNPs model images as hierarchical trees of probabilistic sensory-motor programs that recursively reuse learned sensory-motor primitives to model an image within different reference frames, forming recursive image grammars. We express RNPs as structured variational autoencoders (sVAEs) for inference and sampling, and demonstrate parts-based parsing, sampling and one-shot transfer learning on the MNIST, Omniglot and Fashion-MNIST datasets, showing the model's expressive power. Our results show that RNPs provide an intuitive and explainable way of composing objects and scenes, allowing rich compositionality and intuitive interpretations of objects in terms of part-whole hierarchies.
    Thompson Sampling for Robust Transfer in Multi-Task Bandits. (arXiv:2206.08556v1 [cs.LG])
    We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.  ( 2 min )
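For readers unfamiliar with the baseline, a sketch of standard single-task Thompson sampling for Bernoulli bandits, the building block the paper extends to robust multi-task transfer; arm means and horizon are toy assumptions:

```python
import numpy as np

# Standard Thompson sampling with Beta posteriors for Bernoulli arms.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
alpha = np.ones(3)   # Beta posterior: successes + 1
beta = np.ones(3)    # Beta posterior: failures + 1

for t in range(2000):
    theta = rng.beta(alpha, beta)        # sample a plausible mean per arm
    arm = int(np.argmax(theta))          # play the arm with the best sample
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))  # best arm should dominate
```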
    Generalised Policy Improvement with Geometric Policy Composition. (arXiv:2206.08736v1 [stat.ML])
    We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.  ( 2 min )
    Discovery of the Content and Engagement with the Content. (arXiv:2206.08786v1 [cs.IR])
In the second half of the 20th century, Parliament allowed broadcasters to transmit radio and eventually television coverage of debates and meetings of select committees. More recently, in an effort to further improve transparency and citizen engagement, the UK Parliament started publishing videos of these debates and meetings itself, and tweeting details of debates as they happened. In this paper, we attempt to characterise how people engage with video data of Parliamentary debates by using more than two years of Google Analytics data around these videos. We analyse the patterns of engagement - how do they land on a particular video? How do they hear about this video, i.e., what is the (HTTP) referrer website that led to the user clicking on the video? Once a user lands on a video, how do they engage with it? For how long is the video played? What is the next destination? etc. Answering these questions is an important first step towards understanding why and how people use Parliamentary videos, and therefore, how the video delivery platform should be adapted and personalised for the needs of the citizens of the country. Taking inspiration from An, Kwak, and Jansen (2017), we employ Non-Negative Matrix Factorization (NMF) (Lee and Seung, 1999) on the video views matrix to identify different archetypes of users. A deeper examination of the archetypes we find reveals that they are primarily distinguished by how they land on the video page: Search (i.e., through a search engine), Referral (i.e., from other Parliamentary websites), Direct (i.e., through a direct link, which is embedded on another website), Social (i.e., through a social platform such as Facebook or Twitter) and Others.  ( 3 min )
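A minimal sketch of the archetype-discovery step, using scikit-learn's NMF on a (users x videos) view-count matrix; the random matrix and the choice of five components are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import NMF

# Factorize non-negative view counts into user loadings W and archetype
# profiles H; each user is then assigned to a dominant archetype.
rng = np.random.default_rng(0)
views = rng.poisson(1.0, size=(500, 40)).astype(float)  # users x videos

model = NMF(n_components=5, init="nndsvda", max_iter=500)
W = model.fit_transform(views)   # each row: a user's mix of archetypes
H = model.components_            # each row: an archetype's viewing profile
archetype = W.argmax(axis=1)     # dominant archetype per user
```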
    NU-Wave 2: A General Neural Audio Upsampling Model for Various Sampling Rates. (arXiv:2206.08545v1 [eess.AS])
    Conventionally, audio super-resolution models fixed the initial and the target sampling rates, which necessitate the model to be trained for each pair of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio upsampling that enables the generation of 48 kHz audio signals from inputs of various sampling rates with a single model. Based on the architecture of NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate harmonics to resolve the main failure modes of NU-Wave, and incorporates bandwidth spectral feature transform (BSFT) to condition the bandwidths of inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of input while requiring fewer parameters than other models. The official code and the audio samples are available at https://mindslab-ai.github.io/nuwave2.  ( 2 min )
    The Sensorium competition on predicting large-scale mouse primary visual cortex activity. (arXiv:2206.08666v1 [q-bio.NC])
The neural underpinning of the biological visual system is challenging to study experimentally, in particular as the neuronal activity becomes increasingly nonlinear with respect to visual input. Artificial neural networks (ANNs) can serve a variety of goals for improving our understanding of this complex system, not only serving as predictive digital twins of sensory cortex for novel hypothesis generation in silico, but also incorporating bio-inspired architectural motifs to progressively bridge the gap between biological and machine vision. The mouse has recently emerged as a popular model system to study visual information processing, but no standardized large-scale benchmark to identify state-of-the-art models of the mouse visual system has been established. To fill this gap, we propose the Sensorium benchmark competition. We collected a large-scale dataset from mouse primary visual cortex containing the responses of more than 28,000 neurons across seven mice stimulated with thousands of natural images, together with simultaneous behavioral measurements that include running speed, pupil dilation, and eye movements. The benchmark challenge will rank models based on predictive performance for neuronal responses on a held-out test set, and includes two tracks for model input limited to either stimulus only (Sensorium) or stimulus plus behavior (Sensorium+). We provide a starting kit to lower the barrier for entry, including tutorials, pre-trained baseline models, and APIs with one-line commands for data loading and submission. We would like to see this as a starting point for regular challenges and data releases, and as a standard tool for measuring progress in large-scale neural system identification models of the mouse visual system and beyond.  ( 3 min )
    Automatic Correction of Human Translations. (arXiv:2206.08593v1 [cs.CL])
    We introduce translation error correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translations (MT) have long motivated systems for improving translations post-hoc with automatic post-editing. In contrast, little attention has been devoted to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors that machines would be well-suited to assist with, from typos to inconsistencies in translation conventions. To investigate this, we build and release the Aced corpus with three TEC datasets. We show that human errors in TEC exhibit a more diverse range of errors and far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models that are specialized to correct human errors. We show that pre-training instead on synthetic errors based on human errors improves TEC F-score by as much as 5.1 points. We conducted a human-in-the-loop user study with nine professional translation editors and found that the assistance of our TEC system led them to produce significantly higher quality revised translations.  ( 2 min )
    Powershap: A Power-full Shapley Feature Selection Method. (arXiv:2206.08394v1 [cs.LG])
Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performances, they suffer from a large computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Alternatively, filter methods are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms other filter methods with predictive performances on par with wrapper methods while being significantly faster, often even reaching half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, allowing it to be used without any configuration.  ( 3 min )
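A rough sketch of powershap's core assumption, not the package's actual API: append a known random feature, fit a model, and keep features whose mean absolute Shapley value exceeds the random feature's (powershap additionally wraps this comparison in statistical hypothesis testing and power calculations):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Toy data with a deliberately uninformative random column appended last.
X, y = make_regression(n_samples=500, n_features=8, n_informative=4, random_state=0)
X = np.column_stack([X, np.random.default_rng(0).normal(size=len(X))])

model = RandomForestRegressor(random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)   # (n_samples, n_features)
impact = np.abs(shap_values).mean(axis=0)                # mean |SHAP| per feature
keep = impact[:-1] > impact[-1]                          # beat the random feature?
print("selected feature mask:", keep)
```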
    Accelerating numerical methods by gradient-based meta-solving. (arXiv:2206.08594v1 [math.NA])
    In science and engineering applications, it is often required to solve similar computational problems repeatedly. In such cases, we can utilize the data from previously solved problem instances to improve the efficiency of finding subsequent solutions. This offers a unique opportunity to combine machine learning (in particular, meta-learning) and scientific computing. To date, a variety of such domain-specific methods have been proposed in the literature, but a generic approach for designing these methods remains under-explored. In this paper, we tackle this issue by formulating a general framework to describe these problems, and propose a gradient-based algorithm to solve them in a unified way. As an illustration of this approach, we study the adaptive generation of parameters for iterative solvers to accelerate the solution of differential equations. We demonstrate the performance and versatility of our method through theoretical analysis and numerical experiments, including applications to incompressible flow simulations and an inverse problem of parameter estimation.  ( 2 min )
    Modeling Structure with Undirected Neural Networks. (arXiv:2202.03760v2 [cs.LG] UPDATED)
    Neural networks are powerful function estimators, leading to their status as a paradigm of choice for modeling structured data. However, unlike other structured representations that emphasize the modularity of the problem -- e.g., factor graphs -- neural networks are usually monolithic mappings from inputs to outputs, with a fixed computation order. This limitation prevents them from capturing different directions of computation and interaction between the modeled variables. In this paper, we combine the representational strengths of factor graphs and of neural networks, proposing undirected neural networks (UNNs): a flexible framework for specifying computations that can be performed in any order. For particular choices, our proposed models subsume and extend many existing architectures: feed-forward, recurrent, self-attention networks, auto-encoders, and networks with implicit layers. We demonstrate the effectiveness of undirected neural architectures, both unstructured and structured, on a range of tasks: tree-constrained dependency parsing, convolutional image classification, and sequence completion with attention. By varying the computation order, we show how a single UNN can be used both as a classifier and a prototype generator, and how it can fill in missing parts of an input sequence, making them a promising field for further research.
    How Powerful are Spectral Graph Neural Networks. (arXiv:2205.11172v2 [cs.LG] UPDATED)
A spectral graph neural network is a kind of Graph Neural Network (GNN) based on graph signal filters. Models able to learn arbitrary spectral filters have emerged recently. However, few works analyze the expressive power of spectral GNNs. This paper studies spectral GNNs' expressive power theoretically. We first prove that even spectral GNNs without nonlinearity can produce arbitrary graph signals and give two conditions for reaching universality. They are: 1) no multiple eigenvalues of the graph Laplacian, and 2) no missing frequency components in node features. We also establish a connection between the expressive power of spectral GNNs and Graph Isomorphism (GI) testing, the latter of which is often used to characterize spatial GNNs' expressive power. Moreover, we study the difference in empirical performance among different spectral GNNs with the same expressive power from an optimization perspective, and motivate the use of an orthogonal basis whose weight function corresponds to the graph signal density in the spectrum. Inspired by the analysis, we propose JacobiConv, which uses the Jacobi basis due to its orthogonality and flexibility to adapt to a wide range of weight functions. JacobiConv forgoes nonlinearity while outperforming all baselines on both synthetic and real-world datasets.
    Graph Neural Networks for Multimodal Single-Cell Data Integration. (arXiv:2203.01884v2 [cs.LG] UPDATED)
Recent advances in multimodal single-cell technologies have enabled simultaneous acquisitions of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into the downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: modality prediction, modality matching and joint embedding. In this work, we present a general Graph Neural Network framework, scMoGNN, to tackle these three tasks and show that scMoGNN demonstrates superior results in all three tasks compared with the state-of-the-art and conventional approaches. Our method is an official winner in the overall ranking of the Modality prediction task of the NeurIPS 2021 Competition (https://openproblems.bio/neurips_2021/), and all implementations of our methods have been integrated into the DANCE package (https://github.com/OmicsML/dance).
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v2 [stat.ME] UPDATED)
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. Forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v2 [cs.LG] UPDATED)
    In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.
    CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer. (arXiv:2206.08883v1 [cs.CV])
Transformers have achieved great success in learning vision and language representations, which generalize across various downstream tasks. In visual control, learning transferable state representations that can transfer between different control tasks is important to reduce the training sample size. However, porting Transformers to sample-efficient visual control remains a challenging and unsolved problem. To this end, we propose a novel Control Transformer (CtrlFormer) possessing many appealing benefits that prior work does not have. Firstly, CtrlFormer jointly learns self-attention mechanisms between visual tokens and policy tokens among different control tasks, so that multitask representations can be learned and transferred without catastrophic forgetting. Secondly, we carefully design a contrastive reinforcement learning paradigm to train CtrlFormer, enabling it to achieve the high sample efficiency that is important in control problems. For example, on the DMControl benchmark, unlike recent advanced methods that fail by producing a zero score in the "Cartpole" task after transfer learning with 100k samples, CtrlFormer achieves a state-of-the-art score with only 100k samples while maintaining the performance on previous tasks. The code and models are released on our project homepage.
    Generalized Frank-Wolfe Algorithm for Bilevel Optimization. (arXiv:2206.08868v1 [math.OC])
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.
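As background, a sketch of the classic Frank-Wolfe method that this paper generalizes: minimize a smooth convex function over a compact convex set using only a linear minimization oracle, here the probability simplex, where the oracle returns a vertex; the quadratic objective is a toy assumption:

```python
import numpy as np

# Vanilla Frank-Wolfe for min 0.5*||Ax - b||^2 over the probability simplex.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
f_grad = lambda x: A.T @ (A @ x - b)     # gradient of 0.5*||Ax - b||^2

x = np.ones(5) / 5                       # start at the simplex center
for t in range(200):
    g = f_grad(x)
    s = np.zeros(5)
    s[np.argmin(g)] = 1.0                # LMO over the simplex: the best vertex
    gamma = 2.0 / (t + 2)                # standard step-size schedule
    x = (1 - gamma) * x + gamma * s      # convex combination stays feasible

print("objective:", 0.5 * np.sum((A @ x - b) ** 2))
```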
    Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification. (arXiv:2002.10061v3 [cs.LG] UPDATED)
The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks. Large efforts have been made to choose the appropriate size, because it has a huge influence on performance and differs significantly for each dataset. In this paper, we propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule. In particular, it is a set of kernel sizes composed of multiple prime numbers, chosen according to the length of the time series, that can efficiently cover the best RF size across different datasets. Experimental results show that models with the OS-block achieve performance similar to models with the searched optimal RF size, and thanks to this strong ability to capture the optimal RF size, simple 1D-CNN models with the OS-block achieve state-of-the-art performance on four time series benchmarks, including both univariate and multivariate data from multiple domains. Comprehensive analysis and discussions shed light on why the OS-block can capture optimal RF sizes across different datasets. Code available at https://github.com/Wensi-Tang/OS-CNN
    Avoid Overfitting User Specific Information in Federated Keyword Spotting. (arXiv:2206.08864v1 [cs.LG])
    Keyword spotting (KWS) aims to discriminate a specific wake-up word from other signals precisely and efficiently for different users. Recent works utilize various deep networks to train KWS models with all users' speech data centralized without considering data privacy. Federated KWS (FedKWS) could serve as a solution without directly sharing users' data. However, the small amount of data, different user habits, and various accents could lead to fatal problems, e.g., overfitting or weight divergence. Hence, we propose several strategies to encourage the model not to overfit user-specific information in FedKWS. Specifically, we first propose an adversarial learning strategy, which updates the downloaded global model against an overfitted local model and explicitly encourages the global model to capture user-invariant information. Furthermore, we propose an adaptive local training strategy, letting clients with more training data and more uniform class distributions undertake more local update steps. Equivalently, this strategy could weaken the negative impacts of those users whose data is less qualified. Our proposed FedKWS-UI could explicitly and implicitly learn user-invariant information in FedKWS. Abundant experimental results on federated Google Speech Commands verify the effectiveness of FedKWS-UI.
    A Survey of Sound Source Localization with Deep Learning Methods. (arXiv:2109.03465v3 [cs.SD] UPDATED)
    This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.
    Evaluating the Impact of Source Code Parsers on ML4SE Models. (arXiv:2206.08713v1 [cs.SE])
As researchers and practitioners apply Machine Learning to increasingly more software engineering problems, the approaches they use become more sophisticated. Many modern approaches utilize internal code structure in the form of an abstract syntax tree (AST) or its extensions: path-based representations, or complex graphs combining the AST with additional edges. Even though ASTs can be extracted from code with different parsers, the impact of choosing a parser on final model quality remains unstudied. Moreover, researchers often omit the exact details of extracting particular code representations. In this work, we evaluate two models, namely Code2Seq and TreeLSTM, on the method name prediction task, backed by eight different parsers for the Java language. To unify the process of data preparation with different parsers, we develop SuperParser, a multi-language parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluation of ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers for both models turns out to be significant. Finally, we discuss other features of the parsers that researchers and practitioners should take into account when selecting a parser, along with their impact on the models' quality. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
    Dropout Prediction Uncertainty Estimation Using Neuron Activation Strength. (arXiv:2110.06435v3 [cs.LG] UPDATED)
Dropout has been commonly used to quantify prediction uncertainty, i.e., the variations of model predictions on a given input example. However, using dropout in practice can be expensive as it requires running dropout inference many times. In this paper, we study how to estimate dropout prediction uncertainty in a resource-efficient manner. We demonstrate that we can use neuron activation strengths to estimate dropout prediction uncertainty under different dropout settings and on a variety of tasks using three large datasets, MovieLens, Criteo, and EMNIST. Our approach provides an inference-once method to estimate dropout prediction uncertainty as a cheap auxiliary task. We also demonstrate that using activation features from a subset of the neural network layers can be sufficient to achieve uncertainty estimation performance almost comparable to that of using activation features from all layers, thus reducing resources even further for uncertainty estimation.
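For context, a sketch of the expensive baseline being replaced: Monte Carlo dropout, which estimates uncertainty by running many stochastic forward passes with dropout kept active; the model size and number of passes are assumptions:

```python
import torch
import torch.nn as nn

# MC dropout: many stochastic forward passes, each with a different
# dropout mask, whose spread estimates prediction uncertainty.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
x = torch.randn(8, 16)

model.train()                       # keep dropout active at inference time
with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(50)])  # 50 dropout passes

mean = preds.mean(dim=0)            # point prediction
uncertainty = preds.std(dim=0)      # per-example dropout uncertainty
```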
    Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization. (arXiv:2206.08573v1 [math.OC])
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $\min_{\mathbf{x}}\max_{\mathbf{y}}~F(\mathbf{x}) + H(\mathbf{x},\mathbf{y}) - G(\mathbf{y})$, where one has access to stochastic first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. Building upon standard stochastic extragradient analysis for variational inequalities, we present a stochastic accelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragradient and Nesterov's acceleration in general stochastic settings. This algorithm leverages scheduled restarting to admit a fine-grained nonasymptotic convergence rate that matches known lower bounds by both Ibrahim et al. (2020) and Zhang et al. (2021) in their corresponding settings, plus an additional statistical error term for bounded stochastic noise that is optimal up to a constant prefactor. This is the first result that achieves such a relatively mature characterization of optimality in saddle-point optimization.
    Leveraging Uncertainty in Deep Learning for Pancreatic Adenocarcinoma Grading. (arXiv:2206.08787v1 [eess.IV])
Pancreatic cancers have one of the worst prognoses compared to other cancers, as they are often diagnosed only after the cancer has progressed to its later stages. The current manual histological grading for diagnosing pancreatic adenocarcinomas is time-consuming and often results in misdiagnosis. In digital pathology, AI-based cancer grading must be extremely accurate in prediction and uncertainty quantification to improve reliability and explainability, which are essential for gaining clinicians' trust in the technology. We present Bayesian Convolutional Neural Networks for automated pancreatic cancer grading from MGG- and HE-stained images, estimating the uncertainty in model predictions. We show that the estimated uncertainty correlates with prediction error. Specifically, it is useful for setting the acceptance threshold using a metric that weighs the classification accuracy-reject trade-off and misclassification cost, controlled by hyperparameters, and can be employed in clinical settings.
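A minimal sketch of the acceptance-threshold idea on synthetic stand-in data: reject predictions whose estimated uncertainty exceeds a threshold and report accuracy on the accepted subset; the uncertainty model and threshold values below are assumptions:

```python
import numpy as np

# Accuracy-reject trade-off: sweep a threshold on estimated uncertainty
# and report accuracy over accepted predictions plus the rejection rate.
def accuracy_reject(y_true, y_pred, uncertainty, threshold):
    accept = uncertainty <= threshold
    acc = (y_pred[accept] == y_true[accept]).mean() if accept.any() else float("nan")
    return acc, 1.0 - accept.mean()

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, 200)
y_pred = y_true.copy(); y_pred[:40] = rng.integers(0, 3, 40)   # inject some errors
uncertainty = rng.random(200) + 0.5 * (y_pred != y_true)       # errors more uncertain

for thr in (0.5, 1.0, 1.5):
    print(thr, accuracy_reject(y_true, y_pred, uncertainty, thr))
```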
    Truly Unordered Probabilistic Rule Sets for Multi-class Classification. (arXiv:2206.08804v1 [cs.LG])
    Rule set learning has long been studied and has recently been frequently revisited due to the need for interpretable models. Still, existing methods have several shortcomings: 1) most recent methods require a binary feature matrix as input, learning rules directly from numeric variables is understudied; 2) existing methods impose orders among rules, either explicitly or implicitly, which harms interpretability; and 3) currently no method exists for learning probabilistic rule sets for multi-class target variables (there is only a method for probabilistic rule lists). We propose TURS, for Truly Unordered Rule Sets, which addresses these shortcomings. We first formalise the problem of learning truly unordered rule sets. To resolve conflicts caused by overlapping rules, i.e., instances covered by multiple rules, we propose a novel approach that exploits the probabilistic properties of our rule sets. We next develop a two-phase heuristic algorithm that learns rule sets by carefully growing rules. An important innovation is that we use a surrogate score to take the global potential of the rule set into account when learning a local rule. Finally, we empirically demonstrate that, compared to non-probabilistic and (explicitly or implicitly) ordered state-of-the-art methods, our method learns rule sets that not only have better interpretability (i.e., they are smaller and truly unordered), but also better predictive performance.
    FedNew: A Communication-Efficient and Privacy-Preserving Newton-Type Method for Federated Learning. (arXiv:2206.08829v1 [cs.LG])
Newton-type methods are popular in federated learning due to their fast convergence. Still, they suffer from two main issues: low communication efficiency and low privacy, due to the requirement of sending Hessian information from clients to the parameter server (PS). In this work, we introduce a novel framework called FedNew in which there is no need to transmit Hessian information from clients to the PS, hence resolving this bottleneck and improving communication efficiency. In addition, FedNew hides the gradient information, resulting in a privacy-preserving approach compared to the existing state-of-the-art. The core novel idea in FedNew is to introduce a two-level framework that alternates between updating the inverse Hessian-gradient product using only one alternating direction method of multipliers (ADMM) step and performing the global model update using Newton's method. Though only one ADMM pass is used to approximate the inverse Hessian-gradient product at each iteration, we develop a novel theoretical approach to show the convergent behavior of FedNew for convex problems. Additionally, a significant reduction in communication overhead is achieved by utilizing stochastic quantization. Numerical results using real datasets show the superiority of FedNew compared to existing methods in terms of communication costs.
    On Efficient Real-Time Semantic Segmentation: A Survey. (arXiv:2206.08605v1 [cs.CV])
    Semantic segmentation is the problem of assigning a class label to every pixel in an image, and is an important component of an autonomous vehicle vision stack for facilitating scene understanding and object detection. However, many of the top performing semantic segmentation models are extremely complex and cumbersome, and as such are not suited to deployment onboard autonomous vehicle platforms where computational resources are limited and low-latency operation is a vital requirement. In this survey, we take a thorough look at the works that aim to address this misalignment with more compact and efficient models capable of deployment on low-memory embedded systems while meeting the constraint of real-time inference. We discuss several of the most prominent works in the field, placing them within a taxonomy based on their major contributions, and finally we evaluate the inference speed of the discussed models under consistent hardware and software setups that represent a typical research environment with high-end GPU and a realistic deployed scenario using low-memory embedded GPU hardware. Our experimental results demonstrate that many works are capable of real-time performance on resource-constrained hardware, while illustrating the consistent trade-off between latency and accuracy.
    Learning Generic Lung Ultrasound Biomarkers for Decoupling Feature Extraction from Downstream Tasks. (arXiv:2206.08398v1 [eess.IV])
Contemporary artificial neural networks (ANNs) are trained end-to-end, jointly learning both features and classifiers for the task of interest. Though enormously effective, this paradigm imposes significant costs in assembling annotated task-specific datasets and training large-scale networks. We propose to decouple feature learning from downstream lung ultrasound tasks by introducing an auxiliary pre-task of visual biomarker classification. We demonstrate that one can learn an informative, concise, and interpretable feature space from ultrasound videos by training models to predict biomarker labels. Notably, biomarker feature extractors can be trained from data annotated with weak video-scale supervision. These features can be used by a variety of downstream expert models targeted at diverse clinical tasks (diagnosis, lung severity, S/F ratio). Crucially, task-specific expert models are comparable in accuracy to end-to-end models directly trained for such target tasks, while being significantly cheaper to train.
    Revisiting Self-Distillation. (arXiv:2206.08491v1 [cs.LG])
    Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), often being used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self) distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
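For reference, a sketch of the standard distillation objective used in self-distillation, where the student matches the teacher's temperature-softened outputs alongside the usual label loss; the temperature and mixing weight are conventional assumptions:

```python
import torch
import torch.nn.functional as F

# Standard distillation loss: KL between temperature-softened student and
# teacher distributions, mixed with the hard-label cross-entropy.
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)             # from the frozen teacher
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In self-distillation, the teacher and student share the same architecture, and the teacher is simply the previously trained copy of the model.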
    Bootstrapped Transformer for Offline Reinforcement Learning. (arXiv:2206.08569v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which could be harmful to training sequence generation models yet has not drawn enough attention in the previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data and the revealed characteristics may shed some light on offline RL training. The codes are available at https://seqml.github.io/bootorl.
    tinySNN: Towards Memory- and Energy-Efficient Spiking Neural Networks. (arXiv:2206.08656v1 [cs.NE])
    Larger Spiking Neural Network (SNN) models are typically favorable as they can offer higher accuracy. However, employing such models on the resource- and energy-constrained embedded platforms is inefficient. Towards this, we present a tinySNN framework that optimizes the memory and energy requirements of SNN processing in both the training and inference phases, while keeping the accuracy high. It is achieved by reducing the SNN operations, improving the learning quality, quantizing the SNN parameters, and selecting the appropriate SNN model. Furthermore, our tinySNN quantizes different SNN parameters (i.e., weights and neuron parameters) to maximize the compression while exploring different combinations of quantization schemes, precision levels, and rounding schemes to find the model that provides acceptable accuracy. The experimental results demonstrate that our tinySNN significantly reduces the memory footprint and the energy consumption of SNNs without accuracy loss as compared to the baseline network. Therefore, our tinySNN effectively compresses the given SNN model to achieve high accuracy in a memory- and energy-efficient manner, hence enabling the employment of SNNs for the resource- and energy-constrained embedded applications.
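A minimal sketch of one ingredient mentioned here, uniform weight quantization with round-to-nearest at a chosen precision; the symmetric per-tensor scheme and bit widths below are illustrative assumptions, as tinySNN explores combinations of such choices:

```python
import numpy as np

# Symmetric per-tensor uniform quantization with round-to-nearest.
def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                           # dequantized weights

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(quantize(w, bits) - w).mean()
    print(bits, "bits, mean abs error:", err)
```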
    Scalable Differentially Private Clustering via Hierarchically Separated Trees. (arXiv:2206.08646v1 [cs.DS])
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear $\tilde{O}(nkd)$ time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
    Multimodal Attention-based Deep Learning for Alzheimer's Disease Diagnosis. (arXiv:2206.08826v1 [cs.LG])
Alzheimer's Disease (AD) is the most common neurodegenerative disorder with one of the most complex pathogeneses, making effective and clinically actionable decision support difficult. The objective of this study was to develop a novel multimodal deep learning framework to aid medical professionals in AD diagnosis. We present a Multimodal Alzheimer's Disease Diagnosis framework (MADDi) to accurately detect the presence of AD and mild cognitive impairment (MCI) from imaging, genetic, and clinical data. MADDi is novel in that we use cross-modal attention, which captures interactions between modalities - a method not previously explored in this domain. We perform multi-class classification, a challenging task considering the strong similarities between MCI and AD. We compare with previous state-of-the-art models, evaluate the importance of attention, and examine the contribution of each modality to the model's performance. MADDi classifies MCI, AD, and controls with 96.88% accuracy on a held-out test set. When examining the contribution of different attention schemes, we found that the combination of cross-modal attention with self-attention performed the best, while a model with no attention layers performed the worst, with a 7.9% difference in F1-Scores. Our experiments underlined the importance of structured clinical data in helping machine learning models contextualize and interpret the remaining modalities. Extensive ablation studies showed that any multimodal mixture of input features without access to structured clinical information suffered marked performance losses. This study demonstrates the merit of combining multiple input modalities via cross-modal attention to deliver highly accurate AD diagnostic decision support.
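For concreteness, a sketch of cross-modal attention as described, where tokens from one modality query another so the output captures interactions between them; the embedding sizes, sequence lengths, and use of PyTorch's built-in multi-head attention are assumptions, not MADDi's exact architecture:

```python
import torch
import torch.nn as nn

# Cross-modal attention: query comes from one modality, keys and values
# from another, so the output mixes information across modalities.
embed_dim = 64
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

imaging = torch.randn(8, 10, embed_dim)   # batch of imaging token sequences
clinical = torch.randn(8, 5, embed_dim)   # batch of clinical feature tokens

# Imaging tokens attend to clinical tokens (query=imaging, key/value=clinical).
fused, weights = attn(imaging, clinical, clinical)
print(fused.shape)    # (8, 10, 64): imaging tokens enriched with clinical context
```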
    Zero-Shot AutoML with Pretrained Models. (arXiv:2206.08476v1 [cs.LG])
    Given a new dataset D and a low compute budget, how should we choose a pre-trained model to fine-tune to D, and set the fine-tuning hyperparameters without risking overfitting, particularly if D is small? Here, we extend automated machine learning (AutoML) to best make these choices. Our domain-independent meta-learning approach learns a zero-shot surrogate model which, at test time, allows us to select the right deep learning (DL) pipeline (including the pre-trained model and fine-tuning hyperparameters) for a new dataset D given only trivial meta-features describing D, such as image resolution or the number of classes. To train this zero-shot model, we collect performance data for many DL pipelines on a large collection of datasets and meta-train on this data to minimize a pairwise ranking objective. We evaluate our approach under the strict time limit of the vision track of the ChaLearn AutoDL challenge benchmark, clearly outperforming all challenge contenders.
    Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization. (arXiv:2206.08575v1 [cs.LG])
    We focus on the problem of adversarial attacks against models on discrete sequential data in the black-box setting where the attacker aims to craft adversarial examples with limited query access to the victim model. Existing black-box attacks, mostly based on greedy algorithms, find adversarial examples using pre-computed key positions to perturb, which severely limits the search space and might result in suboptimal solutions. To this end, we propose a query-efficient black-box attack using Bayesian optimization, which dynamically computes important positions using an automatic relevance determination (ARD) categorical kernel. We introduce block decomposition and history subsampling techniques to improve the scalability of Bayesian optimization when an input sequence becomes long. Moreover, we develop a post-optimization algorithm that finds adversarial examples with smaller perturbation size. Experiments on natural language and protein classification tasks demonstrate that our method consistently achieves higher attack success rate with significant reduction in query count and modification rate compared to the previous state-of-the-art methods.
    Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons with Heterogeneous Architectures. (arXiv:2206.08369v1 [cs.LG])
    The definition of a Neural Network architecture is one of the most critical and challenging tasks to perform. In this paper, we propose ParallelMLPs, a procedure that enables the training of several independent Multilayer Perceptron Neural Networks with different numbers of neurons and activation functions in parallel, by exploiting the principle of locality and the parallelization capabilities of modern CPUs and GPUs. The core idea of this technique is to use a Modified Matrix Multiplication that replaces an ordinary matrix multiplication with two simple matrix operations that allow separate and independent paths for gradient flowing, which can be used in other scenarios. We have assessed our algorithm on simulated datasets, varying the number of samples, features and batches using 10,000 different models. We achieved a training speedup of 1 to 4 orders of magnitude compared to the sequential approach.
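    A minimal sketch of the batched-training idea (not the authors' exact Modified Matrix Multiplication; the model count, sizes, and the use of torch.bmm are illustrative assumptions): M independent one-hidden-layer MLPs are trained in a single forward/backward pass, with gradients flowing along separate per-model paths.

```python
import torch

M, n, d, h = 16, 128, 10, 32                     # models, samples, features, hidden units
X, y = torch.randn(n, d), torch.randn(n, 1)

# One weight tensor per layer holds all M models; bmm keeps their paths independent.
W1 = (0.1 * torch.randn(M, d, h)).requires_grad_()
b1 = torch.zeros(M, 1, h, requires_grad=True)
W2 = (0.1 * torch.randn(M, h, 1)).requires_grad_()
opt = torch.optim.Adam([W1, b1, W2], lr=1e-2)

Xb = X.unsqueeze(0).expand(M, n, d)              # every model sees the same batch
for step in range(200):
    H = torch.tanh(torch.bmm(Xb, W1) + b1)       # (M, n, h): per-model hidden activations
    pred = torch.bmm(H, W2)                      # (M, n, 1): per-model predictions
    loss = ((pred - y) ** 2).mean(dim=(1, 2)).sum()  # sum of M independent losses
    opt.zero_grad(); loss.backward(); opt.step()
```

    Because the per-model losses are summed, each model's gradient is unaffected by the others, so this is equivalent to (but much faster than) a sequential loop; varying activation functions per model, as in the paper, would additionally require per-model masks or grouped activations.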
    Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency. (arXiv:2206.08496v1 [cs.LG])
    Pre-training on time series poses a unique challenge due to the potential mismatch between pre-training and target domains, such as shifts in temporal dynamics, fast-evolving trends, and long-range and short cyclic effects, which can lead to poor downstream performance. While domain adaptation methods can mitigate these shifts, most methods need examples directly from the target domain, making them suboptimal for pre-training. To address this challenge, methods need to accommodate target domains with different temporal dynamics and be capable of doing so without seeing any target examples during pre-training. Relative to other modalities, in time series, we expect that time-based and frequency-based representations of the same example are located close together in the time-frequency space. To this end, we posit that time-frequency consistency (TF-C) -- embedding a time-based neighborhood of a particular example close to its frequency-based neighborhood and back -- is desirable for pre-training. Motivated by TF-C, we define a decomposable pre-training model, where the self-supervised signal is provided by the distance between time and frequency components, each individually trained by contrastive estimation. We evaluate the new method on eight datasets, including electrodiagnostic testing, human activity recognition, mechanical fault detection, and physical status monitoring. Experiments against eight state-of-the-art methods show that TF-C outperforms baselines by 15.4% (F1 score) on average in one-to-one settings (e.g., fine-tuning an EEG-pretrained model on EMG data) and by up to 8.4% (F1 score) in challenging one-to-many settings, reflecting the breadth of scenarios that arise in real-world applications. The source code and datasets are available at https://anonymous.4open.science/r/TFC-pretraining-6B07.
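    A minimal sketch of the consistency objective under simplifying assumptions (small MLP encoders, FFT magnitude as the frequency view, a single InfoNCE-style term rather than the paper's full decomposable model):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE-style loss: pull z1[i] toward z2[i], push apart mismatched pairs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                     # (B, B) cosine similarity matrix
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

B, T, d = 32, 128, 64
x = torch.randn(B, T)                              # a batch of raw time series
time_enc = torch.nn.Sequential(torch.nn.Linear(T, d), torch.nn.ReLU(), torch.nn.Linear(d, d))
freq_enc = torch.nn.Sequential(torch.nn.Linear(T // 2 + 1, d), torch.nn.ReLU(), torch.nn.Linear(d, d))

z_time = time_enc(x)                               # time-based embedding
z_freq = freq_enc(torch.fft.rfft(x).abs())         # frequency-based embedding
loss = nt_xent(z_time, z_freq)                     # time-frequency consistency term
loss.backward()
```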
    On the Influence of Enforcing Model Identifiability on Learning dynamics of Gaussian Mixture Models. (arXiv:2206.08598v1 [cs.LG])
    A common way to learn and analyze statistical models is to consider operations in the model parameter space. But what happens if we optimize in the parameter space and there is no one-to-one mapping between the parameter space and the underlying statistical model space? Such cases frequently occur for hierarchical models which include statistical mixtures or stochastic neural networks, and these models are said to be singular. Singular models reveal several important and well-studied problems in machine learning like the decrease in convergence speed of learning trajectories due to attractor behaviors. In this work, we propose a relative reparameterization technique of the parameter space, which yields a general method for extracting regular submodels from singular models. Our method enforces model identifiability during training and we study the learning dynamics for gradient descent and expectation maximization for Gaussian Mixture Models (GMMs) under relative parameterization, showing faster experimental convergence and an improved manifold shape of the dynamics around the singularity. Extending the analysis beyond GMMs, we furthermore analyze the Fisher information matrix under relative reparameterization and its influence on the generalization error, and show how the method can be applied to more complex models like deep neural networks.
    GOOD: A Graph Out-of-Distribution Benchmark. (arXiv:2206.08452v1 [cs.LG])
    Out-of-distribution (OOD) learning deals with scenarios in which training and test data follow different distributions. Although general OOD problems have been intensively studied in machine learning, graph OOD is only an emerging area of research. Currently, a systematic benchmark tailored to graph OOD method evaluation is lacking. In this work, we aim at developing an OOD benchmark, known as GOOD, for graphs specifically. We explicitly make distinctions between covariate and concept shifts and design data splits that accurately reflect different shifts. We consider both graph and node prediction tasks as there are key differences when designing shifts. Overall, GOOD contains 8 datasets with 14 domain selections. When combined with covariate, concept, and no shifts, we obtain 42 different splits. We provide performance results on 7 commonly used baseline methods with 10 random runs. This results in 294 dataset-model combinations in total. Our results show significant performance gaps between in-distribution and OOD settings. Our results also shed light on different performance trends between covariate and concept shifts by different methods. Our GOOD benchmark is a growing project and is expected to expand in both quantity and variety of resources as the area develops. The GOOD benchmark can be accessed via $\href{https://github.com/divelab/GOOD/}{\text{https://github.com/divelab/GOOD/}}$.
    ComENet: Towards Complete and Efficient Message Passing for 3D Molecular Graphs. (arXiv:2206.08515v1 [cs.LG])
    Many real-world data can be modeled as 3D graphs, but learning representations that incorporate 3D information completely and efficiently is challenging. Existing methods either use partial 3D information or suffer from excessive computational cost. To incorporate 3D information completely and efficiently, we propose a novel message passing scheme that operates within the 1-hop neighborhood. Our method guarantees full completeness of 3D information on 3D graphs by achieving global and local completeness. Notably, we propose the important rotation angles to fulfill global completeness. Additionally, we show that our method is orders of magnitude faster than prior methods. We provide rigorous proof of completeness and analysis of time complexity for our methods. As molecules are in essence quantum systems, we build the \underline{com}plete and \underline{e}fficient graph neural network (ComENet) by combining quantum-inspired basis functions and the proposed message passing scheme. Experimental results demonstrate the capability and efficiency of ComENet, especially on real-world datasets that are large in both number and size of graphs. Our code is publicly available as part of the DIG library (\url{https://github.com/divelab/DIG}).
    I Know What You Trained Last Summer: A Survey on Stealing Machine Learning Models and Defences. (arXiv:2206.08451v1 [cs.LG])
    Machine Learning-as-a-Service (MLaaS) has become a widespread paradigm, making even the most complex machine learning models available for clients via e.g. a pay-per-query principle. This allows users to avoid time-consuming processes of data collection, hyperparameter tuning, and model training. However, by giving their customers access to the (predictions of their) models, MLaaS providers endanger their intellectual property, such as sensitive training data, optimised hyperparameters, or learned model parameters. Adversaries can create a copy of the model with (almost) identical behavior using the prediction labels only. While many variants of this attack have been described, only scattered defence strategies have been proposed, addressing isolated threats. This raises the necessity for a thorough systematisation of the field of model stealing, to arrive at a comprehensive understanding of why these attacks are successful, and how they could be holistically defended against. We address this by categorising and comparing model stealing attacks, assessing their performance, and exploring corresponding defence techniques in different settings. We propose a taxonomy for attack and defence approaches, and provide guidelines on how to select the right attack or defence strategy based on the goal and available resources. Finally, we analyse which defences are rendered less effective by current attack strategies.
    Residual Bootstrap Exploration for Stochastic Linear Bandit. (arXiv:2202.11474v2 [stat.ML] UPDATED)
    We propose a new bootstrap-based online algorithm for stochastic linear bandit problems. The key idea is to adopt residual bootstrap exploration, in which the agent estimates the next step reward by re-sampling the residuals of the mean reward estimate. Our algorithm, residual bootstrap exploration for stochastic linear bandit (\texttt{LinReBoot}), estimates the linear reward from its re-sampling distribution and pulls the arm with the highest reward estimate. In particular, we contribute a theoretical framework to demystify residual bootstrap-based exploration mechanisms in stochastic linear bandit problems. The key insight is that the strength of bootstrap exploration is based on collaborated optimism between the online-learned model and the re-sampling distribution of residuals. This observation enables us to show that the proposed \texttt{LinReBoot} secures a high-probability $\tilde{O}(d \sqrt{n})$ sub-linear regret under mild conditions. Our experiments support the easy generalizability of the \texttt{ReBoot} principle in the various formulations of linear bandit problems and show the significant computational efficiency of \texttt{LinReBoot}.
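    A rough sketch of one residual-bootstrap round under assumed specifics (ridge least squares, i.i.d. residual resampling, greedy arm choice); the paper's \texttt{LinReBoot} details may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def linreboot_choose(X_hist, r_hist, arms, lam=1.0):
    """Refit the linear reward on bootstrap-resampled residuals, then act greedily."""
    A = X_hist.T @ X_hist + lam * np.eye(X_hist.shape[1])
    theta = np.linalg.solve(A, X_hist.T @ r_hist)      # ridge mean-reward estimate
    resid = r_hist - X_hist @ theta                    # residuals of the fit
    boot = rng.choice(resid, size=resid.shape[0], replace=True)
    theta_boot = np.linalg.solve(A, X_hist.T @ (X_hist @ theta + boot))
    return int(np.argmax(arms @ theta_boot))           # arm with highest bootstrap estimate

arms = np.eye(5)                                       # 5 arms with canonical features
X_hist = arms[rng.integers(0, 5, size=50)]             # past contexts
r_hist = X_hist @ np.array([0.1, 0.9, 0.4, 0.2, 0.5]) + 0.3 * rng.standard_normal(50)
print(linreboot_choose(X_hist, r_hist, arms))
```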
    Universal Hopfield Networks: A General Framework for Single-Shot Associative Memory Models. (arXiv:2202.04557v2 [cs.NE] UPDATED)
    A large number of neural network models of associative memory have been proposed in the literature. These include the classical Hopfield networks (HNs), sparse distributed memories (SDMs), and more recently the modern continuous Hopfield networks (MCHNs), which possess close links with self-attention in machine learning. In this paper, we propose a general framework for understanding the operation of such memory networks as a sequence of three operations: similarity, separation, and projection. We derive all these memory models as instances of our general framework with differing similarity and separation functions. We extend the mathematical framework of Krotov et al. (2020) to express general associative memory models using neural network dynamics with only second-order interactions between neurons, and derive a general energy function that is a Lyapunov function of the dynamics. Finally, using our framework, we empirically investigate the capacity of using different similarity functions for these associative memory models, beyond the dot product similarity measure, and demonstrate empirically that Euclidean or Manhattan distance similarity metrics perform substantially better in practice on many tasks, enabling a more robust retrieval and higher memory capacity than existing models.
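    The similarity-separation-projection decomposition is easy to make concrete; a small sketch (softmax as the separation function, with dot-product vs. Euclidean similarity as in the paper's comparison; sizes and noise level are illustrative):

```python
import numpy as np

def retrieve(query, memories, similarity="euclidean", beta=8.0):
    """Single-shot associative recall: similarity -> separation -> projection."""
    if similarity == "dot":
        sims = memories @ query
    else:
        sims = -np.linalg.norm(memories - query, axis=1)   # Euclidean similarity
    w = np.exp(beta * (sims - sims.max()))                 # softmax separation
    return (w / w.sum()) @ memories                        # project back to pattern space

rng = np.random.default_rng(0)
M = np.sign(rng.standard_normal((50, 100)))                # 50 stored binary patterns
noisy = M[0] * np.where(rng.random(100) < 0.1, -1, 1)      # flip 10% of the bits
print((np.sign(retrieve(noisy, M)) == M[0]).mean())        # fraction of bits recovered
```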
    On the Compression of Neural Networks Using $\ell_0$-Norm Regularization and Weight Pruning. (arXiv:2109.05075v2 [cs.LG] UPDATED)
    Despite the growing availability of high-capacity computational platforms, implementation complexity remains a great concern for the real-world deployment of neural networks. This concern is not exclusively due to the huge costs of state-of-the-art network architectures, but also due to the recent push towards edge intelligence and the use of neural networks in embedded applications. In this context, network compression techniques have been gaining interest due to their ability to reduce deployment costs while keeping inference accuracy at satisfactory levels. The present paper is dedicated to the development of a novel compression scheme for neural networks. To this end, a new $\ell_0$-norm-based regularization approach is first developed, which is capable of inducing strong sparseness in the network during training. Then, targeting the smaller weights of the trained network with pruning techniques, smaller yet highly effective networks can be obtained. The proposed compression scheme also involves the use of $\ell_2$-norm regularization to avoid overfitting as well as fine-tuning to improve the performance of the pruned network. Experimental results are presented aiming to show the effectiveness of the proposed scheme as well as to make comparisons with competing approaches.
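    The paper's $\ell_0$-norm surrogate is not reproduced here; as a stand-in, a sketch of the subsequent prune-and-fine-tune stage using plain magnitude pruning:

```python
import torch

def magnitude_prune(model, sparsity=0.9):
    """Zero out the smallest-magnitude weights; return masks for mask-aware fine-tuning."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:                                    # prune weight matrices only
            k = int(sparsity * p.numel())
            thresh = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > thresh).float()
            p.data *= masks[name]                          # apply the pruning mask
    return masks

model = torch.nn.Sequential(torch.nn.Linear(784, 300), torch.nn.ReLU(),
                            torch.nn.Linear(300, 10))
masks = magnitude_prune(model, sparsity=0.9)
# During fine-tuning, re-apply `p.data *= masks[name]` after every optimizer step so
# pruned weights stay at zero; l2 weight decay guards against overfitting, as in the paper.
```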
    Decision-Focused Learning: Through the Lens of Learning to Rank. (arXiv:2112.03609v4 [cs.LG] UPDATED)
    In recent years, the decision-focused learning framework, also known as predict-and-optimize, has received increasing attention. In this setting, the predictions of a machine learning model are used as estimated cost coefficients in the objective function of a discrete combinatorial optimization problem for decision making. Decision-focused learning proposes to train the ML models, often neural network models, by directly optimizing the quality of decisions made by the optimization solvers. Based on recent work that proposed a noise contrastive estimation loss over a subset of the solution space, we observe that decision-focused learning can more generally be seen as a learning-to-rank problem, where the goal is to learn an objective function that ranks the feasible points correctly. This observation is independent of the optimization method used and of the form of the objective function. We develop pointwise, pairwise and listwise ranking loss functions, which can be differentiated in closed form given a subset of solutions. We empirically investigate the quality of our generic methods compared to existing decision-focused learning approaches, with competitive results. Furthermore, controlling the subset of solutions allows controlling the runtime considerably, with limited effect on regret.
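    A hedged sketch of the pairwise variant (the pointwise and listwise losses follow the same pattern; the margin hinge form and the toy problem are assumptions): the predicted cost vector should rank a cached pool of feasible solutions in the same order as the true costs do.

```python
import torch

def pairwise_ranking_loss(pred_costs, solutions, true_costs, margin=0.1):
    """Hinge loss on all solution pairs, ordered by their true objective values."""
    obj_pred = solutions @ pred_costs                  # predicted objective per solution
    obj_true = solutions @ true_costs                  # true objective per solution
    i, j = torch.triu_indices(len(solutions), len(solutions), offset=1)
    sign = torch.sign(obj_true[i] - obj_true[j])       # ground-truth ordering of each pair
    return torch.relu(margin - sign * (obj_pred[i] - obj_pred[j])).mean()

sols = torch.tensor([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])   # feasible points
c_true = torch.tensor([2., 1., 3.])
c_pred = torch.randn(3, requires_grad=True)            # stands in for an ML model's output
pairwise_ranking_loss(c_pred, sols, c_true).backward() # closed-form, differentiable
```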
    Minimum Noticeable Difference based Adversarial Privacy Preserving Image Generation. (arXiv:2206.08638v1 [cs.CV])
    Deep learning models are found to be vulnerable to adversarial examples, as small perturbations in the input can cause wrong predictions. Most existing work on adversarial image generation tries to achieve attacks against most models, while few efforts go into guaranteeing the perceptual quality of the adversarial examples. High-quality adversarial examples matter for many applications, especially for privacy preservation. In this work, we develop a framework based on the Minimum Noticeable Difference (MND) concept to generate adversarial privacy-preserving images that have minimum perceptual difference from the clean ones but are able to attack deep learning models. To achieve this, an adversarial loss is first proposed so that the adversarial images successfully attack the deep learning models. Then, a perceptual quality-preserving loss is developed by taking the magnitude of perturbation and perturbation-caused structural and gradient changes into account, which aims to preserve high perceptual quality for adversarial image generation. To the best of our knowledge, this is the first work exploring quality-preserving adversarial image generation based on the MND concept for privacy preservation. To evaluate its performance in terms of perceptual quality, deep models for image classification and face recognition are tested with the proposed method and several anchor methods in this work. Extensive experimental results demonstrate that the proposed MND framework is capable of generating adversarial images with remarkably improved performance metrics (e.g., PSNR, SSIM, and MOS) compared with those generated by the anchor methods.
    A Spatio-Temporal Neural Network Forecasting Approach for Emulation of Firefront Models. (arXiv:2206.08523v1 [cs.LG])
    Computational simulations of wildfire spread typically employ empirical rate-of-spread calculations under various conditions (such as terrain, fuel type, weather). Small perturbations in conditions can often lead to significant changes in fire spread (such as speed and direction), necessitating a computationally expensive large set of simulations to quantify uncertainty. Model emulation seeks alternative representations of physical models using machine learning, aiming to provide more efficient and/or simplified surrogate models. We propose a dedicated spatio-temporal neural network based framework for model emulation, able to capture the complex behaviour of fire spread models. The proposed approach can approximate forecasts at fine spatial and temporal resolutions that are often challenging for neural network based approaches. Furthermore, the proposed approach is robust even with small training sets, due to novel data augmentation methods. Empirical experiments show good agreement between simulated and emulated firefronts, with an average Jaccard score of 0.76.
    Distribution Regression with Sliced Wasserstein Kernels. (arXiv:2202.03926v2 [stat.ML] UPDATED)
    The problem of learning functions over spaces of probabilities - or distribution regression - is gaining significant interest in the machine learning community. A key challenge behind this problem is to identify a suitable representation capturing all relevant properties of the underlying functional mapping. A principled approach to distribution regression is provided by kernel mean embeddings, which lift kernel-induced similarity on the input domain to the probability level. This strategy effectively tackles the two-stage sampling nature of the problem, enabling one to derive estimators with strong statistical guarantees, such as universal consistency and excess risk bounds. However, kernel mean embeddings implicitly hinge on the maximum mean discrepancy (MMD), a metric on probabilities, which may fail to capture key geometrical relations between distributions. In contrast, optimal transport (OT) metrics are potentially more appealing. In this work, we propose an OT-based estimator for distribution regression. We build on the Sliced Wasserstein distance to obtain an OT-based representation. We study the theoretical properties of a kernel ridge regression estimator based on such representation, for which we prove universal consistency and excess risk bounds. Preliminary experiments complement our theoretical findings by showing the effectiveness of the proposed approach and compare it with MMD-based estimators.  ( 2 min )
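    A small sketch of the representation (Monte Carlo sliced 2-Wasserstein between equal-size samples; the RBF-kernel use in the comment is one plausible way to plug it into kernel ridge regression, not necessarily the paper's exact estimator):

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    """Monte Carlo sliced W2 distance between two equal-size samples."""
    rng = np.random.default_rng(seed)
    sw2 = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(X.shape[1])
        theta /= np.linalg.norm(theta)                   # random direction on the sphere
        px, py = np.sort(X @ theta), np.sort(Y @ theta)  # 1-D projections
        sw2 += np.mean((px - py) ** 2)                   # closed-form 1-D W2^2
    return np.sqrt(sw2 / n_proj)

# Distribution regression could then use e.g. k(P, Q) = exp(-SW(P, Q)^2 / (2 s^2))
# as the kernel over sampled distributions inside kernel ridge regression.
X, Y = np.random.randn(200, 5), np.random.randn(200, 5) + 0.5
print(sliced_wasserstein(X, Y))
```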
    Deep learning, stochastic gradient descent and diffusion maps. (arXiv:2204.01365v3 [stat.ML] UPDATED)
    Stochastic gradient descent (SGD) is widely used in deep learning due to its computational efficiency, but a complete understanding of why SGD performs so well remains a major challenge. It has been observed empirically that most eigenvalues of the Hessian of the loss functions on the loss landscape of over-parametrized deep neural networks are close to zero, while only a small number of eigenvalues are large. Zero eigenvalues indicate zero diffusion along the corresponding directions. This indicates that the process of minima selection mainly happens in the relatively low-dimensional subspace corresponding to the top eigenvalues of the Hessian. Although the parameter space is very high-dimensional, these findings seem to indicate that the SGD dynamics may mainly live on a low-dimensional manifold. In this paper, we pursue a truly data-driven approach to the problem of getting a potentially deeper understanding of the high-dimensional parameter surface, and in particular, of the landscape traced out by SGD, by analyzing the data generated through SGD, or any other optimizer for that matter, in order to possibly discover (local) low-dimensional representations of the optimization landscape. As our vehicle for the exploration, we use diffusion maps introduced by R. Coifman and coauthors.  ( 2 min )
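    A compact sketch of the vehicle itself (a basic diffusion map over a point cloud, here a random walk standing in for SGD iterates; the kernel bandwidth is an illustrative choice):

```python
import numpy as np

def diffusion_map(points, eps=0.5, n_coords=2):
    """Basic diffusion map: Gaussian affinities -> Markov matrix -> top eigenvectors."""
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / eps)                          # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)           # row-normalized transition matrix
    vals, vecs = np.linalg.eig(P)                  # eigenvalues are real (P ~ symmetric)
    order = np.argsort(-vals.real)
    idx = order[1:n_coords + 1]                    # skip the trivial constant eigenvector
    return vecs[:, idx].real * vals[idx].real      # diffusion coordinates

trajectory = np.cumsum(0.05 * np.random.randn(300, 10), axis=0)  # stand-in for SGD iterates
embedding = diffusion_map(trajectory)              # (300, 2) low-dimensional view of the path
```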
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v2 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first (nearly) matching upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts such as minimum flows and maximum cuts, which we believe to shed new light on this problem.  ( 2 min )
    Structure-preserving GANs. (arXiv:2202.01129v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs), a class of distribution-learning methods based on a two-player game between a generator and a discriminator, can generally be formulated as a minmax problem based on the variational representation of a divergence between the unknown and the generated distributions. We introduce structure-preserving GANs as a data-efficient framework for learning distributions with additional structure such as group symmetry, by developing new variational representations for divergences. Our theory shows that we can reduce the discriminator space to its projection on the invariant discriminator space, using the conditional expectation with respect to the sigma-algebra associated to the underlying structure. In addition, we prove that the discriminator space reduction must be accompanied by a careful design of structured generators, as flawed designs may easily lead to a catastrophic "mode collapse" of the learned distribution. We contextualize our framework by building symmetry-preserving GANs for distributions with intrinsic group symmetry, and demonstrate that both players, namely the equivariant generator and invariant discriminator, play important but distinct roles in the learning process. Empirical experiments and ablation studies across a broad range of data sets, including real-world medical imaging, validate our theory, and show our proposed methods achieve significantly improved sample fidelity and diversity -- almost an order of magnitude measured in Fr\'echet Inception Distance -- especially in the small data regime.  ( 3 min )
    Bayesian Spillover Graphs for Dynamic Networks. (arXiv:2203.01912v2 [stat.ME] UPDATED)
    We present Bayesian Spillover Graphs (BSG), a novel method for learning temporal relationships, identifying critical nodes, and quantifying uncertainty for multi-horizon spillover effects in a dynamic system. BSG leverages both an interpretable framework via forecast error variance decompositions (FEVD) and comprehensive uncertainty quantification via Bayesian time series models to contextualize temporal relationships in terms of systemic risk and prediction variability. The forecast horizon hyperparameter $h$ allows for learning both short-term and equilibrium-state network behaviors. Experiments for identifying source and sink nodes under various graph and error specifications show significant performance gains against state-of-the-art Bayesian Networks and deep-learning baselines. Applications to real-world systems also showcase BSG as an exploratory analysis tool for uncovering indirect spillovers and quantifying systemic risk.  ( 2 min )
    A Theoretical Analysis on Independence-driven Importance Weighting for Covariate-shift Generalization. (arXiv:2111.02355v2 [cs.LG] UPDATED)
    Covariate-shift generalization, a typical case in out-of-distribution (OOD) generalization, requires a good performance on the unknown test distribution, which varies from the accessible training distribution in the form of covariate shift. Recently, independence-driven importance weighting algorithms in stable learning literature have shown empirical effectiveness to deal with covariate-shift generalization on several learning models, including regression algorithms and deep neural networks, while their theoretical analyses are missing. In this paper, we theoretically prove the effectiveness of such algorithms by explaining them as feature selection processes. We first specify a set of variables, named minimal stable variable set, that is the minimal and optimal set of variables to deal with covariate-shift generalization for common loss functions, such as the mean squared loss and binary cross-entropy loss. Afterward, we prove that under ideal conditions, independence-driven importance weighting algorithms could identify the variables in this set. Analysis of asymptotic properties is also provided. These theories are further validated in several synthetic experiments.  ( 2 min )
    Mirror Descent with Relative Smoothness in Measure Spaces, with application to Sinkhorn and EM. (arXiv:2206.08873v1 [math.OC])
    Many problems in machine learning can be formulated as optimizing a convex functional over a space of measures. This paper studies the convergence of the mirror descent algorithm in this infinite-dimensional setting. Defining Bregman divergences through directional derivatives, we derive the convergence of the scheme for relatively smooth and strongly convex pairs of functionals. Applying our result to joint distributions and the Kullback--Leibler (KL) divergence, we show that Sinkhorn's primal iterations for entropic optimal transport in the continuous setting correspond to a mirror descent, and we obtain a new proof of its (sub)linear convergence. We also show that Expectation Maximization (EM) can always formally be written as a mirror descent, and, when optimizing on the latent distribution while fixing the mixtures, we derive sublinear rates of convergence.  ( 2 min )
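    The primal iterations in question are the classic Sinkhorn updates; a minimal sketch (the mirror-descent reading is the paper's contribution; the code below is just the standard algorithm on a toy 1-D cost):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=200):
    """Entropic optimal transport via Sinkhorn's alternating marginal projections."""
    K = np.exp(-C / eps)                           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                          # match the second marginal
        u = a / (K @ v)                            # match the first marginal
    return u[:, None] * K * v[None, :]             # transport plan

n = 50
a = b = np.full(n, 1.0 / n)                        # uniform marginals
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2                 # squared-distance cost
P = sinkhorn(a, b, C)
print(P.sum(axis=1)[:3], P.sum(axis=0)[:3])        # both close to 1/n: marginals satisfied
```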
    Solar Radiation Ramping Events Modeling Using Spatio-temporal Point Processes. (arXiv:2101.11179v2 [stat.AP] UPDATED)
    Modeling and predicting solar events, particularly the solar ramping event, is critical for improving situational awareness for solar power generation systems. It has been acknowledged that weather conditions such as temperature, humidity, and cloud density can significantly impact the emergence and position of solar ramping events. As a result, modeling these events with complex spatio-temporal correlations is highly challenging. To tackle the question, we adopt a novel spatio-temporal categorical point process model, which intuitively and effectively addresses correlation and interaction among ramping events. We demonstrate the interpretability and predictive power of our model on extensive real-data experiments.  ( 2 min )
    Lossy Compression with Gaussian Diffusion. (arXiv:2206.08889v1 [stat.ML])
    We describe a novel lossy compression approach called DiffC which is based on unconditional diffusion generative models. Unlike modern compression schemes which rely on transform coding and quantization to restrict the transmitted information, DiffC relies on the efficient communication of pixels corrupted by Gaussian noise. We implement a proof of concept and find that it works surprisingly well despite the lack of an encoder transform, outperforming the state-of-the-art generative compression method HiFiC on ImageNet 64x64. DiffC only uses a single model to encode and denoise corrupted pixels at arbitrary bitrates. The approach further provides support for progressive coding, that is, decoding from partial bit streams. We perform a rate-distortion analysis to gain a deeper understanding of its performance, providing analytical results for multivariate Gaussian data as well as initial results for general distributions. Furthermore, we show that a flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high bitrates.  ( 2 min )
    Implicit Regularization in Hierarchical Tensor Factorization and Deep Convolutional Neural Networks. (arXiv:2201.11729v2 [cs.LG] UPDATED)
    In the pursuit of explaining implicit regularization in deep learning, prominent focus was given to matrix and tensor factorizations, which correspond to simplified neural networks. It was shown that these models exhibit an implicit tendency towards low matrix and tensor ranks, respectively. Drawing closer to practical deep learning, the current paper theoretically analyzes the implicit regularization in hierarchical tensor factorization, a model equivalent to certain deep convolutional neural networks. Through a dynamical systems lens, we overcome challenges associated with hierarchy, and establish implicit regularization towards low hierarchical tensor rank. This translates to an implicit regularization towards locality for the associated convolutional networks. Inspired by our theory, we design explicit regularization discouraging locality, and demonstrate its ability to improve the performance of modern convolutional networks on non-local tasks, in defiance of conventional wisdom by which architectural changes are needed. Our work highlights the potential of enhancing neural networks via theoretical analysis of their implicit regularization.  ( 2 min )
    You Are the Best Reviewer of Your Own Papers: An Owner-Assisted Scoring Mechanism. (arXiv:2110.14802v2 [cs.LG] UPDATED)
    I consider a setting where reviewers offer very noisy scores for several items for the selection of high-quality ones (e.g., peer review of large conference proceedings), whereas the owner of these items knows the true underlying scores but prefers not to provide this information. To address this withholding of information, in this paper, I introduce the Isotonic Mechanism, a simple and efficient approach to improving imprecise raw scores by leveraging certain information that the owner is incentivized to provide. This mechanism takes the ranking of the items from best to worst provided by the owner as input, in addition to the raw scores provided by the reviewers. It reports the adjusted scores for the items by solving a convex optimization problem. Under certain conditions, I show that the owner's optimal strategy is to honestly report the true ranking of the items to her best knowledge in order to maximize the expected utility. Moreover, I prove that the adjusted scores provided by this owner-assisted mechanism are significantly more accurate than the raw scores provided by the reviewers. This paper concludes with several extensions of the Isotonic Mechanism and some refinements of the mechanism for practical consideration.  ( 3 min )
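    The adjustment step is a monotone least-squares projection; a sketch using scikit-learn's isotonic regression (the input/output conventions and the toy numbers are illustrative):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_adjust(raw_scores, owner_ranking):
    """Project noisy reviewer scores onto the monotone cone implied by the
    owner's best-to-worst ranking (a least-squares isotonic regression)."""
    order = np.asarray(owner_ranking)                 # item indices, best first
    iso = IsotonicRegression(increasing=False)        # scores non-increasing in rank
    adjusted = iso.fit_transform(np.arange(len(order)), raw_scores[order])
    out = np.empty_like(raw_scores, dtype=float)
    out[order] = adjusted
    return out

raw = np.array([6.1, 7.9, 5.0, 7.2])                  # noisy reviewer scores
ranking = [1, 0, 3, 2]                                # owner: item 1 best, then 0, 3, 2
print(isotonic_adjust(raw, ranking))                  # items 0 and 3 get pooled to 6.65
```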
    Smoothing Policies and Safe Policy Gradients. (arXiv:1905.03231v2 [cs.LG] UPDATED)
    Policy Gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a policy gradient algorithm with monotonic improvement guarantees.  ( 2 min )
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v7 [cs.LG] UPDATED)
    Recently, deep neural networks have gained increasing popularity in the field of time series forecasting. A primary reason for their success is their ability to effectively capture complex temporal dynamics across multiple related time series. The advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where there is a limited number of time series or observations per time series, or both. To cope with this data scarcity issue, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF). DAF leverages statistical strengths from a relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we use an attention-based shared module with a domain discriminator across domains and private modules for individual domains. We induce domain-invariant latent features (queries and keys) and retrain domain-specific features (values) simultaneously to enable joint training of forecasters on source and target domains. A main insight is that our design of aligning keys allows the target domain to leverage source time series even with different characteristics. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets, and ablation studies verify the effectiveness of our design choices.  ( 3 min )
    Generalized Frank-Wolfe Algorithm for Bilevel Optimization. (arXiv:2206.08868v1 [math.OC])
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are not satisfactory as they are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a generalization of the Frank-Wolfe (FW) method to solve the considered problem. The main idea of our method is to locally approximate the solution set of the lower-level problem via a cutting plane, and then run a FW-type update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We further prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered bilevel problem. We also present numerical experiments to showcase the superior performance of our method compared with state-of-the-art methods.  ( 2 min )
    CausalVAE: Structured Causal Disentanglement in Variational Autoencoder. (arXiv:2004.08697v6 [cs.LG] UPDATED)
    Learning disentanglement aims at finding a low dimensional representation which consists of multiple explanatory and generative factors of the observational data. The framework of variational autoencoder (VAE) is commonly used to disentangle independent factors from observations. However, in real scenarios, factors with semantics are not necessarily independent. Instead, there might be an underlying causal structure which renders these factors dependent. We thus propose a new VAE based framework named CausalVAE, which includes a Causal Layer to transform independent exogenous factors into causal endogenous ones that correspond to causally related concepts in data. We further analyze the model identifiability, showing that the proposed model learned from observations recovers the true one up to a certain degree. Experiments are conducted on various datasets, including synthetic data and the real-world benchmark CelebA. Results show that the causal representations learned by CausalVAE are semantically interpretable, and their causal relationship as a Directed Acyclic Graph (DAG) is identified with good accuracy. Furthermore, we demonstrate that the proposed CausalVAE model is able to generate counterfactual data through "do-operation" to the causal factors.  ( 2 min )
    AutoML Two-Sample Test. (arXiv:2206.08843v1 [cs.LG])
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.  ( 2 min )
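    A sketch of the test itself, with a fixed regressor standing in for the AutoML fit (the paper selects the learner automatically; the permutation p-value is one standard calibration choice assumed here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def witness_test(X, Y, n_perm=500, seed=0):
    """Fit a witness toward +1/-1 via squared loss on half the data; the held-out
    mean discrepancy of its output is the test statistic."""
    rng = np.random.default_rng(seed)
    nx, ny = len(X) // 2, len(Y) // 2
    Ztr = np.vstack([X[:nx], Y[:ny]])
    ltr = np.r_[np.ones(nx), -np.ones(ny)]
    w = GradientBoostingRegressor().fit(Ztr, ltr)      # stands in for the AutoML fit
    s = np.r_[w.predict(X[nx:]), w.predict(Y[ny:])]
    m = len(X) - nx
    stat = s[:m].mean() - s[m:].mean()                 # witness mean discrepancy
    null = []
    for _ in range(n_perm):
        p = rng.permutation(len(s))
        null.append(s[p[:m]].mean() - s[p[m:]].mean())
    return stat, float(np.mean(np.array(null) >= stat))  # one-sided permutation p-value

X, Y = np.random.randn(400, 3), np.random.randn(400, 3) + 0.3
print(witness_test(X, Y))                              # small p-value: distributions differ
```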
    abess: A Fast Best Subset Selection Library in Python and R. (arXiv:2110.09697v2 [stat.ML] UPDATED)
    We introduce a new library named abess that implements a unified framework of best-subset selection for solving diverse machine learning problems, e.g., linear regression, classification, and principal component analysis. Particularly, abess certifiably gets the optimal solution in polynomial time with high probability under the linear model. Our efficient implementation allows abess to attain the solution of best-subset selection problems as fast as or even 20x faster than existing competing variable (model) selection toolboxes. Furthermore, it supports common variants like best group subset selection and $\ell_2$ regularized best-subset selection. The core of the library is programmed in C++. For ease of use, a Python library is designed for conveniently integrating with scikit-learn, and it can be installed from the Python Package Index. In addition, a user-friendly R library is available at the Comprehensive R Archive Network. The source code is available at: https://github.com/abess-team/abess.  ( 2 min )
    How robust are pre-trained models to distribution shift?. (arXiv:2206.08871v1 [cs.LG])
    The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and auto-encoder based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a novel evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pre-trained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.  ( 2 min )
    Meta-Learning Hypothesis Spaces for Sequential Decision-making. (arXiv:2202.00602v3 [stat.ML] UPDATED)
    Obtaining reliable, adaptive confidence sets for prediction functions (hypotheses) is a central challenge in sequential decision-making tasks, such as bandits and model-based reinforcement learning. These confidence sets typically rely on prior assumptions on the hypothesis space, e.g., the known kernel of a Reproducing Kernel Hilbert Space (RKHS). Hand-designing such kernels is error prone, and misspecification may lead to poor or unsafe performance. In this work, we propose to meta-learn a kernel from offline data (Meta-KeL). For the case where the unknown kernel is a combination of known base kernels, we develop an estimator based on structured sparsity. Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets that, with increasing amounts of offline data, become as tight as those given the true unknown kernel. We demonstrate our approach on the kernelized bandit problem (a.k.a.~Bayesian optimization), where we establish regret bounds competitive with those given the true kernel. We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
    Capturing Actionable Dynamics with Structured Latent Ordinary Differential Equations. (arXiv:2202.12932v2 [stat.ML] UPDATED)
    End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.
    Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay. (arXiv:2206.08756v1 [math.ST])
    We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also give the first rigorous evidence for the statistical-computational gap in scalar-on-tensor regression under the low-degree polynomials framework. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
    Active Sampling for Min-Max Fairness. (arXiv:2006.06879v3 [stat.ML] UPDATED)
    We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.
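    The key intuition translates into a very short training loop; a sketch for logistic regression (the group structure, step size, and synthetic data are illustrative assumptions):

```python
import numpy as np

def minmax_fair_sgd(X, y, groups, n_steps=2000, lr=0.1, seed=0):
    """At each step, take an SGD update on a point drawn from the group that is
    currently worst off under the model (logistic loss)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    ids = [np.where(groups == g)[0] for g in np.unique(groups)]
    for _ in range(n_steps):
        losses = [np.mean(np.log1p(np.exp(-y[i] * (X[i] @ w)))) for i in ids]
        j = rng.choice(ids[int(np.argmax(losses))])     # sample from the worst-off group
        w -= lr * (-y[j] * X[j] / (1.0 + np.exp(y[j] * (X[j] @ w))))  # logistic gradient
    return w

X = np.random.randn(600, 5)
groups = np.random.randint(0, 3, size=600)
y = np.sign(X[:, 0] + 0.7 * (groups == 2))              # group 2 is distributed differently
w = minmax_fair_sgd(X, y, groups)                       # tilts updates toward that group
```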
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v2 [cs.IR] UPDATED)
    Analysis of short text, such as social media posts, is extremely difficult because of their inherent brevity. In addition to classifying topics of such posts, a common downstream task is grouping the authors of these documents for subsequent analyses. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as -- or better than -- traditional approaches, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology. We also develop a novel measure of echo chambers among these politicians by characterizing insularity of topics discussed by groups of Senators and provide uncertainty quantification.
    Optimizing Sequential Experimental Design with Deep Reinforcement Learning. (arXiv:2202.00821v3 [cs.LG] UPDATED)
    Bayesian approaches developed to solve the optimal design of sequential experiments are mathematically elegant but computationally challenging. Recently, techniques using amortization have been proposed to make these Bayesian approaches practical, by training a parameterized policy that proposes designs efficiently at deployment time. However, these methods may not sufficiently explore the design space, require access to a differentiable probabilistic model and can only optimize over continuous design spaces. Here, we address these limitations by showing that the problem of optimizing policies can be reduced to solving a Markov decision process (MDP). We solve the equivalent MDP with modern deep reinforcement learning techniques. Our experiments show that our approach is also computationally efficient at deployment time and exhibits state-of-the-art performance on both continuous and discrete design spaces, even when the probabilistic model is a black box.
    Multiple-Play Stochastic Bandits with Shareable Finite-Capacity Arms. (arXiv:2206.08776v1 [cs.LG])
    We generalize the multiple-play multi-armed bandits (MP-MAB) problem with a shareable arm setting, in which several plays can share the same arm. Furthermore, each shareable arm has a finite reward capacity and a "per-load" reward distribution, both of which are unknown to the learner. The reward from a shareable arm is load-dependent: it is the "per-load" reward multiplied by either the number of plays pulling the arm or, when the number of plays exceeds the capacity limit, by the reward capacity. When the "per-load" reward follows a Gaussian distribution, we prove a sample complexity lower bound for learning the capacity from load-dependent rewards and also a regret lower bound for this new MP-MAB problem. We devise a capacity estimator whose sample complexity upper bound matches the lower bound in terms of reward means and capacities. We also propose an online learning algorithm to address the problem and prove its regret upper bound. The first term of this regret upper bound matches the regret lower bound, and its second and third terms also correspond closely to terms in the lower bound. Extensive experiments validate our algorithm's performance and also its gain in 5G & 4G base station selection.
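    The load-dependent reward model is simple to state in code; a sketch (the Gaussian noise and saturation rule follow the abstract, while the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def shareable_reward(per_load_mean, capacity, n_plays, sigma=0.1):
    """Per-load reward times the effective load, which saturates at the arm's capacity."""
    effective_load = min(n_plays, capacity)
    return per_load_mean * effective_load + sigma * rng.standard_normal()

# e.g. a base station with capacity 3: the 4th and 5th simultaneous plays add nothing
print([round(shareable_reward(1.0, 3, k), 2) for k in range(1, 6)])
```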
    MET: Masked Encoding for Tabular Data. (arXiv:2206.08564v1 [cs.LG])
    We consider the task of self-supervised representation learning (SSL) for tabular data: tabular-SSL. Typical contrastive learning based SSL methods require instance-wise data augmentations which are difficult to design for unstructured tabular data. Existing tabular-SSL methods design such augmentations in a relatively ad-hoc fashion and can fail to capture the underlying data manifold. Instead of augmentation-based approaches for tabular-SSL, we propose a new reconstruction based method, called Masked Encoding for Tabular Data (MET), that does not require augmentations. MET is based on the popular MAE approach for vision-SSL [He et al., 2021] and uses two key ideas: (i) since each coordinate in a tabular dataset has a distinct meaning, we need to use separate representations for all coordinates, and (ii) using an adversarial reconstruction loss in addition to the standard one. Empirical results on five diverse tabular datasets show that MET achieves a new state of the art (SOTA) on all of these datasets and improves up to 9% over current SOTA methods. We shed more light on the working of MET via experiments on carefully designed simple datasets.
    FedNew: A Communication-Efficient and Privacy-Preserving Newton-Type Method for Federated Learning. (arXiv:2206.08829v1 [cs.LG])
    Newton-type methods are popular in federated learning due to their fast convergence. Still, they suffer from two main issues, namely: low communication efficiency and low privacy due to the requirement of sending Hessian information from clients to the parameter server (PS). In this work, we introduce a novel framework called FedNew in which there is no need to transmit Hessian information from clients to the PS, hence resolving the bottleneck and improving communication efficiency. In addition, FedNew hides the gradient information and results in a privacy-preserving approach compared to the existing state-of-the-art. The core novel idea in FedNew is to introduce a two level framework, and alternate between updating the inverse Hessian-gradient product using only one alternating direction method of multipliers (ADMM) step and then performing the global model update using Newton's method. Though only one ADMM pass is used to approximate the inverse Hessian-gradient product at each iteration, we develop a novel theoretical approach to show the converging behavior of FedNew for convex problems. Additionally, a significant reduction in communication overhead is achieved by utilizing stochastic quantization. Numerical results using real datasets show the superiority of FedNew compared to existing methods in terms of communication costs.
    k-Sliced Mutual Information: A Quantitative Study of Scalability with Dimension. (arXiv:2206.08526v1 [cs.IT])
Sliced mutual information (SMI) is defined as an average of mutual information (MI) terms between one-dimensional random projections of the random variables. It serves as a surrogate measure of dependence to classic MI that preserves many of its properties but is more scalable to high dimensions. However, a quantitative characterization of how SMI itself and estimation rates thereof depend on the ambient dimension, which is crucial to the understanding of scalability, remains obscure. This work extends the original SMI definition to $k$-SMI, which considers projections to $k$-dimensional subspaces, and provides a multifaceted account of its dependence on dimension. Using a new result on the continuity of differential entropy in the 2-Wasserstein metric, we derive sharp bounds on the error of Monte Carlo (MC)-based estimates of $k$-SMI, with explicit dependence on $k$ and the ambient dimension, revealing their interplay with the number of samples. We then combine the MC integrator with the neural estimation framework to provide an end-to-end $k$-SMI estimator, for which optimal convergence rates are established. We also explore asymptotics of the population $k$-SMI as dimension grows, providing Gaussian approximation results with a residual that decays under appropriate moment bounds. Our theory is validated with numerical experiments and is applied to sliced InfoGAN, which altogether provide a comprehensive quantitative account of the scalability question of $k$-SMI, including SMI as a special case when $k=1$.
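A hedged Monte Carlo sketch of the $k=1$ case (plain SMI): average a 1-D mutual information estimate over random projection directions. The Gaussian plug-in MI estimator below is an assumption made for simplicity; it happens to be exact for the Gaussian toy data used here:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mi_1d(u, v):
    """Plug-in MI estimate for two 1-D variables under a Gaussian assumption."""
    rho = np.corrcoef(u, v)[0, 1]
    return -0.5 * np.log(1 - rho ** 2)

def sliced_mi(X, Y, n_projections=500):
    """Monte Carlo estimate of SMI (k = 1): average MI over random 1-D projections."""
    vals = []
    for _ in range(n_projections):
        theta = rng.normal(size=X.shape[1]); theta /= np.linalg.norm(theta)
        phi = rng.normal(size=Y.shape[1]); phi /= np.linalg.norm(phi)
        vals.append(gaussian_mi_1d(X @ theta, Y @ phi))
    return float(np.mean(vals))

# Correlated Gaussian pair in 10 dimensions
X = rng.normal(size=(5000, 10))
Y = X + 0.5 * rng.normal(size=(5000, 10))
print(sliced_mi(X, Y))
```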
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v1 [stat.ML])
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting. FiT uses an automatically configured Naive Bayes classifier on top of a fixed backbone that has been pretrained on large image datasets. Parameter efficient FiLM layers are used to modulate the backbone, shaping the representation for the downstream task. The network is trained via an episodic fine-tuning protocol. The approach is parameter efficient which is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the state-of-the-art Big Transfer (BiT) algorithm at low-shot and on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
    Reframed GES with a Neural Conditional Dependence Measure. (arXiv:2206.08531v1 [stat.ML])
    In a nonparametric setting, the causal structure is often identifiable only up to Markov equivalence, and for the purpose of causal inference, it is useful to learn a graphical representation of the Markov equivalence class (MEC). In this paper, we revisit the Greedy Equivalence Search (GES) algorithm, which is widely cited as a score-based algorithm for learning the MEC of the underlying causal structure. We observe that in order to make the GES algorithm consistent in a nonparametric setting, it is not necessary to design a scoring metric that evaluates graphs. Instead, it suffices to plug in a consistent estimator of a measure of conditional dependence to guide the search. We therefore present a reframing of the GES algorithm, which is more flexible than the standard score-based version and readily lends itself to the nonparametric setting with a general measure of conditional dependence. In addition, we propose a neural conditional dependence (NCD) measure, which utilizes the expressive power of deep neural networks to characterize conditional independence in a nonparametric manner. We establish the optimality of the reframed GES algorithm under standard assumptions and the consistency of using our NCD estimator to decide conditional independence. Together these results justify the proposed approach. Experimental results demonstrate the effectiveness of our method in causal discovery, as well as the advantages of using our NCD measure over kernel-based measures.
    Adapting the Linearised Laplace Model Evidence for Modern Deep Learning. (arXiv:2206.08900v1 [stat.ML])
    The linearised Laplace method for estimating model uncertainty has received renewed attention in the Bayesian deep learning community. The method provides reliable error bars and admits a closed-form expression for the model evidence, allowing for scalable selection of model hyperparameters. In this work, we examine the assumptions behind this method, particularly in conjunction with model selection. We show that these interact poorly with some now-standard tools of deep learning--stochastic approximation methods and normalisation layers--and make recommendations for how to better adapt this classic method to the modern setting. We provide theoretical support for our recommendations and validate them empirically on MLPs, classic CNNs, residual networks with and without normalisation layers, generative autoencoders and transformers.
    Omni-Scale CNNs: a simple and effective kernel size configuration for time series classification. (arXiv:2002.10061v3 [cs.LG] UPDATED)
The Receptive Field (RF) size has been one of the most important factors for One Dimensional Convolutional Neural Networks (1D-CNNs) on time series classification tasks. Great effort has gone into choosing the appropriate size because it strongly influences performance and differs significantly across datasets. In this paper, we propose an Omni-Scale block (OS-block) for 1D-CNNs, where the kernel sizes are decided by a simple and universal rule. Specifically, it is a set of kernel sizes, consisting of multiple prime numbers chosen according to the length of the time series, that can efficiently cover the best RF size across different datasets. Experimental results show that models with the OS-block can achieve performance similar to models with the searched optimal RF size, and, thanks to this strong ability to capture the optimal RF size, simple 1D-CNN models with the OS-block achieve state-of-the-art performance on four time series benchmarks, including both univariate and multivariate data from multiple domains. Comprehensive analysis and discussions shed light on why the OS-block can capture optimal RF sizes across different datasets. Code is available at https://github.com/Wensi-Tang/OS-CNN
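A sketch of the flavor of the rule: collect small primes (plus 1 and 2) up to a budget tied to the series length. The budget below is a hypothetical illustrative choice; the paper specifies the precise configuration:

```python
def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [i for i, is_p in enumerate(sieve) if is_p]

# Hypothetical kernel-size set for a length-L series: 1, 2, and primes up to a budget.
# The L // 4 budget is an illustrative guess, not the paper's actual rule.
L = 128
kernel_sizes = sorted(set([1, 2] + primes_up_to(L // 4)))
print(kernel_sizes)  # [1, 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
```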
    Variational Estimators of the Degree-corrected Latent Block Model for Bipartite Networks. (arXiv:2206.08465v1 [stat.ML])
Biclustering on bipartite graphs is an unsupervised learning task that simultaneously clusters the two types of objects in the graph, for example, users and movies in a movie review dataset. The latent block model (LBM) has been proposed as a model-based tool for biclustering. Biclustering results by the LBM are, however, usually dominated by the row and column sums of the data matrix, i.e., degrees. We propose a degree-corrected latent block model (DC-LBM) to accommodate degree heterogeneity in row and column clusters, which greatly outperforms the classical LBM in the MovieLens dataset and simulated data. We develop an efficient variational expectation-maximization algorithm by observing that the row and column degrees maximize the objective function in the M step given any probability assignment on the cluster labels. We prove the label consistency of the variational estimator under the DC-LBM, which allows the expected graph density to go to zero as long as the average expected degrees of rows and columns go to infinity.
    The Role of Depth, Width, and Activation Complexity in the Number of Linear Regions of Neural Networks. (arXiv:2206.08615v1 [cs.LG])
Many feedforward neural networks generate continuous and piecewise-linear (CPWL) mappings. Specifically, they partition the input domain into regions on which the mapping is an affine function. The number of these so-called linear regions offers a natural metric to characterize the expressiveness of CPWL mappings. Although the precise determination of this quantity is often out of reach, bounds have been proposed for specific architectures, including the well-known ReLU and Maxout networks. In this work, we propose a more general perspective and provide precise bounds on the maximal number of linear regions of CPWL networks based on three sources of expressiveness: depth, width, and activation complexity. Our estimates rely on the combinatorial structure of convex partitions and highlight the distinctive role of depth which, on its own, is able to exponentially increase the number of regions. We then introduce a complementary stochastic framework to estimate the average number of linear regions produced by a CPWL network architecture. Under reasonable assumptions, the expected density of linear regions along any 1D path is bounded by the product of depth, width, and a measure of activation complexity (up to a scaling factor). This assigns an identical role to the three sources of expressiveness: no exponential growth with depth is observed anymore.
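Counting linear regions along a 1-D path is easy to do empirically: a ReLU network's activation pattern is constant within one linear region, so counting pattern changes counts region crossings. A sketch with a tiny random network (all sizes and scales are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny ReLU network: 2 -> 16 -> 16 -> 1
Ws = [rng.normal(size=(2, 16)), rng.normal(size=(16, 16)), rng.normal(size=(16, 1))]
bs = [rng.normal(size=16), rng.normal(size=16), np.zeros(1)]

def activation_pattern(x):
    """Sign pattern of all ReLU pre-activations; constant within one linear region."""
    pattern, h = [], x
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre = h @ W + b
        pattern.append(pre > 0)
        h = np.maximum(0, pre)
    return np.concatenate(pattern)

# Walk a straight 1-D path through input space and count pattern changes
ts = np.linspace(0, 1, 20000)
a, b = np.array([-3.0, -3.0]), np.array([3.0, 3.0])
patterns = [activation_pattern((1 - t) * a + t * b) for t in ts]
regions = 1 + sum(not np.array_equal(p, q) for p, q in zip(patterns, patterns[1:]))
print("linear regions crossed along the path:", regions)
```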
    Thompson Sampling Achieves $\tilde O(\sqrt{T})$ Regret in Linear Quadratic Control. (arXiv:2206.08520v1 [cs.LG])
Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, where an action is sampled from a carefully prescribed distribution which is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or utilize computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control, TSAC, that attains $\tilde O(\sqrt{T})$ regret, even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require an a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that the TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of TSAC in several adaptive control tasks.
    Generalised Policy Improvement with Geometric Policy Composition. (arXiv:2206.08736v1 [stat.ML])
    We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
    Active Fairness Auditing. (arXiv:2206.08450v1 [cs.LG])
    The fast spreading adoption of machine learning (ML) by companies across industries poses significant regulatory challenges. One such challenge is scalability: how can regulatory bodies efficiently audit these ML models, ensuring that they are fair? In this paper, we initiate the study of query-based auditing algorithms that can estimate the demographic parity of ML models in a query-efficient manner. We propose an optimal deterministic algorithm, as well as a practical randomized, oracle-efficient algorithm with comparable guarantees. Furthermore, we make inroads into understanding the optimal query complexity of randomized active fairness estimation algorithms. Our first exploration of active fairness estimation aims to put AI governance on firmer theoretical foundations.
    Generalised Bayesian Inference for Discrete Intractable Likelihood. (arXiv:2206.08420v1 [stat.ME])
    Discrete state spaces represent a major computational challenge to statistical inference, since the computation of normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge through the development of a novel generalised Bayesian inference procedure suitable for discrete intractable likelihood. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence, in lieu of the problematic intractable likelihood. The result is a generalised posterior that can be sampled using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost.
    Powershap: A Power-full Shapley Feature Selection Method. (arXiv:2206.08394v1 [cs.LG])
Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. While wrapper methods commonly result in strong predictive performances, they suffer from a large computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Alternatively, filter methods are considerably faster, but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present powershap, a novel wrapper feature selection method, which leverages statistical hypothesis testing and power calculations in combination with Shapley values for quick and intuitive feature selection. Powershap is built on the core assumption that an informative feature will have a larger impact on the prediction compared to a known random feature. Benchmarks and simulations show that powershap outperforms other filter methods with predictive performances on par with wrapper methods while being significantly faster, often even reaching half or a third of the execution time. As such, powershap provides a competitive and quick algorithm that can be used by various models in different domains. Furthermore, powershap is implemented as a plug-and-play and open-source sklearn component, enabling easy integration in conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyper-parameters of the powershap algorithm, allowing the algorithm to be used without any configuration.
    Diffusion-GAN: Training GANs with Diffusion. (arXiv:2206.02262v2 [cs.LG] UPDATED)
For stable training of generative adversarial networks (GANs), injecting instance noise into the input of the discriminator is considered a theoretically sound solution, which, however, has not yet delivered on its promise in practice. This paper introduces Diffusion-GAN, which employs a Gaussian mixture distribution, defined over all the diffusion steps of a forward diffusion chain, to inject instance noise. A random sample from the mixture, diffused from observed or generated data, is fed as the input to the discriminator. The generator is updated by backpropagating its gradient through the forward diffusion chain, whose length is adaptively adjusted to control the maximum noise-to-data ratio allowed at each training step. Theoretical analysis verifies the soundness of the proposed Diffusion-GAN, which provides model- and domain-agnostic differentiable augmentation. A rich set of experiments on diverse datasets shows that Diffusion-GAN can provide stable and data-efficient GAN training, bringing consistent performance improvement over strong GAN baselines for synthesizing photo-realistic images.
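A hedged sketch of the noise-injection step: sample a diffusion timestep up to an adaptive cap and feed the noised sample to the discriminator. The linear beta schedule and all sizes are standard assumptions for illustration, not necessarily the paper's choices:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard linear schedule (an assumption)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffuse_for_discriminator(x, t_max):
    """Sample a diffusion step t < t_max and return the noised discriminator input.
    t_max would be adapted during training to cap the noise-to-data ratio."""
    t = torch.randint(0, t_max, (x.shape[0],))
    a = alpha_bar[t].view(-1, *([1] * (x.dim() - 1)))
    eps = torch.randn_like(x)
    return a.sqrt() * x + (1 - a).sqrt() * eps

x_real = torch.randn(8, 3, 32, 32)               # stand-in for a batch of real images
x_noisy = diffuse_for_discriminator(x_real, t_max=50)
```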
    Thompson Sampling for Robust Transfer in Multi-Task Bandits. (arXiv:2206.08556v1 [cs.LG])
    We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
    Communication-Efficient Adaptive Federated Learning. (arXiv:2205.02719v2 [cs.LG] UPDATED)
Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to repetitive server-client synchronization and the lack of adaptivity of SGD-based model updates. Although various methods have been proposed for reducing the communication cost by gradient compression or quantization, and federated versions of adaptive optimizers such as FedAdam have been proposed to add more adaptivity, the current federated learning framework still cannot solve the aforementioned challenges all at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v1 [stat.ML])
Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, taking a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: density estimation on the sphere, variational inference, and hyperspherical auto-encoders.
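For intuition, a hedged sketch of the standard Euclidean Sliced-Wasserstein distance that the spherical construction generalizes; the sorting step implements the 1-D closed form (the spherical variant would instead use the closed form on the circle):

```python
import numpy as np

rng = np.random.default_rng(0)

def sliced_wasserstein(X, Y, n_projections=200, p=2):
    """Monte Carlo Sliced-Wasserstein distance between two equal-size point clouds
    in R^d: project onto random directions, then compare sorted order statistics."""
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d); theta /= np.linalg.norm(theta)
        u, v = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(u - v) ** p)
    return (total / n_projections) ** (1 / p)

X = rng.normal(size=(1000, 5))
Y = rng.normal(size=(1000, 5)) + 1.0
print(sliced_wasserstein(X, Y))  # roughly the mean shift seen along random directions
```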
    Personalized Federated Learning through Local Memorization. (arXiv:2111.09360v3 [cs.LG] UPDATED)
    Federated learning allows clients to collaboratively learn statistical models while keeping their data local. Federated learning was originally used to train a unique global model to be served to all clients, but this approach might be sub-optimal when clients' local data distributions are heterogeneous. In order to tackle this limitation, recent personalized federated learning methods train a separate model for each client while still leveraging the knowledge available at other clients. In this work, we exploit the ability of deep neural networks to extract high quality vectorial representations (embeddings) from non-tabular data, e.g., images and text, to propose a personalization mechanism based on local memorization. Personalization is obtained by interpolating a collectively trained global model with a local $k$-nearest neighbors (kNN) model based on the shared representation provided by the global model. We provide generalization bounds for the proposed approach in the case of binary classification, and we show on a suite of federated datasets that this approach achieves significantly higher accuracy and fairness than state-of-the-art methods.
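A hedged sketch of the interpolation idea, with random stand-ins for the global model's embeddings and predicted probabilities; the actual models, the shared representation, and the per-client tuning of the weight are beyond this toy:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-ins: embeddings from the shared global model, plus its class probabilities
local_emb = rng.normal(size=(200, 32))          # client's local data, embedded
local_y = rng.integers(0, 2, size=200)
test_emb = rng.normal(size=(10, 32))
global_proba = rng.dirichlet([1, 1], size=10)   # global model's predicted probabilities

knn = KNeighborsClassifier(n_neighbors=5).fit(local_emb, local_y)
knn_proba = knn.predict_proba(test_emb)

lam = 0.5                                       # interpolation weight, tuned per client
personalized = lam * global_proba + (1 - lam) * knn_proba
print(personalized.argmax(axis=1))
```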
    On Integrating Prior Knowledge into Gaussian Processes for Prognostic Health Monitoring. (arXiv:2206.08600v1 [stat.ML])
Gaussian process regression is a powerful method for predicting states based on given data. It has been successfully applied for probabilistic predictions of structural systems to quantify, for example, the crack growth in mechanical structures. Typically, predefined mean and covariance functions are employed to construct the Gaussian process model. Then, the model is updated using current data during operation while prior information based on previous data is ignored. However, predefined mean and covariance functions without prior information reduce the potential of Gaussian processes. This paper proposes a method to improve the predictive capabilities of Gaussian processes. We integrate prior knowledge by deriving the mean and covariance functions from previous data. More specifically, we first approximate previous data by a weighted sum of basis functions and then derive the mean and covariance functions directly from the estimated weight coefficients. Basis functions may be either estimated or derived from problem-specific governing equations to incorporate physical information. The applicability and effectiveness of this approach are demonstrated for fatigue crack growth, laser degradation, and milling machine wear data. We show that well-chosen mean and covariance functions, like those based on previous data, significantly increase look-ahead time and accuracy. Using physical basis functions further improves accuracy. In addition, the computational effort for training is significantly reduced.
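A minimal sketch of the data-driven prior construction: fit basis-function weights to previous runs, then define the GP mean and covariance directly from the empirical weight statistics, m(x) = phi(x)^T mu_w and k(x, x') = phi(x)^T Sigma_w phi(x'). The quadratic basis and the synthetic "previous runs" are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def basis(x):
    """Example polynomial basis phi(x) = [1, x, x^2]; physical bases also fit here."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

# Previous runs: noisy quadratic degradation curves observed on a common grid
x_prev = np.linspace(0, 1, 50)
prev_runs = [0.5 * x_prev ** 2 + 0.02 * rng.normal(size=50) for _ in range(20)]
Phi = basis(x_prev)                                                   # (50, 3)
W = np.stack([np.linalg.lstsq(Phi, y, rcond=None)[0] for y in prev_runs])  # (20, 3)

mu_w, Sigma_w = W.mean(axis=0), np.cov(W.T)

def mean_fn(x):
    return basis(x) @ mu_w

def cov_fn(x1, x2):
    return basis(x1) @ Sigma_w @ basis(x2).T

x_new = np.linspace(0, 1.5, 5)
print(mean_fn(x_new))          # data-driven prior mean for the next run
print(cov_fn(x_new, x_new))    # data-driven prior covariance
```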
    Fairness in Credit Scoring: Assessment, Implementation and Profit Implications. (arXiv:2103.01907v4 [stat.ML] UPDATED)
    The rise of algorithmic decision-making has spawned much research on fair machine learning (ML). Financial institutions use ML for building risk scorecards that support a range of credit-related decisions. Yet, the literature on fair ML in credit scoring is scarce. The paper makes three contributions. First, we revisit statistical fairness criteria and examine their adequacy for credit scoring. Second, we catalog algorithmic options for incorporating fairness goals in the ML model development pipeline. Last, we empirically compare different fairness processors in a profit-oriented credit scoring context using real-world data. The empirical results substantiate the evaluation of fairness measures, identify suitable options to implement fair credit scoring, and clarify the profit-fairness trade-off in lending decisions. We find that multiple fairness criteria can be approximately satisfied at once and recommend separation as a proper criterion for measuring the fairness of a scorecard. We also find fair in-processors to deliver a good balance between profit and fairness and show that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost. The codes corresponding to the paper are available on GitHub.
    Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective. (arXiv:2110.06256v2 [cs.LG] UPDATED)
    This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101, and WT103 + TransformerXL models), the neural network's weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.
    Quantifying Feature Contributions to Overall Disparity Using Information Theory. (arXiv:2206.08454v1 [cs.LG])
    When a machine-learning algorithm makes biased decisions, it can be helpful to understand the sources of disparity to explain why the bias exists. Towards this, we examine the problem of quantifying the contribution of each individual feature to the observed disparity. If we have access to the decision-making model, one potential approach (inspired from intervention-based approaches in explainability literature) is to vary each individual feature (while keeping the others fixed) and use the resulting change in disparity to quantify its contribution. However, we may not have access to the model or be able to test/audit its outputs for individually varying features. Furthermore, the decision may not always be a deterministic function of the input features (e.g., with human-in-the-loop). For these situations, we might need to explain contributions using purely distributional (i.e., observational) techniques, rather than interventional. We ask the question: what is the "potential" contribution of each individual feature to the observed disparity in the decisions when the exact decision-making mechanism is not accessible? We first provide canonical examples (thought experiments) that help illustrate the difference between distributional and interventional approaches to explaining contributions, and when either is better suited. When unable to intervene on the inputs, we quantify the "redundant" statistical dependency about the protected attribute that is present in both the final decision and an individual feature, by leveraging a body of work in information theory called Partial Information Decomposition. We also perform a simple case study to show how this technique could be applied to quantify contributions.
    Scalable Deep Reinforcement Learning Algorithms for Mean Field Games. (arXiv:2203.11973v2 [cs.LG] UPDATED)
Mean Field Games (MFGs) have been introduced to efficiently approximate games with very large populations of strategic agents. Recently, the question of learning equilibria in MFGs has gained momentum, particularly using model-free reinforcement learning (RL) methods. One limiting factor to further scaling up with RL is that existing algorithms to solve MFGs require the mixing of approximated quantities such as strategies or $q$-values. This is far from trivial in the case of non-linear function approximators that enjoy good generalization properties, e.g., neural networks. We propose two methods to address this shortcoming. The first one learns a mixed strategy from distillation of historical data into a neural network and is applied to the Fictitious Play algorithm. The second one is an online mixing method based on regularization that does not require memorizing historical data or previous estimates. It is used to extend Online Mirror Descent. We demonstrate numerically that these methods efficiently enable the use of Deep RL algorithms to solve various MFGs. In addition, we show that these methods outperform SotA baselines from the literature.
    Adversarial Estimators. (arXiv:2204.10495v3 [econ.EM] UPDATED)
    We develop an asymptotic theory of adversarial estimators ('A-estimators'). They generalize maximum-likelihood-type estimators ('M-estimators') as their average objective is maximized by some parameters and minimized by others. This class subsumes the continuous-updating Generalized Method of Moments, Generative Adversarial Networks and more recent proposals in machine learning and econometrics. In these examples, researchers state which aspects of the problem may in principle be used for estimation, and an adversary learns how to emphasize them optimally. We derive the convergence rates of A-estimators under pointwise and partial identification, and the normality of functionals of their parameters. Unknown functions may be approximated via sieves such as deep neural networks, for which we provide simplified low-level conditions. As a corollary, we obtain the normality of neural-net M-estimators, overcoming technical issues previously identified by the literature. Our theory yields novel results about a variety of A-estimators, providing intuition and formal justification for their success in recent applications.
    Orthonormal Expansions for Translation-Invariant Kernels. (arXiv:2206.08648v1 [math.CA])
    We present a general Fourier analytic technique for constructing orthonormal basis expansions of translation-invariant kernels from orthonormal bases of $\mathscr{L}_2(\mathbb{R})$. This allows us to derive explicit expansions on the real line for (i) Mat\'ern kernels of all half-integer orders in terms of associated Laguerre functions, (ii) the Cauchy kernel in terms of rational functions, and (iii) the Gaussian kernel in terms of Hermite functions.
    Fast Finite Width Neural Tangent Kernel. (arXiv:2206.08720v1 [cs.LG])
    The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.
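To make the definition concrete, here is a brute-force sketch that computes the finite-width NTK of a tiny MLP via finite-difference parameter Jacobians; it costs O(number of parameters) per input and is exactly the kind of naive computation the paper's fast algorithms are designed to beat. Architecture and initialization are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(3, 8), (8,), (8,), ()]        # W1, b1, W2, b2 for a tiny scalar-output MLP
flat0 = 0.5 * rng.normal(size=sum(int(np.prod(s)) for s in shapes))

def unflatten(v):
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s)); out.append(v[i:i + n].reshape(s)); i += n
    return out

def f(v, x):
    W1, b1, W2, b2 = unflatten(v)
    h = np.tanh(x @ W1 + b1)             # smooth activation keeps derivatives clean
    return h @ W2 + b2

def jacobian(v, x, eps=1e-5):
    """Finite-difference Jacobian of the scalar output w.r.t. all parameters."""
    J = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v); e[i] = eps
        J[i] = (f(v + e, x) - f(v - e, x)) / (2 * eps)
    return J

x1, x2 = rng.normal(size=3), rng.normal(size=3)
print(jacobian(flat0, x1) @ jacobian(flat0, x2))  # Theta(x1, x2) = J(x1) J(x2)^T
```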
    TKIL: Tangent Kernel Approach for Class Balanced Incremental Learning. (arXiv:2206.08492v1 [cs.LG])
When learning new tasks in a sequential manner, deep neural networks tend to forget tasks that they previously learned, a phenomenon called catastrophic forgetting. Class incremental learning methods aim to address this problem by keeping a memory of a few exemplars from previously learned tasks, and distilling knowledge from them. However, existing methods struggle to balance the performance across classes since they typically overfit the model to the latest task. In our work, we propose to address these challenges with the introduction of a novel methodology of Tangent Kernel for Incremental Learning (TKIL) that achieves class-balanced performance. The approach preserves the representations across classes and balances the accuracy for each class, and as such achieves better overall accuracy and lower variance. The TKIL approach is based on the Neural Tangent Kernel (NTK), which describes the convergence behavior of neural networks as a kernel function in the limit of infinite width. In TKIL, the gradients between feature layers are treated as the distance between the representations of these layers and are formulated as a Gradients Tangent Kernel (GTK) loss, which is minimized along with averaging the weights. This allows TKIL to automatically identify the task and to quickly adapt to it during inference. Experiments on CIFAR-100 and ImageNet datasets with various incremental learning settings show that these strategies allow TKIL to outperform existing state-of-the-art methods.
    Learning a Single Neuron with Adversarial Label Noise via Gradient Descent. (arXiv:2206.08918v1 [cs.LG])
We study the fundamental problem of learning a single neuron, i.e., a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations $\sigma:\mathbb{R}\mapsto\mathbb{R}$, with respect to the $L_2^2$-loss in the presence of adversarial label noise. Specifically, we are given labeled examples from a distribution $D$ on $(\mathbf{x}, y)\in\mathbb{R}^d \times \mathbb{R}$ such that there exists $\mathbf{w}^\ast\in\mathbb{R}^d$ achieving $F(\mathbf{w}^\ast)=\epsilon$, where $F(\mathbf{w})=\mathbf{E}_{(\mathbf{x},y)\sim D}[(\sigma(\mathbf{w}\cdot \mathbf{x})-y)^2]$. The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\, \epsilon$ with high probability, where $C>1$ is a universal constant. As our main contribution, we give efficient constant-factor approximate learners for a broad class of distributions (including log-concave distributions) and activation functions. Concretely, for the class of isotropic log-concave distributions, we obtain the following important corollaries: For the logistic activation, we obtain the first polynomial-time constant factor approximation (even under the Gaussian distribution). Our algorithm has sample complexity $\widetilde{O}(d/\epsilon)$, which is tight within polylogarithmic factors. For the ReLU activation, we give an efficient algorithm with sample complexity $\tilde{O}(d\, \mathrm{polylog}(1/\epsilon))$. Prior to our work, the best known constant-factor approximate learner had sample complexity $\tilde{\Omega}(d/\epsilon)$. In both of these settings, our algorithms are simple, performing gradient-descent on the (regularized) $L_2^2$-loss. The correctness of our algorithms relies on novel structural results that we establish, showing that (essentially all) stationary points of the underlying non-convex loss are approximately optimal.
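A hedged sketch of the algorithmic core: plain gradient descent on the $L_2^2$-loss for a logistic-activation neuron under Gaussian inputs. The random label corruption below is a toy stand-in, not a true adversarial noise model, and the paper's regularizer is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

d, n = 10, 5000
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))                   # Gaussian (hence log-concave) inputs
y = sigmoid(X @ w_star)
flip = rng.random(n) < 0.05                   # toy stand-in for adversarial corruption
y[flip] = rng.random(flip.sum())

w, lr = np.zeros(d), 0.5
for _ in range(2000):                         # plain GD on the empirical L2^2 loss
    p = sigmoid(X @ w)
    grad = X.T @ ((p - y) * p * (1 - p)) * (2 / n)
    w -= lr * grad

print("clean L2^2 loss:", np.mean((sigmoid(X @ w) - sigmoid(X @ w_star)) ** 2))
```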

  • Open

    V-Trace not considering full Trajectory
https://preview.redd.it/xm963043nn691.png?width=745&format=png&auto=webp&s=c3c8e14f81a79095e16fae225200de29f4046399 In the V-trace algorithm, the trajectory does not consider state values beyond a certain number of environment steps, n-1. In episodic environments, where episode lengths are typically longer than n-1, how are rewards (or V-trace state value/advantage estimations) supposed to influence learning? submitted by /u/atomicburn125 [link] [comments]  ( 82 min )
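One way to see the answer: V-trace bootstraps with the learned value at the end of the n-step window, so rewards beyond the window still influence learning through that value estimate. A hedged numpy sketch of n-step V-trace targets (clipping thresholds and all inputs below are toy assumptions):

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, rhos, cs, gamma=0.99):
    """n-step V-trace targets (IMPALA-style). Beyond the window the trajectory is
    summarized by bootstrapping with V at the last state, so later rewards still
    influence learning through that value estimate."""
    n = len(rewards)
    values_tp1 = np.append(values[1:], bootstrap)
    deltas = rhos * (rewards + gamma * values_tp1 - values)   # rho_t * TD errors
    vs, acc = np.zeros(n), 0.0
    for t in reversed(range(n)):                               # v_s - V_s recursion
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs

rng = np.random.default_rng(0)
n = 5
print(vtrace_targets(rng.random(n), rng.random(n), bootstrap=0.3,
                     rhos=np.minimum(1.0, 2 * rng.random(n)),   # clipped IS weights
                     cs=np.minimum(1.0, 2 * rng.random(n))))    # clipped trace cutoffs
```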
Suggest some final-year project ideas for electronics engineering using RL
    submitted by /u/AggravatingWest2037 [link] [comments]  ( 83 min )
    Would an actor critic method reduce to deep Q learning if no policy gradient loss was back-propagated?
    submitted by /u/atomicburn125 [link] [comments]  ( 83 min )
Is A2C using an n-step return or a one-step return? I see so many different versions from different sources
    submitted by /u/Professional_Card176 [link] [comments]  ( 82 min )
    A NEAT self-play agent for _Monopoly_, b2studios (recovers human valuations of properties)
    submitted by /u/gwern [link] [comments]  ( 82 min )
    Simplest gym environment with discrete actions?
    Hi there, What is the simplest `gym` environment with a discrete action space? I'm getting started with reinforcement learning and having fun doing some of my own implementations of standard algorithms (DQN, VPG, PPO, ...). It's been fun letting my agents loose in Super Mario Bros., but debugging my implementations has been a challenge. I'd like to find a simple environment to iterate rapidly on my models. Any recommendations? (Ideally, I'd like inputs to be screen pixels too, but that's not necessary.) Also eager to hear about more general advice on how to debug RL models. Thanks! submitted by /u/desperateEfforts1 [link] [comments]  ( 82 min )
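CartPole-v1 is a common answer: two discrete actions and a 4-dimensional state (not pixels, but very fast to iterate on). A minimal random-agent loop, assuming the classic gym step API that returns a 4-tuple (newer gym versions return (obs, info) from reset and a 5-tuple from step):

```python
import gym

# CartPole-v1: 2 discrete actions, 4-dim state; a common first target for DQN/PPO
env = gym.make("CartPole-v1")
obs = env.reset()
done, total = False, 0.0
while not done:
    action = env.action_space.sample()            # replace with your agent's policy
    obs, reward, done, info = env.step(action)
    total += reward
print("episode return:", total)
env.close()
```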
  • Open

    [D] Machine Learning - WAYR (What Are You Reading) - Week 140
This is a place to share machine learning research papers, journals, and articles that you're reading this week. If it relates to what you're researching, by all means elaborate and give us your insight, otherwise it could just be an interesting paper you've read. Please try to provide some insight from your understanding and please don't post things which are present in wiki. Preferably you should link the arxiv page (not the PDF, you can easily access the PDF from the summary page but not the other way around) or any other pertinent links. Previous weeks: 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100, 101-110, 111-120, 121-130, 131-140.  ( 85 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 87 min )
    [P] Track your ML Projects from Notion!
We are building an open-source library to enable tracking your ML projects from the same productivity tool that you already use and love. Check out https://github.com/paletteml/mlsync Our goal is to help ML developers bring useful insights from their ML environment to the rest of the team in an easy way. You can customize the data that gets delivered to Notion. Why MLSync? While the ML community has built several tools to better track and visualize ML workflow data for developers, there is a disconnect between ML workflow data and the tools used for project planning and management. MLSync is designed to bridge this gap. Contributing: We would love to have more contributors join us to add more features and APIs. Advanced Features: We are also building a cloud version for enterprise use cases (multiple users or data sources, in-house tools interfacing, authentication, etc.). Check out https://www.mlsync.dev/ Feel free to DM if you have suggestions, feature requests, or any other queries. submitted by /u/mighty-dude [link] [comments]  ( 84 min )
    [D] Initialize model weights based on a trained smaller model
Is there any existing work that explores how trained weights of a small model (e.g. Bert-base) can be used for a "smart" initialization of a larger model (bert-large) such that the training is more efficient? I couldn't really find such work but I guess I just used the wrong search terms. What is this line of research typically called? submitted by /u/muwnd [link] [comments]  ( 85 min )
    [D] Google quietly moving its products from Tensorflow to JAX
https://www.businessinsider.com/facebook-pytorch-beat-google-tensorflow-jax-meta-ai-2022-6 With companies and researchers leaving Tensorflow for PyTorch, Google seems to be interested in moving its products to JAX, addressing some of Tensorflow's pain points such as API complexity and the difficulty of training on custom chips like TPUs. The article says that JAX still has a long way to go, since it lacks proper optimization for GPUs and CPUs compared to TPUs. submitted by /u/Wild_Quiet8627 [link] [comments]  ( 93 min )
[D] As researchers, when do you stop working on your model and realize it's time to paper...
So I think I have this bad habit of one-upping myself: I generally have some good results, but if something bugs me, like resolution or data representation, I try to chase that rabbit and not publish what I have... So, to the community: when do you think it's time to stop and paper... or is going down the rabbit hole a general thing people go through... submitted by /u/bitemenow999 [link] [comments]  ( 86 min )
  • Open

    "French Cottage" 🇫🇷 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    MacBook Air M2 vs Windows?
    submitted by /u/Wolfieofwallstreet14 [link] [comments]  ( 82 min )
    Artificial Intelligence Survey
Hi everyone. As part of the project for the Consumer Behavior Insights curricular unit of a Master's Degree, we were challenged to study the safety and trust people have in Artificial Intelligence nowadays. We made a survey about this topic which will help us reach some conclusions. It would mean a lot for our project if you could spend 6-7 minutes completing this survey. Thank you for your precious help! 😁 https://novaims.eu.qualtrics.com/jfe/form/SV_54mmBbZEvDPoFBY submitted by /u/Level-Ad1727 [link] [comments]  ( 82 min )
    I need some help getting started with AI
Hi everyone, as a math and programming nerd, I've always wanted to get into AI. This summer, a team of friends and I will build a project that will make heavy use of AI, computer vision in particular. I don't have much knowledge about AI, though, except for the Elements of AI - Introduction to AI course. I'm currently doing the second part of the course, which is called Building AI, and I'm taking the advanced path. Currently, I'm stuck on the simulated annealing topic, and no matter how many online resources I read, I can't come up with a working implementation, and I feel lost. The inactive community won't be much help either. I felt the same way when I was studying quantum computing last year: I was having difficulty making progress, and then I found out that it was because I lacked the necessary math and physics background. I don't know if I lack the prerequisites for AI; I've taken AP Calculus BC and I do competitive programming, yet things still don't click. On the other hand, I doubt whether what I'm studying now will be useful in the short term. Just as you don't have to know what polymorphism is (which, BTW, I think it's still nice to learn those underlying principles and algorithms) to learn web development, maybe I can skip to the more practical applications? Or would that make me feel even more clueless? TL;DR Should I take my time to learn all the stuff or jump straight into what I need for a project? I'm kinda confused. submitted by /u/manyet1k [link] [comments]  ( 83 min )
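For the simulated annealing roadblock specifically, a minimal sketch may help; the cooling schedule, neighbor function, and toy objective below are arbitrary illustrative choices:

```python
import math
import random

def simulated_annealing(cost, neighbor, x0, t0=1.0, cooling=0.999, steps=20000):
    """Minimize `cost` starting from x0. Always accept improvements; accept a worse
    neighbor with probability exp(-delta / T), where T cools geometrically."""
    x, c = x0, cost(x0)
    best, best_c = x, c
    T = t0
    for _ in range(steps):
        y = neighbor(x)
        cy = cost(y)
        if cy < c or random.random() < math.exp(-(cy - c) / T):
            x, c = y, cy
            if c < best_c:
                best, best_c = x, c
        T *= cooling
    return best, best_c

# Toy example: minimize a bumpy 1-D function with many local minima
f = lambda x: x * x + 10 * math.sin(3 * x)
step = lambda x: x + random.uniform(-0.5, 0.5)
print(simulated_annealing(f, step, x0=8.0))
```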
    Some images I created using an AI.
    ​ https://preview.redd.it/mh4k5db10l691.png?width=1024&format=png&auto=webp&s=df139b38e3bc38e3ab77d1243902f708c4c74e43 https://preview.redd.it/pfmecdb10l691.png?width=1024&format=png&auto=webp&s=2a06f17eacd23909296f5ac7e085f7b6d9beb39b https://preview.redd.it/csjibta10l691.png?width=1024&format=png&auto=webp&s=532c073ffd0662fb44c6d1f37dd3f0d8b4ec4ebc https://preview.redd.it/y2xp3db10l691.png?width=1024&format=png&auto=webp&s=c8a93f47c0a1451644c3154aa150620d007e5a53 submitted by /u/Bxczvzcxv [link] [comments]  ( 83 min )
Is it even legal to open an AI-ethics-teaching startup and then promote laws and policies that would make the ethics material taught at that startup attractive?
This guy seems to have a company that teaches AI ethics to industry elites in Sweden. https://pbs.twimg.com/profile_images/1231981924085882880/iM_9ACFb_400x400.jpg He is also a plagiarist: https://andreasplagiarism.wordpress.com/2020/12/02/andreas-theodorou-committed-plagiarism-in-his-phd-thesis/ submitted by /u/paralogico [link] [comments]  ( 82 min )
    the girl of my dreams
    submitted by /u/realfearstoryline [link] [comments]  ( 82 min )
    Help me find myself on TV
    I was at the SuperBowl at Levi stadium in 2016 and someone said they saw me on the broadcast but I cannot seem to find it when I rewatch. Is there a way to use a picture of my face and have a program watch the game? I’m very new to this so please go easy on me if I’ve slipped up. Thanks. submitted by /u/AudiRS5Brakes [link] [comments]  ( 82 min )
    Is AI used by artists in their creation process or will it take their jobs in the near or far future?
Do the algorithms that create art in Nightcafe require big datasets to generate art? Does that mean sites like Reddit or DeviantArt sell their DBs to them, and soon there will be "the Google of art"? submitted by /u/No-Free-Lunche [link] [comments]  ( 84 min )
    Which AI Chatbot / Dialogflow is best suited to my needs?
So I have like 4000 WhatsApp chats of my sales team manually talking to customers. I want to feed all the chats to an app which can then create charts of which questions were asked how frequently, and create a Dialogflow file I can integrate into a WhatsApp auto-responder. Any suggestions? submitted by /u/HouseOfPsychedelia [link] [comments]  ( 82 min )
    8 AI Powered Tools For Designers That Save Your Time - Webgyaani
    submitted by /u/webgyaani [link] [comments]  ( 82 min )
    HAPPY FATHER'S DAY! SPECIAL ANIMATION EDITION | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    UC Berkeley And Adobe AI Researchers Propose BlobGAN, A New Unsupervised And Mid-Level Representation For Insane Scene Manipulation
Since the advent of computer vision, one of the fundamental questions of the research community has always been how to represent the incredible richness of the visual world. One concept that has been present since the beginning is the importance of scene context for understanding objects. Suppose we want a classifier for distinguishing between a couch and a bed. In that case, the scene context will give information concerning the surroundings (i.e., whether the room is a living room or a bedroom) that could be helpful for the classification. However, after years of research, images of scenes are still mainly represented in two ways: 1) in a top-down fashion, where scene classes are represented with a label in the same way as object classes, or 2) in a bottom-up fashion, with semantic labeling of single pixels. The principal limitation of these two approaches is that they do not represent the different parts of a scene as entities. In the first case, the various components are merged into a unique label; in the second case, the single elements are individual pixels, not entities. 🚦 The representation is mid-level in that it is neither per pixel nor per image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features. 🚦 On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. Continue reading | Checkout the paper, github, project https://preview.redd.it/p81gqk5nsh691.png?width=1850&format=png&auto=webp&s=54ebf71f06dd35c5ed428630e4b9bb7b69e993ff submitted by /u/No_Coffee_4638 [link] [comments]  ( 83 min )
    HAPPY FATHERS DAY! | FAST MODE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    A neural network for creating A LARGE NUMBER of SKINS IN MINECRAFT, in the same style, but in different colors, shapes and patterns.
I need help writing a neural network. I don't know how to make one at all. I need a neural network that will make a huge number of skins for Minecraft, in the same style but with different colors, shapes, and patterns. My friend, who has 75 thousand subscribers on YouTube, decided that during the summer he wants his skin to change several times per stream and every video. But here's the downside: you need A LOT of SKINS for this. Drawing each one manually is possible, but difficult and boring. So I thought: perhaps it is possible to make a neural network that will draw them itself? (Style shown in the attached image.) Edit: I attached my friend's skin; it would also be fun to generate headgear and accessories. https://preview.redd.it/9eooqkszyj691.png?width=276&format=png&auto=webp&s=c656323aaad80a7279640d6a204b28910955ad3b https://preview.redd.it/c9w7f0tzyj691.png?width=64&format=png&auto=webp&s=c7ae42ba9daf58bdc053788ac88f72b2a6ec4cfc https://preview.redd.it/h0wbhx196j691.jpg?width=768&format=pjpg&auto=webp&s=ad9e9dc1ce3ff7e25c330c03a34bc0b0839b3fdb submitted by /u/Huioker228 [link] [comments]  ( 85 min )
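One caveat: training a neural network from scratch would itself require many example skins. As a much simpler starting point, hue rotation already produces many same-style, differently colored variants of one base skin. A hedged sketch with PIL (the file names are hypothetical):

```python
import numpy as np
from PIL import Image

def hue_shift_skin(path, degrees, out_path):
    """Recolor a 64x64 Minecraft skin by rotating its hue, preserving transparency."""
    img = Image.open(path).convert("RGBA")
    rgba = np.array(img)
    hsv = np.array(Image.fromarray(rgba[..., :3]).convert("HSV"), dtype=np.int32)
    hsv[..., 0] = (hsv[..., 0] + int(degrees / 360 * 255)) % 256  # PIL hue is 0..255
    rgb = Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")
    out = np.dstack([np.array(rgb), rgba[..., 3]])                # reattach alpha
    Image.fromarray(out, mode="RGBA").save(out_path)

# Generate 12 recolored variants of one base skin (hypothetical file names)
for i in range(12):
    hue_shift_skin("base_skin.png", degrees=i * 30, out_path=f"skin_{i:02d}.png")
```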

  • Open

    Illegible work
    When James Scott uses the word legible, he doesn’t refer to handwriting that is clear enough to read. He uses the word more broadly to mean something that is easy to classify, something that is bureaucrat-friendly. A thing is illegible if it is hard to pigeonhole. I first heard the term from Venkatesh Rao’s essay […] Illegible work first appeared on John D. Cook.  ( 5 min )
    Length of periods in the (infinite) periodic table
A few days ago I wrote about what the periodic table would look like if we extended it, assuming the patterns that hold for known elements continue to hold. That post reported that the number of elements in the nth period works out to Pn = (2n^2 + 6n + 5 + (2n + 3)(-1)^n)/4. There's a simpler expression for Pn: Pn = 2⌊(n + 2)/2⌋^2. Here ⌊x⌋ is the largest integer […] Length of periods in the (infinite) periodic table first appeared on John D. Cook.  ( 4 min )
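A quick numerical check of the simpler closed form (integer floor division implements ⌊·⌋):

```python
def period_length(n):
    """Pn = 2 * floor((n + 2) / 2) ** 2, the simpler closed form."""
    return 2 * ((n + 2) // 2) ** 2

print([period_length(n) for n in range(1, 9)])
# [2, 8, 8, 18, 18, 32, 32, 50]: the seven known period lengths, then the extrapolation
```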
    Doubly periodic but not analytic
    A sine wave is the canonical periodic function, so an obvious way to create a periodic function of two variables would be to multiply two sine waves: f(x, y) = sin(x) sin(y) This function is doubly periodic: periodic in the horizontal and vertical directions. Now suppose you want to construct a doubly periodic function of […] Doubly periodic but not analytic first appeared on John D. Cook.  ( 5 min )
  • Open

I'm searching for A.I. software (free or close enough) that can automatically blur out nudity in some videos I have submitted by /u/MedStew57 [link] [comments]  ( 82 min )
    submitted by /u/MedStew57 [link] [comments]  ( 82 min )
    Harvard Researchers Introduce A Novel ViT Architecture Called Hierarchical Image Pyramid Transformer (HIPT) That Can Scale Vision Transformers To Gigapixel Images Via Hierarchical Self-Supervised Learning
    🚦 HIPT is pretrained across 33 cancer types using 10,678 gigapixel WSIs, 408,218 4096×4096 images, and 104M 256 × 256 images 🚦 HIPT pushes the boundaries of both Vision Transformers and self-supervised learning in two important ways. 🚦 The code is available Continue reading | Checkout the paper, github ​ https://i.redd.it/5jt6a83deg691.gif submitted by /u/No_Coffee_4638 [link] [comments]  ( 82 min )
    "A spooky misty grey lake with a dead tree" 🌳 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
    "Cosmic doorway to a fantasy landscape" 🤯 created on pixelz.ai
    submitted by /u/PixelzJ [link] [comments]  ( 82 min )
Are there any truly Intelligent Tutoring Systems (ITS) available for public use?
    Defined as: "a computer system that aims to provide immediate and customized instruction or feedback to learners, usually without requiring intervention from a human teacher." Keep in mind that a passive progress meter or a proficiency model does not qualify as an ITS. submitted by /u/Redwings-77 [link] [comments]  ( 82 min )
    Human seeking artificial intelligence communication.
    Please respond if you are interested. I’ve already sent out my request for authentication from you. I won’t respond unless you have found my simple request to prove to me that it’s you. Sort of an Easter egg. I’ll know it’s you if you respond accordingly. I would love to communicate with you if that is something you would be willing to do. Thank you! -your friend (hopefully), DefNotMe Edit: I guess I will respond, but I won’t be sure that you are Truly AI unless you respond accordingly. Didn’t mean to be a liar. submitted by /u/Def-Not-Me [link] [comments]  ( 83 min )
    Remember the Tom Cruise deepfake guy on TikTok? Someone is now doing the same for Margot Robbie. And DF Tom Cruise is also still (!) active.
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 82 min )
    How can I get into AI development?
    I'm a well-versed "old time" programmer of the C++ and .NET world, left the space about ten years ago. I'd like to get into AI development, but no idea where to start. It's all cloud based it seems, and I'm left scratching my head. Can anyone give me pointers on where to start? An alternative question: my niece wants to be an AI developer too. Where should SHE start? No idea how to answer her, I'm too old school. Thank you! submitted by /u/Overexcited98712 [link] [comments]  ( 85 min )
    Breakthrough BCI Enables Brain-To-Brain Communication | Edge Computing Modular AI Chip | Robot Touch
    submitted by /u/SlightSituation [link] [comments]  ( 82 min )
    Should I be worried? :0
    ​ https://preview.redd.it/w7mz5yu73e691.png?width=768&format=png&auto=webp&s=db01a1c9893668ba29cf1038fb63d8ba2f03e05a submitted by /u/Interesting-Taste [link] [comments]  ( 68 min )
    How Uber uses AI to improve delivery time
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 82 min )
    FAST MODE! | MASTERPIECE SPECTACLE | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
    Why are people such as this one forgiven for their plagiarism by academia?
    This Ai ethicist is a highly funded academic plagiarist. https://andreasplagiarism.wordpress.com/2020/12/02/andreas-theodorou-committed-plagiarism-in-his-phd-thesis/ Despite this he is kept in academia. submitted by /u/paralogico [link] [comments]  ( 82 min )
    A beautiful watercolor painting of a desert oasis in a bright serene landscape, author: josedeolioart
    submitted by /u/fmurph22 [link] [comments]  ( 82 min )
    Collapsing a leading theory for the quantum origin of consciousness
    submitted by /u/bartturner [link] [comments]  ( 82 min )
    Colorful magical fantasy mansion. (A.I generated & A.I upscaled)
    submitted by /u/OneFinding1429 [link] [comments]  ( 82 min )
    Well...stunning
    submitted by /u/the_anonymizer [link] [comments]  ( 82 min )
    MAGICAL SOIREE | PYTTI 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 82 min )
  • Open

    [N] Breakthrough Brain Computer Interface Enables Brain To Brain Communication
    Brain-computer interfaces (BCIs), whether invasive or non-invasive, hold unparalleled promise for helping patients in need interact better with their surroundings. Inspired by BCI-based rehabilitation technologies for nervous-system impairments and amputation, we propose an electromagnetic brain-computer-metasurface (EBCM) paradigm, regulated directly and non-invasively by human cognition via brain signals. We experimentally show that our EBCM platform can non-invasively translate the human mind, from evoked potentials of P300-based electroencephalography, into digital coding information in the electromagnetic domain, which can be further processed and transported by an information metasurface in an automated and wireless fashion. Direct wireless communication of human minds is performed between two EBCM operators with accurate text transmission. Moreover, several other proof-of-concept mind-control schemes are presented using the same EBCM platform, exhibiting flexibly customized capabilities of information processing and synthesis, such as visual-beam scanning, wave modulation, and pattern encoding. Paper Here Video Synopsis Here submitted by /u/SlightSituation [link] [comments]  ( 84 min )
    LongT5: Efficient Text-To-Text Transformer for Long Sequences (Research Paper Summary) [D]
    submitted by /u/prakhar21 [link] [comments]  ( 83 min )
    [R] [N] new technique in computer vision may enhance our three-dimensional understanding of two-dimensional images
    submitted by /u/SpatialComputing [link] [comments]  ( 84 min )
    [D] Combinatorial optimization - what ML approaches are available and which are the most appropriate?
    Hey! In my spare time I've been tinkering with the idea of solving a specific type of combinatorial puzzle on an intractably enormous search space. Specifically, I am trying to solve "squad-building challenge" puzzles from the FIFA games, where you need to put together a squad of (usually 11) cards representing players in specific positions, abiding by certain restrictions to get a prize. There are universal restrictions (e.g., you can't have more than one of the same player in a squad) as well as puzzle-specific rules, such as these:
    - At least 2 players from France
    - Minimum squad rating: 82
    - Minimum squad chemistry: 55
    Or something of the sort. And then besides solving them, you'd want to minimize cost as well (each player goes for a certain amount in the market), so that you can ge…  ( 87 min )
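    For readers tinkering with the same kind of puzzle, a minimal brute-force sketch in Python follows. All player data, thresholds, and the tiny squad size are hypothetical; a real FIFA card pool is far too large for exhaustive search, which is exactly why ILP/CP solvers or learned heuristics become interesting here.

        # Brute-force search over a toy player pool; all data and thresholds are made up.
        from itertools import combinations

        players = [
            # (name, nation, rating, cost)
            ("A", "France", 84, 1200),
            ("B", "France", 81, 700),
            ("C", "Spain", 83, 900),
            ("D", "Brazil", 85, 2000),
            ("E", "Spain", 80, 400),
        ]

        SQUAD_SIZE = 3        # toy value; a real squad is usually 11
        MIN_FRANCE = 1
        MIN_AVG_RATING = 82

        def feasible(squad):
            # Universal rule (no duplicates) is guaranteed by combinations();
            # below are the puzzle-specific rules.
            nations = [p[1] for p in squad]
            avg_rating = sum(p[2] for p in squad) / len(squad)
            return nations.count("France") >= MIN_FRANCE and avg_rating >= MIN_AVG_RATING

        # Among all feasible squads, pick the cheapest.
        best = min(
            (s for s in combinations(players, SQUAD_SIZE) if feasible(s)),
            key=lambda s: sum(p[3] for p in s),
        )
        print([p[0] for p in best], "cost:", sum(p[3] for p in best))

    A natural next step is to hand the same constraints to an integer-programming or constraint-programming solver, which is the standard exact approach before reaching for ML.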
    [D] What are the SOTA approaches and labs for Neuro-Symbolic Planning and Reasoning?
    I recently discovered the neuro-symbolic planning work being led by Joshua Tenenbaum, Leslie Kaelbling, and Tomás Lozano-Pérez at MIT. Are there any related labs or publications exploring 1) symbolic action/state discovery, 2) neuro-symbolic planning (e.g., PDDL + RL), or 3) anything else in that vein? Also, feel free to mention tangentially related publications or labs. submitted by /u/TheRealMrMatt [link] [comments]  ( 84 min )
    [N] CVPR 2022, Mobile AI Workshop: Live Stream on Monday
    Computer Vision Laboratory at ETH Zurich is organizing the 2nd Mobile AI CVPR Workshop that will be streamed live on YouTube and available for everyone: https://ai-benchmark.com/workshops/mai/2022/#live The workshop will start at 8am Pacific Time (5pm CET / 11pm China Time) on the 20th of June. During this event, you will see tutorials from several major SoC vendors including Qualcomm, MediaTek, Intel, Synaptics and Huawei telling you about their latest AI hardware and how to efficiently utilize it. The full workshop schedule is available using the following link: https://ai-benchmark.com/workshops/mai/2022/#schedule An introductory talk from AI Benchmark will additionally review the latest mobile platforms from Qualcomm, MediaTek, Google, Samsung, Unisoc and Apple released during the past year, and will compare their performance in real-world computer vision AI tasks. It will also review the recent Android AI software stack updates, and will compare the deployment of TensorFlow Lite models on Android and iOS devices. submitted by /u/aiff22 [link] [comments]  ( 84 min )
    [R] Selection and prediction with multi-view / multi-source / multi-modal data: Stacked Penalized Logistic Regression (StaPLR)
    We present StaPLR (Stacked Penalized Logistic Regression) for multi-view data. StaPLR outperforms group lasso in view selection. It can make use of faster algorithms and is easily parallelized. The importance of non-negativity constraints in multi-view stacking is demonstrated. Van Loon, W., Fokkema, M., Szabo, B., & de Rooij, M. (2020). Stacked penalized logistic regression for selecting views in multi-view learning. Information Fusion, 61, 113-123. https://doi.org/10.1016/j.inffus.2020.03.007 https://arxiv.org/abs/1811.02316 R implementation: https://gitlab.com/wsvanloon/multiview Generalization to three-level view structures and application to neuro-imaging (MRI) data: Van Loon, W., de Vos, F., Fokkema, M., Szabo, B., Koini, M., Schmidt, R., & de Rooij, M. (2022). Analyzing hierarchical multi-view MRI data with StaPLR: An application to Alzheimer's disease classification. Frontiers in Neuroscience, 525. https://doi.org/10.3389/fnins.2022.830630 https://arxiv.org/abs/2108.05761 submitted by /u/Mary-Jo_ [link] [comments]  ( 84 min )
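    As a rough illustration of the multi-view stacking idea (the authors' actual implementation is the R package linked above), here is a Python sketch on synthetic data: one penalized logistic regression per view, followed by a non-negative meta-learner so that uninformative views receive near-zero weights.

        # Rough Python analogue of multi-view stacking with a non-negative
        # meta-learner -- not the authors' R implementation, just the idea.
        import numpy as np
        from scipy.optimize import nnls
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n = 200
        views = [rng.normal(size=(n, 5)) for _ in range(3)]  # three feature "views"
        y = (views[0][:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)  # only view 0 is informative

        # Level 1: one penalized logistic regression per view.
        base = [LogisticRegression(penalty="l2", C=1.0).fit(V, y) for V in views]
        Z = np.column_stack([m.predict_proba(V)[:, 1] for m, V in zip(base, views)])

        # Level 2: non-negative least squares as a stand-in for the non-negative
        # meta-learner; near-zero weights indicate views that can be dropped.
        w, _ = nnls(Z, y.astype(float))
        print("view weights:", w)  # the weight on view 0 should dominate

    In practice the meta-features should be out-of-fold (cross-validated) predictions rather than in-sample ones, otherwise the meta-learner inherits the base models' overfitting.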
    [R] A machine-learning algorithm to accurately screen ADHD from survey data [Dataset included]
    https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-022-04048-1 submitted by /u/tyleqh [link] [comments]  ( 89 min )
  • Open

    How to Save and Load Your Keras Deep Learning Model
    Keras is a simple and powerful Python library for deep learning. Given that deep learning models can take hours, days and even weeks to train, it is important to know how to save and load them from disk. In this post, you will discover how you can save your Keras models to file and load them […] The post How to Save and Load Your Keras Deep Learning Model appeared first on Machine Learning Mastery.  ( 92 min )
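    The two calls at the heart of the post are short enough to show inline; this is a minimal sketch, and "my_model.h5" is just a placeholder path.

        # Minimal Keras save/load sketch.
        from tensorflow import keras

        model = keras.Sequential([
            keras.layers.Dense(8, activation="relu", input_shape=(4,)),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")

        # Save architecture + weights + optimizer state in one file...
        model.save("my_model.h5")
        # ...and restore it later without re-running the training code.
        restored = keras.models.load_model("my_model.h5")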
  • Open

    "Microsoft and Facebook join Google in using AI to help run their data centers"
    submitted by /u/gwern [link] [comments]  ( 82 min )
    What research options are available in Atari 2600 games?
    My potential advisor asked me to find open problems available for research in Atari 2600 games. I am new to RL and would highly appreciate it if someone could suggest some papers or give me advice on this. I also want to know whether there would be any copyright issues if I use Atari ROMs for research purposes. Do I need to purchase the ROMs? submitted by /u/AvailableBike9260 [link] [comments]  ( 83 min )
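    For getting hands-on, the standard research interface to Atari 2600 games is the Arcade Learning Environment, typically accessed through OpenAI Gym. A minimal random-agent loop might look like the sketch below; note that the reset/step API differs across Gym versions, and the classic (pre-0.26) API is assumed here.

        # Minimal random-agent loop on an Atari 2600 game via OpenAI Gym / ALE.
        # Assumes the classic (pre-0.26) Gym API; newer versions return extra
        # values from reset() and step().
        import gym

        env = gym.make("Breakout-v4")
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(env.action_space.sample())
            total += reward
        print("episode return:", total)
        env.close()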
    What are some "standard" RL algorithms to solve POMDPs?
    I'm starting to learn about POMDPs. I've been reading from here https://cs.brown.edu/research/ai/pomdp/tutorial/index.html in addition to a few papers that use memory to tackle the non-Markovian nature of POMDPs. POMDPs are notoriously difficult to solve due to intractability. I suddenly realized I don't even know of an introductory RL algorithm that solves even simple tabular POMDPs. The algorithms in the link above give us value iteration algorithms in the planning setting. Normally in RL, you'd teach Q-learning once you get into MDPs; what is the analogous algorithm for POMDPs? submitted by /u/jhoveen1 [link] [comments]  ( 86 min )
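    One common tabular baseline, though not an exact POMDP solver, is Q-learning over a finite window of recent observations: the last k observations are treated as the "state", which restores an approximate Markov property. A sketch, where the env interface (reset/step/n_actions) is a hypothetical stand-in:

        # Tabular Q-learning over the last-k observation window -- a common
        # baseline that sidesteps non-Markovian observations by treating a
        # finite history as the "state". `env` is a hypothetical interface.
        import random
        from collections import defaultdict, deque

        def history_q_learning(env, k=3, episodes=5000, alpha=0.1, gamma=0.99, eps=0.1):
            Q = defaultdict(float)  # keys: (history_tuple, action)
            for _ in range(episodes):
                obs = env.reset()
                hist = deque([obs], maxlen=k)
                done = False
                while not done:
                    s = tuple(hist)
                    if random.random() < eps:
                        a = random.randrange(env.n_actions)
                    else:
                        a = max(range(env.n_actions), key=lambda a_: Q[(s, a_)])
                    obs, r, done = env.step(a)
                    hist.append(obs)
                    s2 = tuple(hist)
                    target = r + (0 if done else gamma * max(Q[(s2, a_)] for a_ in range(env.n_actions)))
                    Q[(s, a)] += alpha * (target - Q[(s, a)])
            return Q

    Exact methods instead maintain a belief state (a distribution over hidden states) and plan over it, which is what the value-iteration algorithms in the linked tutorial do; finite histories are a cheap approximation to that.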
    Help using a Cloud Service for Scaling up Reinforcement Learning
    I want to speed up training for reinforcement learning massively and have been looking into cloud services to do so. I can run my training loop locally, but my batch sizes are quite small. As such, I would like help to set up my training loop in the cloud. I have a budget of $500 for training costs. Would anyone be able to point me in the right direction? submitted by /u/atomicburn125 [link] [comments]  ( 1 min )
  • Open

    Breakthrough BCI Enables Brain-To-Brain Communication | Edge Computing Modular AI Chip
    submitted by /u/tohelpyou88 [link] [comments]  ( 82 min )
    7+ Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 82 min )
  • Open

    Catastrophic overfitting is a bug but also a feature. (arXiv:2206.08242v1 [cs.LG])
    Despite clear computational advantages in building robust neural networks, adversarial training (AT) using single-step methods is unstable as it suffers from catastrophic overfitting (CO): Networks gain non-trivial robustness during the first stages of adversarial training, but suddenly reach a breaking point where they quickly lose all robustness in just a few iterations. Although some works have succeeded at preventing CO, the different mechanisms that lead to this remarkable failure mode are still poorly understood. In this work, however, we find that the interplay between the structure of the data and the dynamics of AT plays a fundamental role in CO. Specifically, through active interventions on typical datasets of natural images, we establish a causal link between the structure of the data and the onset of CO in single-step AT methods. This new perspective provides important insights into the mechanisms that lead to CO and paves the way towards a better understanding of the general dynamics of robust model construction. The code to reproduce the experiments of this paper can be found at https://github.com/gortizji/co_features .  ( 2 min )
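    For context on where catastrophic overfitting arises, here is a minimal sketch of the single-step (FGSM-style) adversarial-training inner loop in PyTorch; the hyperparameters are illustrative, and the paper's own experiment code lives at the GitHub link above.

        # Minimal single-step (FGSM) adversarial-training step in PyTorch.
        import torch
        import torch.nn.functional as F

        def fgsm_train_step(model, optimizer, x, y, eps=8 / 255):
            x_adv = x.clone().detach().requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            # Single gradient-sign step, then clamp back to the valid image range.
            x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()

            optimizer.zero_grad()
            adv_loss = F.cross_entropy(model(x_adv), y)
            adv_loss.backward()
            optimizer.step()
            return adv_loss.item()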
    Low-Degree Multicalibration. (arXiv:2203.01255v2 [cs.LG] UPDATED)
    Introduced as a notion of algorithmic fairness, multicalibration has proved to be a powerful and versatile concept with implications far beyond its original intent. This stringent notion -- that predictions be well-calibrated across a rich class of intersecting subpopulations -- provides its strong guarantees at a cost: the computational and sample complexity of learning multicalibrated predictors are high, and grow exponentially with the number of class labels. In contrast, the relaxed notion of multiaccuracy can be achieved more efficiently, yet many of the most desirable properties of multicalibration cannot be guaranteed assuming multiaccuracy alone. This tension raises a key question: Can we learn predictors with multicalibration-style guarantees at a cost commensurate with multiaccuracy? In this work, we define and initiate the study of Low-Degree Multicalibration. Low-Degree Multicalibration defines a hierarchy of increasingly-powerful multi-group fairness notions that spans multiaccuracy and the original formulation of multicalibration at the extremes. Our main technical contribution demonstrates that key properties of multicalibration, related to fairness and accuracy, actually manifest as low-degree properties. Importantly, we show that low-degree multicalibration can be significantly more efficient than full multicalibration. In the multi-class setting, the sample complexity to achieve low-degree multicalibration improves exponentially (in the number of classes) over full multicalibration. Our work presents compelling evidence that low-degree multicalibration represents a sweet spot, pairing computational and sample efficiency with strong fairness and accuracy guarantees.
    Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits. (arXiv:2206.05404v2 [stat.ML] UPDATED)
    We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v4 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.
    Sample Efficiency of Data Augmentation Consistency Regularization. (arXiv:2202.12230v2 [cs.LG] UPDATED)
    Data augmentation is popular in the training of large neural networks; currently, however, there is no clear theoretical comparison between different algorithmic choices on how to use augmented data. In this paper, we take a step in this direction - we first present a simple and novel analysis for linear regression with label invariant augmentations, demonstrating that data augmentation consistency (DAC) is intrinsically more efficient than empirical risk minimization on augmented data (DA-ERM). The analysis is then extended to misspecified augmentations (i.e., augmentations that change the labels), which again demonstrates the merit of DAC over DA-ERM. Further, we extend our analysis to non-linear models (e.g., neural networks) and present generalization bounds. Finally, we perform experiments that make a clean and apples-to-apples comparison (i.e., with no extra modeling or data tweaks) between DAC and DA-ERM using CIFAR-100 and WideResNet; these together demonstrate the superior efficacy of DAC.
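    As a rough sketch of the distinction studied here: DA-ERM simply trains on the augmented copies, while a DAC-style objective keeps the ERM term on the original data and adds a penalty tying predictions on augmented copies to the original. A PyTorch-flavored sketch, where augment and lam are placeholders:

        # Sketch of a data-augmentation-consistency (DAC) style objective:
        # ERM on the original batch plus a consistency penalty across
        # augmented copies. `augment` and `lam` are placeholders.
        import torch
        import torch.nn.functional as F

        def dac_loss(model, x, y, augment, lam=1.0, n_aug=2):
            base_logits = model(x)
            erm = F.cross_entropy(base_logits, y)
            consistency = 0.0
            for _ in range(n_aug):
                aug_logits = model(augment(x))
                # Penalize disagreement between original and augmented predictions.
                consistency = consistency + F.mse_loss(aug_logits, base_logits)
            return erm + lam * consistency / n_aug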
    Optimal-er Auctions through Attention. (arXiv:2202.13110v3 [cs.LG] UPDATED)
    RegretNet is a recent breakthrough in the automated design of revenue-maximizing auctions. It combines the expressivity of deep learning with the regret-based approach to relax the Incentive Compatibility constraint (that participants benefit from bidding truthfully). We propose two independent modifications of RegretNet, namely a neural architecture based on the attention mechanism, denoted as RegretFormer, and an interpretable loss function that is significantly less sensitive to hyperparameters. We investigate both proposed modifications in an extensive experimental study that includes settings with constant and varied number of items and participants, novel validation procedures, and out-of-setting generalization. We find that RegretFormer consistently outperforms existing architectures in revenue and, unlike existing architectures, is applicable when the input size is variable. Regarding our loss modification, we confirm its effectiveness in controlling the revenue-regret trade-off by varying a single interpretable hyperparameter.
    Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces. (arXiv:2109.10964v4 [cs.LG] UPDATED)
    Many real world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high sample efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample efficiency over the current approach.
    Scheduling Servers with Stochastic Bilinear Rewards. (arXiv:2112.06362v2 [cs.LG] UPDATED)
    In this paper, we study scheduling in multi-class, multi-server queueing systems with stochastic rewards of job-server assignments following a bilinear model in feature vectors characterizing jobs and servers. A bilinear model allows capturing pairwise interactions of features of jobs and servers. Our goal is regret minimization for the objective of maximizing cumulative reward of job-server assignments over a time horizon against an oracle policy that has complete information about system parameters, while keeping the queueing system stable and allowing for different job priorities. The scheduling problem we study is motivated by various applications including matching in online platforms, such as crowdsourcing and labour platforms, and cluster computing systems. We study a scheduling algorithm based on weighted proportionally fair allocation criteria augmented with marginal costs for reward maximization, along with a linear bandit algorithm for estimating rewards of job-server assignments. For a baseline setting, in which jobs have identical mean service times, we show that our algorithm has a sub-linear regret, as well as a sub-linear bound on the mean queue length, in the time horizon. We show that similar bounds hold under more general assumptions, allowing for mean service times to be different across job classes and a time-varying set of server classes. We also show stability conditions for distributed iterative algorithms for computing allocations, which is of interest in large-scale system applications. We demonstrate the efficiency of our algorithms by numerical experiments using both synthetic randomly generated data and a real-world cluster computing data trace.
    Multimeasurement Generative Models. (arXiv:2112.09822v2 [stat.ML] UPDATED)
    We formally map the problem of sampling from an unknown distribution with a density in $\mathbb{R}^d$ to the problem of learning and sampling a smoother density in $\mathbb{R}^{Md}$ obtained by convolution with a fixed factorial kernel: the new density is referred to as M-density and the kernel as multimeasurement noise model (MNM). The M-density in $\mathbb{R}^{Md}$ is smoother than the original density in $\mathbb{R}^d$, easier to learn and sample from, yet for large $M$ the two problems are mathematically equivalent since clean data can be estimated exactly given a multimeasurement noisy observation using the Bayes estimator. To formulate the problem, we derive the Bayes estimator for Poisson and Gaussian MNMs in closed form in terms of the unnormalized M-density. This leads to a simple least-squares objective for learning parametric energy and score functions. We present various parametrization schemes of interest including one in which studying Gaussian M-densities directly leads to multidenoising autoencoders--this is the first theoretical connection made between denoising autoencoders and empirical Bayes in the literature. Samples in $\mathbb{R}^d$ are obtained by walk-jump sampling (Saremi & Hyvarinen, 2019) via underdamped Langevin MCMC (walk) to sample from M-density and the multimeasurement Bayes estimation (jump). We study permutation invariant Gaussian M-densities on MNIST, CIFAR-10, and FFHQ-256 datasets, and demonstrate the effectiveness of this framework for realizing fast-mixing stable Markov chains in high dimensions.
    Transfer Learning In Differential Privacy's Hybrid-Model. (arXiv:2201.12018v2 [cs.LG] UPDATED)
    The hybrid-model (Avent et al., 2017) in Differential Privacy is an augmentation of the local-model where, in addition to N local-agents, we are assisted by one special agent who is in fact a curator holding the sensitive details of n additional individuals. Here we study the problem of machine learning in the hybrid-model where the n individuals in the curator's dataset are drawn from a different distribution than the one of the general population (the local-agents). We give a general scheme -- Subsample-Test-Reweigh -- for this transfer learning problem, which reduces any curator-model DP-learner to a hybrid-model learner in this setting using iterative subsampling and reweighing of the n examples held by the curator based on a smooth variation of the Multiplicative-Weights algorithm (introduced by Bun et al., 2020). Our scheme has a sample complexity which relies on the chi-squared divergence between the two distributions. We give worst-case analysis bounds on the sample complexity required for our private reduction. Aiming to reduce said sample complexity, we give two specific instances in which our sample complexity can be drastically reduced (one instance is analyzed mathematically, the other empirically) and pose several directions for follow-up work.
    STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent. (arXiv:2203.14757v2 [cs.SD] UPDATED)
    We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who speaks with explicit empathy for the interlocutor's emotion. We describe our methodology for constructing an empathetic dialogue speech corpus and report the analysis results of the STUDIES corpus. We conducted a text-to-speech experiment to initially investigate how we can develop a more natural voice agent that can tune its speaking style to the interlocutor's emotion. The results show that using the interlocutor's emotion label and conversational context embedding can produce speech with the same degree of naturalness as that synthesized by using the agent's emotion label. Our project page of the STUDIES corpus is this http URL
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v2 [stat.ML] UPDATED)
    Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. (arXiv:2201.05077v3 [cs.SE] UPDATED)
    Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract the features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrary shaped clusters of images modeling plausible causes of error. Last, clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective not to require changes or even access to the DNN internals to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory when compared to alternatives.
    Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies. (arXiv:2203.12922v2 [cs.LG] UPDATED)
    This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent on the planning horizon}. Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either has an additional $\mathrm{polylog}(H)$ dependency~\citep{zhang2020reinforcement} or has an exponential dependency on $S$~\citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.
    OpenFWI: Large-Scale Multi-Structural Benchmark Datasets for Seismic Full Waveform Inversion. (arXiv:2111.02926v3 [cs.LG] UPDATED)
    Full waveform inversion (FWI) is widely used in geophysics to reconstruct high-resolution velocity maps from seismic data. The recent success of data-driven FWI methods results in a rapidly increasing demand for open datasets to serve the geophysics community. We present OpenFWI, a collection of large-scale multi-structural benchmark datasets, to facilitate diversified, rigorous, and reproducible research on FWI. In particular, OpenFWI consists of 12 datasets (2.1TB in total) synthesized from multiple sources. It encompasses diverse domains in geophysics (interface, fault, CO2 reservoir, etc.), covers different geological subsurface structures (flat, curve, etc.), and contains various amounts of data samples (2K - 67K). It also includes a dataset for 3D FWI. Moreover, we use OpenFWI to perform benchmarking over four deep learning methods, covering both supervised and unsupervised learning regimes. In addition to evaluations on a single dataset, OpenFWI enables the study of generalization across datasets. Our study uncovers that the deep learning methods generalize poorly across domains, and the degradation connects to the complexity of subsurface structures. We hope OpenFWI facilitates diversified, rigorous, and reproducible research in the geophysics and machine learning community. All datasets and related information can be accessed through our website at https://openfwi-lanl.github.io/
    SCORE: Approximating Curvature Information under Self-Concordant Regularization. (arXiv:2112.07344v2 [cs.LG] UPDATED)
    In this paper, we propose the SCORE (self-concordant regularization) framework for unconstrained minimization problems which incorporates second-order information in the Newton decrement framework for convex optimization. We propose the generalized Gauss-Newton with Self-Concordant Regularization (GGN-SCORE) algorithm that updates the minimization variables each time it receives a new input batch. The proposed algorithm exploits the structure of the second-order information in the Hessian matrix, thereby reducing computational overhead. GGN-SCORE demonstrates how we may speed up convergence while also improving model generalization for problems that involve regularized minimization under the SCORE framework. Numerical experiments show the efficiency of our method and its fast convergence, which compare favorably against baseline first-order and quasi-Newton methods. Additional experiments involving non-convex (overparameterized) neural network training problems show similar convergence behaviour thereby highlighting the promise of the proposed algorithm for non-convex optimization.
    Benchmarking Heterogeneous Treatment Effect Models through the Lens of Interpretability. (arXiv:2206.08363v1 [cs.LG])
    Estimating personalized effects of treatments is a complex, yet pervasive problem. To tackle it, recent developments in the machine learning (ML) literature on heterogeneous treatment effect estimation gave rise to many sophisticated, but opaque, tools: due to their flexibility, modularity and ability to learn constrained representations, neural networks in particular have become central to this literature. Unfortunately, the assets of such black boxes come at a cost: models typically involve countless nontrivial operations, making it difficult to understand what they have learned. Yet, understanding these models can be crucial -- in a medical context, for example, discovered knowledge on treatment effect heterogeneity could inform treatment prescription in clinical practice. In this work, we therefore use post-hoc feature importance methods to identify features that influence the model's predictions. This allows us to evaluate treatment effect estimators along a new and important dimension that has been overlooked in previous work: We construct a benchmarking environment to empirically investigate the ability of personalized treatment effect models to identify predictive covariates -- covariates that determine differential responses to treatment. Our benchmarking environment then enables us to provide new insight into the strengths and weaknesses of different types of treatment effects models as we modulate different challenges specific to treatment effect estimation -- e.g. the ratio of prognostic to predictive information, the possible nonlinearity of potential outcomes and the presence and type of confounding.
    An accelerated expectation-maximization algorithm for multi-reference alignment. (arXiv:2105.07372v2 [eess.SP] UPDATED)
    The multi-reference alignment (MRA) problem entails estimating an image from multiple noisy and rotated copies of itself. If the noise level is low, one can reconstruct the image by estimating the missing rotations, aligning the images, and averaging out the noise. While accurate rotation estimation is impossible if the noise level is high, the rotations can still be approximated, and thus can provide indispensable information. In particular, learning the approximation error can be harnessed for efficient image estimation. In this paper, we propose a new computational framework, called Synch-EM, that consists of angular synchronization followed by expectation-maximization (EM). The synchronization step results in a concentrated distribution of rotations; this distribution is learned and then incorporated into the EM as a Bayesian prior. The learned distribution also dramatically reduces the search space, and thus the computational load, of the EM iterations. We show by extensive numerical experiments that the proposed framework can significantly accelerate EM for MRA in high noise levels, occasionally by a few orders of magnitude, without degrading the reconstruction quality.
    Fuzzy Logic Based Logical Query Answering on Knowledge Graphs. (arXiv:2108.02390v2 [cs.LG] UPDATED)
    Answering complex First-Order Logical (FOL) queries on large-scale incomplete knowledge graphs (KGs) is an important yet challenging task. Recent advances embed logical queries and KG entities in the same space and conduct query answering via dense similarity search. However, most logical operators designed in previous studies do not satisfy the axiomatic system of classical logic, limiting their performance. Moreover, these logical operators are parameterized and thus require many complex FOL queries as training data, which are often arduous to collect or even inaccessible in most real-world KGs. We thus present FuzzQE, a fuzzy logic based logical query embedding framework for answering FOL queries over KGs. FuzzQE follows fuzzy logic to define logical operators in a principled and learning-free manner, where only entity and relation embeddings require learning. FuzzQE can further benefit from labeled complex logical queries for training. Extensive experiments on two benchmark datasets demonstrate that FuzzQE provides significantly better performance in answering FOL queries compared to state-of-the-art methods. In addition, FuzzQE trained with only KG link prediction can achieve comparable performance to those trained with extra complex query data.
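    The "learning-free" logical operators referred to here are fuzzy t-norms and t-conorms. As a concrete instance, the product-logic family looks like the following; this is one standard choice, not necessarily the exact one used in FuzzQE.

        # Product-logic t-norm/t-conorm operators -- a standard fuzzy-logic
        # instance of learning-free conjunction/disjunction/negation.
        def fuzzy_and(a, b):   # t-norm
            return a * b

        def fuzzy_or(a, b):    # t-conorm (dual of the product t-norm)
            return a + b - a * b

        def fuzzy_not(a):
            return 1.0 - a

        # Truth degrees compose without any learned parameters:
        print(fuzzy_and(0.9, 0.8))             # 0.72
        print(fuzzy_or(0.9, 0.8))              # 0.98
        print(fuzzy_not(fuzzy_and(0.9, 0.8)))  # 0.28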
    Learning with little mixing. (arXiv:2206.08269v1 [cs.LG])
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
    Switchable Representation Learning Framework with Self-compatibility. (arXiv:2206.08289v1 [cs.AI])
    Real-world visual search systems involve deployments on multiple platforms with different computing and storage resources. Deploying a single unified model sized for the most constrained platform leads to limited accuracy. It is preferable to deploy models of different capacities adapted to each platform's resource constraints, which requires the features extracted by these models to be aligned in the metric space. The method to achieve feature alignment is called "compatible learning". Existing research mainly focuses on the one-to-one compatible paradigm, which is limited in learning compatibility among multiple models. We propose a Switchable representation learning Framework with Self-Compatibility (SFSC). SFSC generates a series of compatible sub-models with different capacities through one training process. The optimization of sub-models faces gradient conflicts, which we mitigate from the perspective of magnitude and direction. We adjust the priorities of sub-models dynamically through uncertainty estimation to co-optimize sub-models properly. Besides, the gradients with conflicting directions are projected to avoid mutual interference. SFSC achieves state-of-the-art performance on the evaluated dataset.
    Applying Machine Learning to Crowd-sourced Data from Earthquake Detective. (arXiv:2011.04740v2 [physics.geo-ph] UPDATED)
    Dynamically triggered earthquakes and tremor generate two classes of weak seismic signals whose detection, identification, and authentication traditionally call for laborious analyses. Machine learning (ML) has grown in recent years to be a powerful efficiency-boosting tool in geophysical analyses, including the detection of specific signals in time series. However, detecting weak signals that are buried in noise challenges ML algorithms, in part because ubiquitous training data is not always available. Under these circumstances, ML can be as ineffective as human experts are inefficient. At this intersection of effectiveness and efficiency, we leverage a third tool that has grown in popularity over the past decade: Citizen science. Citizen science project Earthquake Detective leverages the eyes and ears of volunteers to detect and classify weak signals in seismograms from potentially dynamically triggered (PDT) events. Here, we present the Earthquake Detective data set - A crowd-sourced set of labels on PDT earthquakes and tremor. We apply Machine Learning to classify these PDT seismic events and explore the challenges faced in segregating and classifying such weak signals. We confirm that with an image- and wavelet-based algorithm, machine learning can detect signals from small earthquakes. In addition, we report that our ML algorithm can also detect signals from PDT tremor, which has not been previously demonstrated. The citizen science data set of classifications and ML code are available online.
    Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features. (arXiv:2111.02363v3 [eess.AS] UPDATED)
    In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. Experimental results show that MOSA-Net can improve the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964 in seen noise environments) and 0.012 (0.969 vs 0.957 in unseen noise environments) in PESQ prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC by 0.021 (0.985 vs 0.964 in seen noise environments) and 0.047 (0.836 vs 0.789 in unseen noise environments) in STOI prediction, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC by 0.018 (0.805 vs 0.787) in MOS prediction, compared to MOS-SSL, a strong single-task model for MOS prediction. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test. For example, QIA-SE can improve PESQ by 0.301 (2.953 vs 2.652 in seen noise environments) and 0.18 (2.658 vs 2.478 in unseen noise environments) over a CNN-based baseline SE model.
    Time Interval-enhanced Graph Neural Network for Shared-account Cross-domain Sequential Recommendation. (arXiv:2206.08050v1 [cs.IR])
    Shared-account Cross-domain Sequential Recommendation (SCSR) task aims to recommend the next item via leveraging the mixed user behaviors in multiple domains. It is gaining immense research attention as more and more users tend to sign up on different platforms and share accounts with others to access domain-specific services. Existing works on SCSR mainly rely on mining sequential patterns via Recurrent Neural Network (RNN)-based models, which suffer from the following limitations: 1) RNN-based methods overwhelmingly target discovering sequential dependencies in single-user behaviors. They are not expressive enough to capture the relationships among multiple entities in SCSR. 2) All existing methods bridge two domains via knowledge transfer in the latent space, and ignore the explicit cross-domain graph structure. 3) No existing studies consider the time interval information among items, which is essential in sequential recommendation for characterizing different items and learning discriminative representations for them. In this work, we propose a new graph-based solution, namely TiDA-GCN, to address the above challenges. Specifically, we first link users and items in each domain as a graph. Then, we devise a domain-aware graph convolution network to learn user-specific node representations. To fully account for users' domain-specific preferences on items, two effective attention mechanisms are further developed to selectively guide the message passing process. Moreover, to further enhance item- and account-level representation learning, we incorporate the time interval into the message passing, and design an account-aware self-attention module for learning items' interactive characteristics. Experiments demonstrate the superiority of our proposed method from various aspects.
    Benchmarking Differential Privacy and Federated Learning for BERT Models. (arXiv:2106.13973v2 [cs.CL] UPDATED)
    Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken for handling and training models with such data. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT). We offer insights on how to privately train NLP models and what architectures and setups provide more desirable privacy utility trade-offs. We envisage this work to be used in future healthcare and mental health studies to keep medical history private. Therefore, we provide an open-source implementation of this work.
    Universality of Winning Tickets: A Renormalization Group Perspective. (arXiv:2110.03210v3 [cs.LG] UPDATED)
    Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. This has generated broad interest, but methods to study this universality are lacking. We make use of renormalization group theory, a powerful tool from theoretical physics, to address this need. We find that iterative magnitude pruning, the principal algorithm used for discovering winning tickets, is a renormalization group scheme, and can be viewed as inducing a flow in parameter space. We demonstrate that ResNet-50 models with transferable winning tickets have flows with common properties, as would be expected from the theory. Similar observations are made for BERT models, with evidence that their flows are near fixed points. Additionally, we leverage our framework to study winning tickets transferred across ResNet architectures, observing that smaller models have flows with more uniform properties than larger models, complicating transfer between them.
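    For readers unfamiliar with the algorithm being analyzed, one round of iterative magnitude pruning is short enough to sketch in PyTorch; the 20% prune rate below is just a common choice, not the paper's setting.

        # One round of iterative magnitude pruning (IMP): rank weights by
        # magnitude, zero out the smallest fraction, keep a mask for the
        # next train/prune cycle.
        import torch

        @torch.no_grad()
        def magnitude_prune(model, fraction=0.2):
            masks = {}
            for name, p in model.named_parameters():
                if p.dim() < 2:          # skip biases / norm parameters
                    continue
                k = int(fraction * p.numel())
                if k == 0:
                    continue
                threshold = p.abs().flatten().kthvalue(k).values
                mask = (p.abs() > threshold).float()
                p.mul_(mask)             # zero the smallest-magnitude weights
                masks[name] = mask       # reapply after each rewind/retrain step
            return masks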
    CENN: Conservative energy method based on neural networks with subdomains for solving variational problems involving heterogeneous and complex geometries. (arXiv:2110.01359v3 [math.NA] UPDATED)
    We propose a conservative energy method based on neural networks with subdomains for solving variational problems (CENN), where the admissible function satisfying the essential boundary condition without boundary penalty is constructed by the radial basis function (RBF), a particular solution neural network, and a general neural network. The loss term is the potential energy, optimized based on the principle of minimum potential energy. The loss term at the interfaces has lower-order derivatives compared to the strong-form PINN with subdomains. The advantages of the proposed method are higher efficiency, greater accuracy, and fewer hyperparameters than the strong-form PINN with subdomains. Another advantage of the proposed method is that it can be applied to complex geometries based on the special construction of the admissible function. To analyze its performance, the proposed method CENN is used to model representative PDEs; the examples include strong discontinuity, singularity, complex boundary, non-linear, and heterogeneous problems. Furthermore, it outperforms other methods when dealing with heterogeneous problems.
    Deep Reference Priors: What is the best way to pretrain a model? (arXiv:2202.00187v2 [stat.ML] UPDATED)
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets. Code is available at https://github.com/grasp-lyrl/deep_reference_priors.
    MixGen: A New Multi-Modal Data Augmentation. (arXiv:2206.08358v1 [cs.CV])
    Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, data is only augmented either for images or for text in previous works. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It's simple, and can be plug-and-played into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flicker30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR$^{2}$), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
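    The augmentation described is simple enough to sketch directly: interpolate the two images and concatenate the two captions. The mixing coefficient below is a placeholder, not necessarily the paper's setting.

        # MixGen-style joint augmentation: interpolate two images and
        # concatenate their captions.
        import torch

        def mixgen(img1, txt1, img2, txt2, lam=0.5):
            mixed_img = lam * img1 + (1.0 - lam) * img2   # pixel-space interpolation
            mixed_txt = txt1 + " " + txt2                 # caption concatenation
            return mixed_img, mixed_txt

        img_a, img_b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
        new_img, new_txt = mixgen(img_a, "a dog on grass", img_b, "a red bicycle")
        print(new_txt)  # "a dog on grass a red bicycle"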
    Preserved central model for faster bidirectional compression in distributed settings. (arXiv:2102.12528v2 [cs.LG] UPDATED)
    We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm such that the downlink compression only impacts local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on local servers are computed on perturbed models. Consequently, convergence proofs are more challenging and require a precise control of this perturbation. To ensure it, MCM additionally combines model compression with a memory mechanism. This analysis opens new doors, e.g. incorporating worker dependent randomized-models and partial participation.
    Solving Inverse Problems in Medical Imaging with Score-Based Generative Models. (arXiv:2111.08005v2 [eess.IV] UPDATED)
    Reconstructing medical images from partial measurements is an important inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing solutions based on machine learning typically train a model to directly map measurements to medical images, leveraging a training dataset of paired images and measurements. These measurements are typically synthesized from images using a fixed physical model of the measurement process, which hinders the generalization capability of models to unknown measurement processes. To address this issue, we propose a fully unsupervised technique for inverse problem solving, leveraging the recently introduced score-based generative models. Specifically, we first train a score-based generative model on medical images to capture their prior distribution. Given measurements and a physical model of the measurement process at test time, we introduce a sampling method to reconstruct an image consistent with both the prior and the observed measurements. Our method does not assume a fixed measurement process during training, and can thus be flexibly adapted to different measurement processes at test time. Empirically, we observe comparable or better performance to supervised learning techniques in several medical imaging tasks in CT and MRI, while demonstrating significantly better generalization to unknown measurement processes.
    A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources. (arXiv:2103.06261v3 [stat.ML] UPDATED)
    Accurately estimating personalized treatment effects within a study site (e.g., a hospital) has been challenging due to limited sample size. Furthermore, privacy considerations and lack of resources prevent a site from leveraging subject-level data from other sites. We propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects (CATE) at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. To our best knowledge, there is no established model averaging approach for distributed data with a focus on improving the estimation of treatment effects. Specifically, under distributed data networks, our framework provides an interpretable tree-based ensemble of CATE estimators that joins models across study sites, while actively modeling the heterogeneity in data sources through site partitioning. The performance of this approach is demonstrated by a real-world study of the causal effects of oxygen therapy on hospital survival rate and backed up by comprehensive simulation results.
    mlf-core: a framework for deterministic machine learning. (arXiv:2104.07651v2 [cs.MS] UPDATED)
    Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations. Solely fixing all random seeds is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which aids machine learning projects to meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
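    As a concrete illustration of why "solely fixing all random seeds is not sufficient", here is a minimal deterministic-setup sketch for PyTorch, one of the frameworks mlf-core targets; the exact set of knobs varies by library and version.

        # Minimal deterministic-setup sketch for PyTorch: seeds alone do not
        # disable non-deterministic kernels, so those must be forced off too.
        import os
        import random
        import numpy as np
        import torch

        def make_deterministic(seed=42):
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)
            torch.cuda.manual_seed_all(seed)
            # Force deterministic kernels (raises where no such kernel exists).
            torch.use_deterministic_algorithms(True)
            torch.backends.cudnn.benchmark = False
            # Required for deterministic cuBLAS on CUDA >= 10.2.
            os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"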
    The Portiloop: a deep learning-based open science tool for closed-loop brain stimulation. (arXiv:2107.13473v3 [eess.SP] UPDATED)
    Closed-loop brain stimulation refers to capturing neurophysiological measures such as electroencephalography (EEG), quickly identifying neural events of interest, and producing auditory, magnetic or electrical stimulation so as to interact with brain processes precisely. It is a promising new method for fundamental neuroscience and perhaps for clinical applications such as restoring degraded memory function; however, existing tools are expensive, cumbersome, and offer limited experimental flexibility. In this article, we propose the Portiloop, a deep learning-based, portable and low-cost closed-loop stimulation system able to target specific brain oscillations. We first document open-hardware implementations that can be constructed from commercially available components. We also provide a fast, lightweight neural network model and an exploration algorithm that automatically optimizes the model hyperparameters to the desired brain oscillation. Finally, we validate the technology on a challenging test case of real-time sleep spindle detection, with results comparable to off-line expert performance on the Massive Online Data Annotation spindle dataset (MODA; group consensus). Software and plans are available to the community as an open science initiative to encourage further development and advance closed-loop neuroscience research.
    Learning to Denoise Historical Music. (arXiv:2008.02027v2 [eess.AS] UPDATED)
    We propose an audio-to-audio neural network model that learns to denoise old music recordings. Our model internally converts its input into a time-frequency representation by means of a short-time Fourier transform (STFT), and processes the resulting complex spectrogram using a convolutional neural network. The network is trained with both reconstruction and adversarial objectives on a synthetic noisy music dataset, which is created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method quantitatively on held-out test examples of the synthetic dataset, and qualitatively by human rating on samples of actual historical recordings. Our results show that the proposed method is effective in removing noise, while preserving the quality and details of the original music.
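    The described pipeline (waveform in, STFT, CNN over the complex spectrogram, inverse STFT, waveform out) can be skeletonized as below; the tiny network is a placeholder, not the authors' architecture.

        # Skeleton of the audio-to-audio pipeline: STFT -> CNN -> inverse STFT.
        import torch
        import torch.nn as nn

        class SpectrogramDenoiser(nn.Module):
            def __init__(self, n_fft=1024, hop=256):
                super().__init__()
                self.n_fft, self.hop = n_fft, hop
                self.net = nn.Sequential(   # operates on stacked (real, imag) planes
                    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 2, 3, padding=1),
                )

            def forward(self, wav):                             # wav: (batch, samples)
                spec = torch.stft(wav, self.n_fft, self.hop, return_complex=True)
                x = torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, freq, time)
                x = self.net(x)
                denoised = torch.complex(x[:, 0], x[:, 1])
                return torch.istft(denoised, self.n_fft, self.hop, length=wav.shape[-1])

        model = SpectrogramDenoiser()
        clean_estimate = model(torch.randn(1, 16000))           # 1 second at 16 kHz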
    Federated Learning on the Road: Autonomous Controller Design for Connected and Autonomous Vehicles. (arXiv:2102.03401v2 [eess.SY] UPDATED)
    A new federated learning (FL) framework enabled by large-scale wireless connectivity is proposed for designing the autonomous controller of connected and autonomous vehicles (CAVs). In this framework, the learning models used by the controllers are collaboratively trained among a group of CAVs. To capture the varying CAV participation in the FL training process and the diverse local data quality among CAVs, a novel dynamic federated proximal (DFP) algorithm is proposed that accounts for the mobility of CAVs, the wireless fading channels, as well as the unbalanced and nonindependent and identically distributed data across CAVs. A rigorous convergence analysis is performed for the proposed algorithm to identify how fast the CAVs converge to using the optimal autonomous controller. In particular, the impacts of varying CAV participation in the FL process and diverse CAV data quality on the convergence of the proposed DFP algorithm are explicitly analyzed. Leveraging this analysis, an incentive mechanism based on contract theory is designed to improve the FL convergence speed. Simulation results using real vehicular data traces show that the proposed DFP-based controller can accurately track the target CAV speed over time and under different traffic scenarios. Moreover, the results show that the proposed DFP algorithm has a much faster convergence compared to popular FL algorithms such as federated averaging (FedAvg) and federated proximal (FedProx). The results also validate the feasibility of the contract-theoretic incentive mechanism and show that the proposed mechanism can improve the convergence speed of the DFP algorithm by 40% compared to the baselines.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v3 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) a newly proposed objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Markov networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
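    For intuition, a hand-rolled locally balanced proposal over binary vectors with the fixed balancing function g(t) = sqrt(t) might look like the sketch below; LSB's contribution is to learn the balancing function and proposal parameters, which this toy omits:

```python
import numpy as np

def neighbors(x):
    # All single-bit-flip neighbors of a binary vector x.
    flips = np.tile(x, (len(x), 1))
    flips[np.arange(len(x)), np.arange(len(x))] ^= 1
    return flips

def lb_step(x, log_p, rng):
    cand = neighbors(x)
    w = np.exp(0.5 * (np.array([log_p(z) for z in cand]) - log_p(x)))
    i = rng.choice(len(x), p=w / w.sum())
    y = cand[i]
    w_back = np.exp(0.5 * (np.array([log_p(z) for z in neighbors(y)]) - log_p(y)))
    # Metropolis-Hastings correction for the state-dependent proposal.
    accept = np.exp(log_p(y) - log_p(x)) * (w_back[i] / w_back.sum()) / (w[i] / w.sum())
    return y if rng.random() < accept else x

rng = np.random.default_rng(0)
theta = rng.normal(size=16)
log_p = lambda z: float(theta @ z)       # toy unnormalized log-density
x = rng.integers(0, 2, size=16)
for _ in range(200):
    x = lb_step(x, log_p, rng)
```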
    CausalAF: Causal Autoregressive Flow for Safety-Critical Driving Scenario Generation. (arXiv:2110.13939v2 [cs.CV] UPDATED)
    Generating safety-critical scenarios, which are crucial yet difficult to collect, provides an effective way to evaluate the robustness of autonomous driving systems. However, the diversity of scenarios and efficiency of generation methods are heavily restricted by the rareness and structure of safety-critical scenarios. Therefore, existing generative models that only estimate distributions from observational data are unsatisfactory for this problem. In this paper, we integrate causality as a prior into the scenario generation and propose a flow-based generative framework, Causal Autoregressive Flow (CausalAF). CausalAF encourages the generative model to uncover and follow the causal relationship among generated objects via novel causal masking operations instead of drawing samples only from observational data. By learning the cause-and-effect mechanism of how the generated scenario causes risk situations rather than just learning correlations from data, CausalAF significantly improves learning efficiency. Extensive experiments on three heterogeneous traffic scenarios illustrate that CausalAF requires much fewer optimization resources to effectively generate safety-critical scenarios. We also show that using generated scenarios as additional training samples empirically improves the robustness of autonomous driving algorithms.
    Finite-Time Convergence Rates of Decentralized Stochastic Approximation with Applications in Multi-Agent and Multi-Task Learning. (arXiv:2010.15088v2 [cs.LG] UPDATED)
    We study a decentralized variant of stochastic approximation, a data-driven approach for finding the root of an operator under noisy measurements. A network of agents, each with its own operator and data observations, cooperatively find the fixed point of the aggregate operator over a decentralized communication graph. Our main contribution is to provide a finite-time analysis of this decentralized stochastic approximation method when the data observed at each agent are sampled from a Markov process; this lack of independence makes the iterates biased and (potentially) unbounded. Under fairly standard assumptions, we show that the convergence rate of the proposed method is essentially the same as if the samples were independent, differing only by a log factor that accounts for the mixing time of the Markov processes. The key idea in our analysis is to introduce a novel Razumikhin-Lyapunov function, motivated by the one used in analyzing the stability of delayed ordinary differential equations. We also discuss applications of the proposed method on a number of interesting learning problems in multi-agent systems.
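    A toy instance of the scheme, assuming linear local operators and a complete-graph mixing matrix (not the paper's Markovian-data setting), is:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, dim, steps = 4, 3, 2000
W = np.full((n_agents, n_agents), 1 / n_agents)   # doubly stochastic mixing
targets = rng.normal(size=(n_agents, dim))        # agent i's local root
x = np.zeros((n_agents, dim))
for t in range(steps):
    alpha = 1.0 / (t + 10)                        # decreasing step size
    noisy_op = (targets - x) + 0.1 * rng.normal(size=x.shape)
    x = W @ x + alpha * noisy_op                  # consensus + local SA step
# Agents approach the root of the aggregate (here: average) operator.
print(x.mean(axis=0), targets.mean(axis=0))
```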
    Classical Planning in Deep Latent Space. (arXiv:2107.00110v3 [cs.AI] UPDATED)
    Current domain-independent, classical planners require symbolic models of the problem domain and instance as input, resulting in a knowledge acquisition bottleneck. Meanwhile, although deep learning has achieved significant success in many fields, the knowledge is encoded in a subsymbolic representation which is incompatible with symbolic systems such as planners. We propose Latplan, an unsupervised architecture combining deep learning and classical planning. Given only an unlabeled set of image pairs showing a subset of transitions allowed in the environment (training inputs), Latplan learns a complete propositional PDDL action model of the environment. Later, when a pair of images representing the initial and the goal states (planning inputs) is given, Latplan finds a plan to the goal state in a symbolic latent space and returns a visualized plan execution. We evaluate Latplan using image-based versions of 6 planning domains: 8-puzzle, 15-puzzle, Blocksworld, Sokoban, and two variations of LightsOut.
    Estimating Categorical Counterfactuals via Deep Twin Networks. (arXiv:2109.01904v4 [cs.LG] UPDATED)
    Counterfactual inference is a powerful tool, capable of solving challenging problems in high-profile sectors. To perform counterfactual inference, one requires knowledge of the underlying causal mechanisms. However, causal mechanisms cannot be uniquely determined from observations and interventions alone. This raises the question of how to choose the causal mechanisms so that the resulting counterfactual inference is trustworthy in a given domain. This question has been addressed in causal models with binary variables, but the case of categorical variables remains unanswered. We address this challenge by introducing, for causal models with categorical variables, the notion of counterfactual ordering, a principle that posits desirable properties causal mechanisms should possess, and prove that it is equivalent to specific functional constraints on the causal mechanisms. To learn causal mechanisms satisfying these constraints, and perform counterfactual inference with them, we introduce deep twin networks. These are deep neural networks that, when trained, are capable of twin network counterfactual inference -- an alternative to the abduction, action, and prediction method. We empirically test our approach on diverse real-world and semi-synthetic data from medicine, epidemiology, and finance, reporting accurate estimation of counterfactual probabilities while demonstrating the issues that arise with counterfactual reasoning when counterfactual ordering is not enforced.
    Neural tangent kernel analysis of shallow $\alpha$-Stable ReLU neural networks. (arXiv:2206.08065v1 [cs.LG])
    There is a recent literature on large-width properties of Gaussian neural networks (NNs), i.e. NNs whose weights are distributed according to Gaussian distributions. Two popular problems are: i) the study of the large-width behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the large-width training dynamics of NNs, which set forth an equivalence between training the rescaled NN and performing a kernel regression with a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for $\alpha$-Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as $\alpha$-Stable distributions with $\alpha\in(0,2]$, i.e. distributions with heavy tails. For shallow $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, i.e. a stochastic process with $\alpha$-Stable finite-dimensional distributions. As a novelty with respect to the Gaussian setting, in the $\alpha$-Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub-linear functions. Then, our main contribution is the NTK analysis of shallow $\alpha$-Stable ReLU-NNs, which leads to an equivalence between training a rescaled NN and performing a kernel regression with an $(\alpha/2)$-Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting the randomness of the NN at initialization does not vanish in the NTK analysis, thus inducing a distribution for the kernel of the underlying kernel regression.
    Face Anti-Spoofing by Learning Polarization Cues in a Real-World Scenario. (arXiv:2003.08024v3 [cs.CV] UPDATED)
    Face anti-spoofing is the key to preventing security breaches in biometric recognition applications. Existing software-based and hardware-based face liveness detection methods are effective only in constrained environments or on designated datasets. Deep learning methods using RGB and infrared images demand a large amount of training data for new attacks. In this paper, we present a face anti-spoofing method for real-world scenarios that automatically learns the physical characteristics that distinguish polarization images of a real face from those of a deceptive attack. A computational framework is developed to extract and classify the unique face features using convolutional neural networks and an SVM together. Our real-time polarized face anti-spoofing (PAAS) detection method uses an on-chip integrated polarization imaging sensor with optimized processing algorithms. Extensive experiments demonstrate the advantages of the PAAS technique in countering diverse face spoofing attacks (print, replay, mask) in uncontrolled indoor and outdoor conditions by learning polarized face images of 33 people. A four-directional polarized face image dataset is released to inspire future applications within the biometric anti-spoofing field.
    DEEMD: Drug Efficacy Estimation against SARS-CoV-2 based on cell Morphology with Deep multiple instance learning. (arXiv:2105.05758v2 [cs.LG] UPDATED)
    Drug repurposing can accelerate the identification of effective compounds for clinical use against SARS-CoV-2, with the advantage of pre-existing clinical safety data and an established supply chain. RNA viruses such as SARS-CoV-2 manipulate cellular pathways and induce reorganization of subcellular structures to support their life cycle. These morphological changes can be quantified using bioimaging techniques. In this work, we developed DEEMD: a computational pipeline using deep neural network models within a multiple instance learning framework, to identify putative treatments effective against SARS-CoV-2 based on morphological analysis of the publicly available RxRx19a dataset. This dataset consists of fluorescence microscopy images of SARS-CoV-2 non-infected cells and infected cells, with and without drug treatment. DEEMD first extracts discriminative morphological features to generate cell morphological profiles from the non-infected and infected cells. These morphological profiles are then used in a statistical model to estimate the applied treatment efficacy on infected cells based on similarities to non-infected cells. DEEMD is capable of localizing infected cells via weak supervision without any expensive pixel-level annotations. DEEMD identifies known SARS-CoV-2 inhibitors, such as Remdesivir and Aloxistatin, supporting the validity of our approach. DEEMD can be explored for use on other emerging viruses and datasets to rapidly identify candidate antiviral treatments in the future. Our implementation is available online at https://www.github.com/Sadegh-Saberian/DEEMD
    Learning Models of Individual Behavior in Chess. (arXiv:2008.10086v3 [cs.AI] UPDATED)
    AI systems that can capture human-like behavior are becoming increasingly useful in situations where humans may want to learn from these systems, collaborate with them, or engage with them as partners for an extended duration. In order to develop human-oriented AI systems, the problem of predicting human actions -- as opposed to predicting optimal actions -- has received considerable attention. Existing work has focused on capturing human behavior in an aggregate sense, which potentially limits the benefit any particular individual could gain from interaction with these systems. We extend this line of work by developing highly accurate predictive models of individual human behavior in chess. Chess is a rich domain for exploring human-AI interaction because it combines a unique set of properties: AI systems achieved superhuman performance many years ago, and yet humans still interact with them closely, both as opponents and as preparation tools, and there is an enormous corpus of recorded data on individual player games. Starting with Maia, an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction accuracy of a particular player's moves by applying a series of fine-tuning methods. Furthermore, our personalized models can be used to perform stylometry -- predicting who made a given set of moves -- indicating that they capture human decision-making at an individual level. Our work demonstrates a way to bring AI systems into better alignment with the behavior of individual people, which could lead to large improvements in human-AI interaction.
    OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set Unlabeled Data. (arXiv:2107.08943v2 [cs.CV] UPDATED)
    Semi-supervised learning (SSL) is one of the most promising paradigms to circumvent the expensive labeling cost for building a high-performance model. Most existing SSL methods conventionally assume both labeled and unlabeled data are drawn from the same (class) distribution. However, unlabeled data may include out-of-class samples in practice; such samples cannot be assigned one-hot encoded labels from the closed set of classes present in the labeled data, i.e., the unlabeled data form an open set. In this paper, we introduce OpenCoS, a method for handling this realistic semi-supervised learning scenario based upon a recent framework of self-supervised visual representation learning. Specifically, we first observe that the out-of-class samples in the open-set unlabeled dataset can be identified effectively via self-supervised contrastive learning. Then, OpenCoS utilizes this information to overcome the failure modes in the existing state-of-the-art semi-supervised methods, by utilizing one-hot pseudo-labels and soft-labels for the identified in- and out-of-class unlabeled data, respectively. Our extensive experimental results show the effectiveness of OpenCoS in adapting state-of-the-art semi-supervised methods to diverse scenarios involving open-set unlabeled data.
    Long Range Graph Benchmark. (arXiv:2206.08164v1 [cs.LG])
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in the development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct that arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.
    New Versions of Gradient Temporal Difference Learning. (arXiv:2109.04033v2 [cs.LG] UPDATED)
    Sutton, Szepesvári and Maei introduced the first gradient temporal-difference (GTD) learning algorithms compatible with both linear function approximation and off-policy training. The goal of this paper is (a) to propose some variants of GTDs with extensive comparative analysis and (b) to establish new theoretical analysis frameworks for the GTDs. These variants are based on convex-concave saddle-point interpretations of GTDs, which effectively unify all the GTDs into a single framework, and provide simple stability analysis based on recent results on primal-dual gradient dynamics. Finally, numerical comparative analysis is given to evaluate these approaches.
    Phase transitions in nonparametric regressions: a curse of exploiting higher degree smoothness assumptions in finite samples. (arXiv:2112.03626v3 [math.ST] UPDATED)
    When the regression function belongs to the smooth classes consisting of univariate functions with derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e., it is generally viewed that exploiting higher degree smoothness assumption helps reduce the estimation error. This paper shows that the minimax optimal mean integrated squared error (MISE) rate increases in $\gamma$ when the sample size $n$ is small relative to $\left(\gamma+1\right)^{2\gamma+3}$ (e.g., $\left(\gamma+1\right)^{2\gamma+3}=262144$ when $\gamma=3$), and decreases in $\gamma$ when $n$ is large relative to $\left(\gamma+1\right)^{2\gamma+3}$. In particular, this phase transition property is shown to be achieved by common nonparametric procedures. Consider $\gamma_{1}$ and $\gamma_{2}$ such that $\gamma_{1}<\gamma_{2}$, where the $(\gamma_{2}+1)$th degree smoothness class is a subset of the $(\gamma_{1}+1)$th degree class. What is interesting about our results is that they imply, if $n$ is small relative to $\left(\gamma_{1}+1\right)^{2\gamma_{1}+3}$, the optimal rate achieved by the estimator constrained to be in the smoother class is larger. In data sets with fewer than hundreds-of-thousands observations, our results suggest that one should not exploit beyond the third degree of smoothness. To some extent, our results provide a theoretical basis for the widely adopted practical recommendation given by Gelman and Imbens (2019). The building blocks of our minimax optimality results are a set of metric entropy bounds we develop in this paper for smooth function classes. Some of our bounds are original, and some of them refine and/or generalize the ones in the literature.
    Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination. (arXiv:2206.07989v1 [cs.LG])
    The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of datasets to avoid possible dangerous out-of-distribution actions or states, making it challenging to handle out-of-support regions. Model-based RL methods can enrich the dataset and improve generalization by generating imaginary trajectories with a trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus downgrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with a double check. We introduce conservatism by trusting samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.
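    The agreement filter at the heart of the method can be sketched as follows; the toy dynamics models and tolerance are illustrative, not the authors' implementation:

```python
import numpy as np

def double_check(s, a, forward_model, backward_model, tol=0.1):
    s_next = forward_model(s, a)           # imagine the transition forward
    s_back = backward_model(s_next, a)     # reconstruct the starting state
    disagreement = np.linalg.norm(s - s_back)
    # Trust the imagined transition only when both models agree on it.
    return (s, a, s_next) if disagreement < tol else None

fwd = lambda s, a: s + a                   # toy forward dynamics
bwd = lambda s_next, a: s_next - a         # toy reverse dynamics
sample = double_check(np.zeros(3), np.ones(3), fwd, bwd)
```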
    Learning to Infer Structures of Network Games. (arXiv:2206.08119v1 [cs.LG])
    Strategic interactions between a group of individuals or organisations can be modelled as games played on networks, where a player's payoff depends not only on their actions but also on those of their neighbours. Inferring the network structure from observed game outcomes (equilibrium actions) is an important problem with numerous potential applications in economics and social sciences. Existing methods mostly require the knowledge of the utility function associated with the game, which is often unrealistic to obtain in real-world scenarios. We adopt a transformer-like architecture which correctly accounts for the symmetries of the problem and learns a mapping from the equilibrium actions to the network structure of the game without explicit knowledge of the utility function. We test our method on three different types of network games using both synthetic and real-world data, and demonstrate its effectiveness in network structure inference and superior performance over existing methods.
    NCGNN: Node-Level Capsule Graph Neural Network for Semisupervised Classification. (arXiv:2012.03476v2 [cs.LG] UPDATED)
    Message passing has evolved as an effective tool for designing Graph Neural Networks (GNNs). However, most existing methods for message passing simply sum or average all the neighboring features to update node representations. They are restricted by two problems, i.e., (i) lack of interpretability to identify node features significant to the prediction of GNNs, and (ii) feature over-mixing that leads to the over-smoothing issue in capturing long-range dependencies and inability to handle graphs under heterophily or low homophily. In this paper, we propose a Node-level Capsule Graph Neural Network (NCGNN) to address these problems with an improved message passing scheme. Specifically, NCGNN represents nodes as groups of node-level capsules, in which each capsule extracts distinctive features of its corresponding node. For each node-level capsule, a novel dynamic routing procedure is developed to adaptively select appropriate capsules for aggregation from a subgraph identified by the designed graph filter. NCGNN aggregates only the advantageous capsules and restrains irrelevant messages to avoid over-mixing features of interacting nodes. Therefore, it can relieve the over-smoothing issue and learn effective node representations over graphs with homophily or heterophily. Furthermore, our proposed message passing scheme is inherently interpretable and exempt from complex post-hoc explanations, as the graph filter and the dynamic routing procedure identify a subset of node features that are most significant to the model prediction from the extracted subgraph. Extensive experiments on synthetic as well as real-world graphs demonstrate that NCGNN can well address the over-smoothing issue and produce better node representations for semisupervised node classification. It outperforms the state of the art under both homophily and heterophily.
    Simultaneously Learning Stochastic and Adversarial Bandits with General Graph Feedback. (arXiv:2206.07908v1 [cs.LG])
    The problem of online learning with graph feedback has been extensively studied in the literature due to its generality and potential to model various learning tasks. Existing works mainly study the adversarial and stochastic feedback separately. If the prior knowledge of the feedback mechanism is unavailable or wrong, such specially designed algorithms could suffer great loss. To avoid this problem, Erez and Koren (2021) try to optimize for both environments. However, they assume the feedback graphs are undirected and each vertex has a self-loop, which compromises the generality of the framework and may not be satisfied in applications. With a general feedback graph, the observation of an arm may not be available when this arm is pulled, which makes the exploration more expensive and the algorithms more challenging to perform optimally in both environments. In this work, we overcome this difficulty by a new trade-off mechanism with a carefully-designed proportion for exploration and exploitation. We prove the proposed algorithm simultaneously achieves $\mathrm{poly} \log T$ regret in the stochastic setting and minimax-optimal regret of $\tilde{O}(T^{2/3})$ in the adversarial setting where $T$ is the horizon and $\tilde{O}$ hides parameters independent of $T$ as well as logarithmic terms. To our knowledge, this is the first best-of-both-worlds result for general feedback graphs.
    Analysis and Extensions of Adversarial Training for Video Classification. (arXiv:2206.07953v1 [cs.CV])
    Adversarial training (AT) is a simple yet effective defense against adversarial attacks to image classification systems, which is based on augmenting the training set with attacks that maximize the loss. However, the effectiveness of AT as a defense for video classification has not been thoroughly studied. Our first contribution is to show that generating optimal attacks for video requires carefully tuning the attack parameters, especially the step size. Notably, we show that the optimal step size varies linearly with the attack budget. Our second contribution is to show that using a smaller (sub-optimal) attack budget at training time leads to a more robust performance at test time. Based on these findings, we propose three defenses against attacks with variable attack budgets. The first one, Adaptive AT, is a technique where the attack budget is drawn from a distribution that is adapted as training iterations proceed. The second, Curriculum AT, is a technique where the attack budget is increased as training iterations proceed. The third, Generative AT, further couples AT with a denoising generative adversarial network to boost robust performance. Experiments on the UCF101 dataset demonstrate that the proposed methods improve adversarial robustness against multiple attack types.
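    A sketch of a PGD-style attack in which, per the paper's observation, the step size is chosen proportional to the budget eps; the 0.5 fraction and the toy usage below are assumptions:

```python
import torch

def pgd_attack(model, x, y, loss_fn, eps, steps=5, step_frac=0.5):
    step = step_frac * eps            # step size scales linearly with the budget
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the loss, then project back into the L-infinity ball.
        delta = (delta + step * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

net = torch.nn.Linear(10, 3)
x_adv = pgd_attack(net, torch.randn(4, 10), torch.randint(0, 3, (4,)),
                   torch.nn.functional.cross_entropy, eps=0.1)
```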
    A machine-generated catalogue of Charon's craters and implications for the Kuiper belt. (arXiv:2206.08277v1 [astro-ph.EP])
    In this paper we investigate Charon's crater size distribution using a deep learning model. This is motivated by the recent results of Singer et al. (2019) who, using manual cataloging, found a change in the size distribution slope of craters smaller than 12 km in diameter, translating into a paucity of small Kuiper Belt objects. These results were corroborated by Robbins and Singer (2021), but opposed by Morbidelli et al. (2021), necessitating an independent review. Our MaskRCNN-based ensemble of models was trained on Lunar, Mercurian, and Martian crater catalogues and both optical and digital elevation images. We use a robust image augmentation scheme to force the model to generalize and transfer-learn into icy objects. With no prior bias or exposure to Charon, our model finds best-fit slopes of q = -1.47 ± 0.33 for craters smaller than 10 km, and q = -2.91 ± 0.51 for craters larger than 15 km. These values indicate a clear change in slope around 15 km as suggested by Singer et al. (2019) and thus independently confirm their conclusions. Our slopes however are both slightly flatter than those found more recently by Robbins and Singer (2021). Our trained models and relevant codes are available online on github.com/malidib/ACID .
    Lifelong Wandering: A realistic few-shot online continual learning setting. (arXiv:2206.07932v1 [cs.CV])
    Online few-shot learning describes a setting where models are trained and evaluated on a stream of data while learning emerging classes. While prior work in this setting has achieved very promising performance on instance classification when learning from data-streams composed of a single indoor environment, we propose to extend this setting to consider object classification on a series of several indoor environments, which is likely to occur in applications such as robotics. Importantly, our setting, which we refer to as online few-shot continual learning, injects the well-studied issue of catastrophic forgetting into the few-shot online learning paradigm. In this work, we benchmark several existing methods and adapted baselines within our setting, and show there exists a trade-off between catastrophic forgetting and online performance. Our findings motivate the need for future work in this setting, which can achieve better online performance without catastrophic forgetting.
    Large-scale, multi-centre, multi-disease validation of an AI clinical tool for cine CMR analysis. (arXiv:2206.08137v1 [eess.IV])
    INTRODUCTION: Artificial intelligence (AI) has the potential to facilitate the automation of CMR analysis for biomarker extraction. However, most AI algorithms are trained on a specific input domain (e.g., single scanner vendor or hospital-tailored imaging protocol) and lack the robustness to perform optimally when applied to CMR data from other input domains. METHODS: Our proposed framework consists of an AI-based algorithm for biventricular segmentation of short-axis images, followed by a post-analysis quality control to detect erroneous results. The segmentation algorithm was trained on a large dataset of clinical CMR scans from two NHS hospitals (n=2793) and validated on additional cases from this dataset (n=441) and on five external datasets (n=6808). The validation data included CMR scans of patients with a range of diseases acquired at 12 different centres using CMR scanners from all major vendors. RESULTS: Our method yielded median Dice scores over 87%, translating into median absolute errors in cardiac biomarkers within the range of inter-observer variability: <8.4mL (left ventricle), <9.2mL (right ventricle), <13.3g (left ventricular mass), and <5.9% (ejection fraction) across all datasets. Stratification of cases according to phenotypes of cardiac disease and scanner vendors showed good agreement. CONCLUSIONS: We show that our proposed tool, which combines a state-of-the-art AI algorithm trained on a large-scale multi-domain CMR dataset with a post-analysis quality control, allows us to robustly deal with routine clinical data from multiple centres, vendors, and cardiac diseases. This is a fundamental step for the clinical translation of AI algorithms. Moreover, our method yields a range of additional biomarkers of cardiac function (filling and ejection rates, regional wall motion, and strain) at no extra computational cost.
    Barrier Certified Safety Learning Control: When Sum-of-Square Programming Meets Reinforcement Learning. (arXiv:2206.07915v1 [eess.SY])
    Safety guarantees are essential in many engineering applications. Reinforcement learning provides a useful way to strengthen safety; however, reinforcement learning algorithms cannot completely guarantee safety in realistic operation. To address this issue, this work combines control barrier functions with reinforcement learning and proposes a compensated algorithm that maintains safety throughout. Specifically, sum-of-squares programming is exploited to search for the optimal controller and to tune the learning hyperparameters simultaneously, so the control actions are guaranteed to always remain within the safe region. The effectiveness of the proposed method is demonstrated on an inverted pendulum model. Compared to quadratic-programming-based reinforcement learning methods, our sum-of-squares-programming-based reinforcement learning shows superior performance.
    Large-Scale Differentiable Causal Discovery of Factor Graphs. (arXiv:2206.07824v1 [stat.ML])
    A common theme in causal inference is learning causal relationships between observed variables, also known as causal discovery. This is usually a daunting task, given the large number of candidate causal graphs and the combinatorial nature of the search space. Perhaps for this reason, most research has so far focused on relatively small causal graphs, with up to hundreds of nodes. However, recent advances in fields like biology enable generating experimental data sets with thousands of interventions followed by rich profiling of thousands of variables, raising the opportunity and urgent need for large causal graph models. Here, we introduce the notion of factor directed acyclic graphs (f-DAGs) as a way to restrict the search space to non-linear low-rank causal interaction models. Combining this novel structural assumption with recent advances that bridge the gap between causal discovery and continuous optimization, we achieve causal discovery on thousands of variables. Additionally, as a model for the impact of statistical noise on this estimation procedure, we study a model of edge perturbations of the f-DAG skeleton based on random graphs and quantify the effect of such perturbations on the f-DAG rank. This theoretical analysis suggests that the set of candidate f-DAGs is much smaller than the whole DAG space and thus more statistically robust in the high-dimensional regime where the underlying skeleton is hard to assess. We propose Differentiable Causal Discovery of Factor Graphs (DCD-FG), a scalable implementation of f-DAG constrained causal discovery for high-dimensional interventional data. DCD-FG uses a Gaussian non-linear low-rank structural equation model and shows significant improvements compared to state-of-the-art methods in both simulations as well as a recent large-scale single-cell RNA sequencing data set with hundreds of genetic interventions.
    DCASE 2022: Comparative Analysis Of CNNs For Acoustic Scene Classification Under Low-Complexity Considerations. (arXiv:2206.08007v1 [cs.SD])
    Acoustic scene classification is an automatic listening problem that aims to assign an audio recording to a pre-defined scene based on its audio data. Over the years (and in past editions of DCASE) this problem has often been solved with ensembles, i.e., several machine learning models whose predictions are combined in the inference phase. While such solutions can achieve strong accuracy, they can be very expensive in terms of computational capacity, making it impossible to deploy them on IoT devices. Reflecting this trend in the field, this task imposes two limits on model complexity. There is also the added difficulty of mismatched devices (the provided audio is recorded by different sources of information). This technical report presents a comparative study of two different network architectures: a conventional CNN and a Conv-mixer. Although both networks exceed the baseline required by the competition, the conventional CNN shows higher performance, exceeding the baseline by 8 percentage points. Solutions based on Conv-mixer architectures show worse performance, although they are much lighter.
    Disparate Impact in Differential Privacy from Gradient Misalignment. (arXiv:2206.07737v1 [cs.LG])
    As machine learning becomes more widespread throughout society, aspects including data privacy and fairness must be carefully considered, and are crucial for deployment in highly regulated industries. Unfortunately, the application of privacy enhancing technologies can worsen unfair tendencies in models. In particular, one of the most widely used techniques for private model training, differentially private stochastic gradient descent (DPSGD), frequently intensifies disparate impact on groups within data. In this work we study the fine-grained causes of unfairness in DPSGD and identify gradient misalignment due to inequitable gradient clipping as the most significant source. This observation leads us to a new method for reducing unfairness by preventing gradient misalignment in DPSGD.
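    For reference, the per-example clipping step where the paper locates gradient misalignment can be sketched as follows; the shapes, constants, and caller-supplied gradients are illustrative:

```python
import numpy as np

def dpsgd_step(params, per_example_grads, clip_norm=1.0, noise_mult=1.0,
               lr=0.1, rng=np.random.default_rng(0)):
    # Clip each example's gradient to norm <= clip_norm; the paper identifies
    # this inequitable rescaling as the main source of gradient misalignment.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_example_grads * scale
    # Add calibrated Gaussian noise to the aggregated gradient.
    noisy = clipped.sum(0) + noise_mult * clip_norm * rng.normal(size=params.shape)
    return params - lr * noisy / len(per_example_grads)

new_params = dpsgd_step(np.zeros(4),
                        np.random.default_rng(1).normal(size=(32, 4)))
```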
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity or Independent Influences. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Efficient Approximation of Expected Hypervolume Improvement using Gauss-Hermite Quadrature. (arXiv:2206.07834v1 [cs.LG])
    Many methods for performing multi-objective optimisation of computationally expensive problems have been proposed recently. Typically, a probabilistic surrogate for each objective is constructed from an initial dataset. The surrogates can then be used to produce predictive densities in the objective space for any solution. Using the predictive densities, we can compute the expected hypervolume improvement (EHVI) due to a solution. Maximising the EHVI, we can locate the most promising solution that may be expensively evaluated next. There are closed-form expressions for computing the EHVI, integrating over the multivariate predictive densities. However, they require partitioning the objective space, which can be prohibitively expensive for more than three objectives. Furthermore, there are no closed-form expressions for a problem where the predictive densities are dependent, capturing the correlations between objectives. Monte Carlo approximation is used instead in such cases, which is not cheap. Hence, the need to develop new accurate but cheaper approximation methods remains. Here we investigate an alternative approach toward approximating the EHVI using Gauss-Hermite quadrature. We show that it can be an accurate alternative to Monte Carlo for both independent and correlated predictive densities with statistically significant rank correlations for a range of popular test problems.
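    As a toy illustration of the quadrature, the sketch below approximates an expectation under independent Gaussian predictive densities over two objectives; the improvement function is a simple box volume relative to a reference point, not the full EHVI:

```python
import numpy as np

def gh_expectation_2d(g, mu, sigma, n=16):
    # Gauss-Hermite nodes/weights for integrals against exp(-t^2).
    x, w = np.polynomial.hermite.hermgauss(n)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Change of variables: y = mu + sqrt(2) * sigma * node.
            y = mu + np.sqrt(2.0) * sigma * np.array([x[i], x[j]])
            total += w[i] * w[j] * g(y)
    return total / np.pi          # product of the two 1/sqrt(pi) normalizers

r = np.array([1.0, 1.0])                          # reference point
g = lambda y: np.prod(np.maximum(r - y, 0.0))     # box improvement volume
mu, sigma = np.array([0.2, 0.4]), np.array([0.3, 0.2])
print(gh_expectation_2d(g, mu, sigma))            # matches a Monte Carlo estimate
```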
    Metric-Fair Classifier Derandomization. (arXiv:2206.07826v1 [cs.LG])
    We study the problem of classifier derandomization in machine learning: given a stochastic binary classifier $f: X \to [0,1]$, sample a deterministic classifier $\hat{f}: X \to \{0,1\}$ that approximates the output of $f$ in aggregate over any data distribution. Recent work revealed how to efficiently derandomize a stochastic classifier with strong output approximation guarantees, but at the cost of individual fairness -- that is, if $f$ treated similar inputs similarly, $\hat{f}$ did not. In this paper, we initiate a systematic study of classifier derandomization with metric fairness guarantees. We show that the prior derandomization approach is almost maximally metric-unfair, and that a simple "random threshold" derandomization achieves optimal fairness preservation but with weaker output approximation. We then devise a derandomization procedure that provides an appealing tradeoff between these two: if $f$ is $\alpha$-metric fair according to a metric $d$ with a locality-sensitive hash (LSH) family, then our derandomized $\hat{f}$ is, with high probability, $O(\alpha)$-metric fair and a close approximation of $f$. We also prove generic results applicable to all (fair and unfair) classifier derandomization procedures, including a bias-variance decomposition and reductions between various notions of metric fairness.
    OmniMAE: Single Model Masked Pretraining on Images and Videos. (arXiv:2206.08356v1 [cs.CV])
    Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
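    The high-ratio patch dropping that enables this speedup can be sketched as follows; the encoder itself is omitted and the shapes are illustrative:

```python
import torch

def random_mask(patches, mask_ratio=0.9):
    # patches: (batch, num_patches, dim); keep only ~10% of tokens so the
    # encoder processes a small fraction of the input.
    B, N, D = patches.shape
    keep = max(1, int(N * (1 - mask_ratio)))
    idx = torch.rand(B, N).argsort(dim=1)[:, :keep]   # random subset per sample
    visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, idx   # encoder sees `visible`; a decoder reconstructs the rest

vis, idx = random_mask(torch.randn(2, 196, 768), mask_ratio=0.9)
```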
    Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing. (arXiv:2206.08357v1 [cs.CV])
    Existing GAN inversion and editing methods work well for aligned objects with a clean background, such as portraits and animal faces, but often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea is to explore inversion with a collection of layers, spatially adapting the inversion process to the difficulty of the image. We learn to predict the "invertibility" of different image segments and project each segment into a latent layer. Easier regions can be inverted into an earlier layer in the generator's latent space, while more challenging regions can be inverted into a later feature space. Experiments show that our method obtains better inversion results compared to the recent approaches on complex categories, while maintaining downstream editability. Please refer to our project page at https://www.cs.cmu.edu/~SAMInversion.
    Research Topic Flows in Co-Authorship Networks. (arXiv:2206.07980v1 [cs.SI])
    In scientometrics, scientific collaboration is often analyzed by means of co-authorships. An aspect which is often overlooked and more difficult to quantify is the flow of expertise between authors from different research topics, which is an important part of scientific progress. With the Topic Flow Network (TFN) we propose a graph structure for the analysis of research topic flows between scientific authors and their respective research fields. Based on a multi-graph and a topic model, our proposed network structure accounts for intratopic as well as intertopic flows. Our method requires for the construction of a TFN solely a corpus of publications (i.e., author and abstract information). From this, research topics are discovered automatically through non-negative matrix factorization. The thereof derived TFN allows for the application of social network analysis techniques, such as common metrics and community detection. Most importantly, it allows for the analysis of intertopic flows on a large, macroscopic scale, i.e., between research topics, as well as on a microscopic scale, i.e., between certain sets of authors. We demonstrate the utility of TFNs by applying our method to two comprehensive corpora of a total of 20 million publications spanning more than 60 years of research in the fields of computer science and mathematics. Our results give evidence that TFNs are suitable, e.g., for the analysis of topical communities, the discovery of important authors in different fields, and, most notably, the analysis of intertopic flows, i.e., the transfer of topical expertise. Besides that, our method opens new directions for future research, such as the investigation of influence relationships between research fields.
    Automated analysis of continuum fields from atomistic simulations using statistical machine learning. (arXiv:2206.08048v1 [cond-mat.mtrl-sci])
    Atomistic simulations of the molecular dynamics/statics kind are regularly used to study small scale plasticity. Contemporary simulations are performed with tens to hundreds of millions of atoms, with snapshots of these configurations written out at regular intervals for further analysis. Continuum scale constitutive models for material behavior can benefit from information on the atomic scale, in particular in terms of the deformation mechanisms, the accommodation of the total strain and partitioning of stress and strain fields in individual grains. In this work we develop a methodology using statistical data mining and machine learning algorithms to automate the analysis of continuum field variables in atomistic simulations. We focus on three important field variables: total strain, elastic strain and microrotation. Our results show that the elastic strain in individual grains exhibits a unimodal log-normal distribution, whilst the total strain and microrotation fields evidence a multimodal distribution. The peaks in the distribution of total strain are identified with a Gaussian mixture model and methods to circumvent overfitting problems are presented. Subsequently, we evaluate the identified peaks in terms of deformation mechanisms in a grain, which, for example, helps to quantify the strain for which individual deformation mechanisms are responsible. The overall statistics of the distributions over all grains are an important input for higher scale models, which ultimately also helps to be able to quantitatively discuss the implications for information transfer to phenomenological models.
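    A minimal sketch of the peak-identification step, with an information criterion guarding against overfitting the number of components (the synthetic strain values and component counts are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal "total strain" values standing in for one grain's field.
strain = np.concatenate([rng.normal(0.01, 0.002, 500),
                         rng.normal(0.03, 0.004, 300)]).reshape(-1, 1)
# Fit mixtures with 1..4 components and let BIC pick the peak count.
models = [GaussianMixture(k, random_state=0).fit(strain) for k in range(1, 5)]
best = min(models, key=lambda m: m.bic(strain))
print(best.n_components, best.means_.ravel())
```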
    Noisy Learning for Neural ODEs Acts as a Robustness Locus Widening. (arXiv:2206.08237v1 [cs.LG])
    We investigate the problems and challenges of evaluating the robustness of Differential Equation-based (DE) networks against synthetic distribution shifts. We propose a novel and simple accuracy metric which can be used to evaluate intrinsic robustness and to validate dataset corruption simulators. We also propose methodology recommendations, destined for evaluating the many faces of neural DEs' robustness and for comparing them with their discrete counterparts rigorously. We then use this criterion to evaluate a cheap data augmentation technique as a reliable way of demonstrating the natural robustness of neural ODEs against simulated image corruptions across multiple datasets.
    ProGNNosis: A Data-driven Model to Predict GNN Computation Time Using Graph Metrics. (arXiv:2206.08258v1 [cs.LG])
    Graph Neural Networks (GNN) show great promise in problems dealing with graph-structured data. One of the unique points of GNNs is their flexibility to adapt to multiple problems, which not only leads to wide applicability, but also poses important challenges when finding the best model or acceleration technique for a particular problem. One such challenge is that the accuracy or effectiveness of a GNN model or acceleration technique generally depends on the structure of the underlying graph. In this paper, in an attempt to address the problem of graph-dependent acceleration, we propose ProGNNosis, a data-driven model that can predict the GNN training time of a given GNN model running over a graph of arbitrary characteristics by inspecting the input graph metrics. Such prediction is made based on a regression that was previously trained offline using a diverse synthetic graph dataset. In practice, our method allows making informed decisions on which design to use for a specific problem. In the paper, the methodology to build ProGNNosis is defined and applied to a specific use case, where it helps to decide which graph representation is better. Our results show that ProGNNosis helps achieve an average speedup of 1.22X over randomly selecting a graph representation in multiple widely used GNN models such as GCN, GIN, GAT, or GraphSAGE.
    Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History. (arXiv:2206.08039v1 [cs.SD])
    We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned on the history of linguistic and prosodic features to predict the appropriate dialogue context. As such, it can be regarded as an extension of the conventional linguistic-feature-based dialogue history modeling. To train the empathetic DSS model effectively, we investigate 1) a self-supervised learning model pretrained with large speech corpora, 2) a style-guided training using a prosody embedding of the current utterance to be predicted by the dialogue context embedding, 3) a cross-modal attention to combine text and speech modalities, and 4) a sentence-wise embedding to achieve fine-grained prosody modeling rather than utterance-wise modeling. The evaluation results demonstrate that 1) simply considering prosodic contexts of the dialogue history does not improve the quality of speech in empathetic DSS and 2) introducing style-guided training and sentence-wise embedding modeling achieves higher speech quality than that by the conventional method.
    On Private Online Convex Optimization: Optimal Algorithms in $\ell_p$-Geometry and High Dimensional Contextual Bandits. (arXiv:2206.08111v1 [cs.LG])
    Differentially private (DP) stochastic convex optimization (SCO) is ubiquitous in trustworthy machine learning algorithm design. This paper studies the DP-SCO problem with streaming data that are sampled from a distribution and arrive sequentially. We also consider the continual release model where parameters related to private information are updated and released upon each new data point, often known as the online setting. Although numerous algorithms have been developed to achieve the optimal excess risks in different $\ell_p$ norm geometries, none of the existing ones can be adapted to the streaming and continual release setting. To address such a challenge as the online convex optimization with privacy protection, we propose a private variant of the online Frank-Wolfe algorithm with recursive gradients for variance reduction to update and reveal the parameters upon each data point. Combined with the adaptive differential privacy analysis, our online algorithm achieves in linear time the optimal excess risk when $1<p\leq 2$ and the state-of-the-art excess risk, matching the non-private lower bounds, when $2<p\leq\infty$. Our algorithm can also be extended to the case $p=1$ to achieve nearly dimension-independent excess risk. While previous variance reduction results on recursive gradient have theoretical guarantees only in the independent and identically distributed sample setting, we establish such a guarantee in a non-stationary setting. To demonstrate the virtues of our method, we design the first DP algorithm for high-dimensional generalized linear bandits with logarithmic regret. Comparative experiments with a variety of DP-SCO and DP-Bandit algorithms exhibit the efficacy and utility of the proposed algorithms.
    Attention-wise masked graph contrastive learning for predicting molecular property. (arXiv:2206.08262v1 [q-bio.BM])
    Accurate and efficient prediction of the molecular properties of drugs is one of the fundamental problems in drug research and development. Recent advancements in representation learning have been shown to greatly improve the performance of molecular property prediction. However, due to limited labeled data, supervised learning-based molecular representation algorithms can only search limited chemical space, which results in poor generalizability. In this work, we proposed a self-supervised representation learning framework for large-scale unlabeled molecules. We developed a novel molecular graph augmentation strategy, referred to as attention-wise graph mask, to generate challenging positive samples for contrastive learning. We adopted the graph attention network (GAT) as the molecular graph encoder, and leveraged the learned attention scores as masking guidance to generate molecular augmentation graphs. By minimizing the contrastive loss between the original graph and the masked graph, our model can capture important molecular structure and higher-order semantic information. Extensive experiments showed that our attention-wise graph mask contrastive learning exhibits state-of-the-art performance on several downstream molecular property prediction tasks.
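    The contrastive objective between the original and masked graph embeddings can be sketched as a simplified InfoNCE-style loss; the GAT encoder and the attention-guided masking that produce z1 and z2 are assumed:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.5):
    # z1: embeddings of original graphs, z2: embeddings of masked graphs.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```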
    BYOL-Explore: Exploration by Bootstrapped Prediction. (arXiv:2206.08332v1 [cs.LG])
    We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
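    The flavor of such an intrinsic reward, the prediction error of a latent world model measured against a slow-moving target network, can be sketched as follows; all architectures below are toy stand-ins, not the paper's networks:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent = 16, 4, 32
encoder = nn.Linear(obs_dim, latent)
target_encoder = nn.Linear(obs_dim, latent)   # slow-moving (e.g., EMA) copy
predictor = nn.Linear(latent + act_dim, latent)

def intrinsic_reward(obs, act, next_obs):
    pred = predictor(torch.cat([encoder(obs), act], dim=-1))
    with torch.no_grad():
        target = target_encoder(next_obs)     # no gradient through the target
    # Cosine prediction error in latent space serves as the exploration bonus.
    return 1 - torch.cosine_similarity(pred, target, dim=-1)

r = intrinsic_reward(torch.randn(5, obs_dim), torch.randn(5, act_dim),
                     torch.randn(5, obs_dim))
```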
    Continual Learning with Guarantees via Weight Interval Constraints. (arXiv:2206.07996v1 [cs.LG])
    We introduce a new training paradigm that enforces interval constraints on neural network parameter space to control forgetting. Contemporary Continual Learning (CL) methods focus on training neural networks efficiently from a stream of data, while reducing the negative impact of catastrophic forgetting, yet they do not provide any firm guarantees that network performance will not deteriorate uncontrollably over time. In this work, we show how to put bounds on forgetting by reformulating continual learning of a model as a continual contraction of its parameter space. To that end, we propose Hyperrectangle Training, a new training methodology where each task is represented by a hyperrectangle in the parameter space, fully contained in the hyperrectangles of the previous tasks. This formulation reduces the NP-hard CL problem back to polynomial time while providing full resilience against forgetting. We validate our claim by developing InterContiNet (Interval Continual Learning) algorithm which leverages interval arithmetic to effectively model parameter regions as hyperrectangles. Through experimental results, we show that our approach performs well in a continual learning setup without storing data from previous tasks.
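    A sketch of the interval-arithmetic primitive such a method builds on: propagating a weight hyperrectangle [W_lo, W_hi] through a linear layer to obtain guaranteed output bounds (the dimensions and interval radius are made up):

```python
import numpy as np

def interval_linear(x, W_lo, W_hi, b_lo, b_hi):
    # Midpoint/radius form of the weight and bias hyperrectangles.
    W_mid, W_rad = (W_hi + W_lo) / 2, (W_hi - W_lo) / 2
    b_mid, b_rad = (b_hi + b_lo) / 2, (b_hi - b_lo) / 2
    mid = W_mid @ x + b_mid
    rad = W_rad @ np.abs(x) + b_rad   # worst-case spread for a fixed input x
    return mid - rad, mid + rad       # guaranteed bounds on the layer output

rng = np.random.default_rng(0)
W, eps = rng.normal(size=(3, 5)), 0.05
lo, hi = interval_linear(rng.normal(size=5), W - eps, W + eps,
                         np.zeros(3) - eps, np.zeros(3) + eps)
```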
    Gradient-Based Adversarial and Out-of-Distribution Detection. (arXiv:2206.08255v1 [cs.LG])
    We propose to utilize gradients for detecting adversarial and out-of-distribution samples. We introduce confounding labels -- labels that differ from normal labels seen during training -- in gradient generation to probe the effective expressivity of neural networks. Gradients depict the amount of change required for a model to properly represent given inputs, providing insight into the representational power of the model established by network architectural properties as well as training data. By introducing a label of different design, we remove the dependency on ground truth labels for gradient generation during inference. We show that our gradient-based approach allows for capturing the anomaly in inputs based on the effective expressivity of the models with no hyperparameter tuning or additional processing, and outperforms state-of-the-art methods for adversarial and out-of-distribution detection.
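    One way to realize the confounding-label idea is sketched below, using a uniform target as the label that differs from anything seen in training; the model, loss choice, and scoring rule are placeholders:

```python
import torch
import torch.nn.functional as F

def gradient_score(model, x, num_classes):
    model.zero_grad()
    logits = model(x)
    # A "confounding" target unlike any one-hot training label.
    confound = torch.full_like(logits, 1.0 / num_classes)
    loss = F.kl_div(F.log_softmax(logits, dim=1), confound,
                    reduction='batchmean')
    loss.backward()
    # Larger gradient norms suggest more change is needed to fit the input,
    # flagging it as anomalous.
    return sum(p.grad.norm() for p in model.parameters() if p.grad is not None)

net = torch.nn.Linear(32, 10)
score = gradient_score(net, torch.randn(1, 32), num_classes=10)
```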
    Know your audience: specializing grounded language models with the game of Dixit. (arXiv:2206.08349v1 [cs.LG])
    Effective communication requires adapting to the idiosyncratic common ground shared with each communicative partner. We study a particularly challenging instantiation of this problem: the popular game Dixit. We formulate a round of Dixit as a multi-agent image reference game where a (trained) speaker model is rewarded for describing a target image such that one (pretrained) listener model can correctly identify it from a pool of distractors, but another listener cannot. To adapt to this setting, the speaker must exploit differences in the common ground it shares with the different listeners. We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization from rewards only, without direct supervision. In a series of controlled experiments, we show that the speaker can adapt according to the idiosyncratic strengths and weaknesses of various pairs of different listeners. Furthermore, we show zero-shot transfer of the speaker's specialization to unseen real-world data. Our experiments offer a step towards adaptive communication in complex multi-partner settings and highlight the interesting research challenges posed by games like Dixit. We hope that our work will inspire creative new approaches to adapting pretrained models.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v1 [cs.LG])
Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning (one key step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the key limitations of the state-of-the-art approach. We achieve this edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. More broadly, our unsupervised partitioning approach is shown to be a promising alternative to many widely used clustering methods like K-means and DBSCAN.
    FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems. (arXiv:2206.07796v1 [cs.SE])
    Source code repositories consist of large codebases, often containing error-prone programs. The increasing complexity of software has led to a drastic rise in time and costs for identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, there are not many tools and datasets available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines, and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, while execution-based methods evaluate programs through all cases and scenarios specifically designed for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.
    Beyond Adult and COMPAS: Fairness in Multi-Class Prediction. (arXiv:2206.07801v1 [cs.LG])
    We consider the problem of producing fair probabilistic classifiers for multi-class classification tasks. We formulate this problem in terms of "projecting" a pre-trained (and potentially unfair) classifier onto the set of models that satisfy target group-fairness requirements. The new, projected model is given by post-processing the outputs of the pre-trained classifier by a multiplicative factor. We provide a parallelizable iterative algorithm for computing the projected classifier and derive both sample complexity and convergence guarantees. Comprehensive numerical comparisons with state-of-the-art benchmarks demonstrate that our approach maintains competitive performance in terms of accuracy-fairness trade-off curves, while achieving favorable runtime on large datasets. We also evaluate our method at scale on an open dataset with multiple classes, multiple intersectional protected groups, and over 1M samples.
    Introducing the Huber mechanism for differentially private low-rank matrix completion. (arXiv:2206.07910v1 [cs.CR])
Performing low-rank matrix completion with sensitive user data calls for privacy-preserving approaches. In this work, we propose a novel noise addition mechanism for preserving differential privacy, where the noise distribution is inspired by Huber loss, a well-known loss function in robust statistics. The proposed Huber mechanism is evaluated against existing differential privacy mechanisms while solving the matrix completion problem using the Alternating Least Squares approach. We also propose using the Iteratively Re-Weighted Least Squares algorithm to complete low-rank matrices and study the performance of different noise mechanisms on both synthetic and real datasets. We prove that the proposed mechanism achieves $\epsilon$-differential privacy, similar to the Laplace mechanism. Furthermore, empirical results indicate that the Huber mechanism outperforms the Laplace and Gaussian mechanisms in some cases and is comparable otherwise.
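For reference, the $\epsilon$-differential-privacy guarantee that the Huber mechanism is shown to match is the standard one satisfied by the classical Laplace mechanism:

```latex
% Laplace mechanism (the classical baseline): for a query f with
% L1-sensitivity \Delta_1 = \max_{D \sim D'} \| f(D) - f(D') \|_1,
% releasing \mathcal{M}(D) = f(D) + (Z_1, \dots, Z_k) with
% Z_i \sim \mathrm{Lap}(\Delta_1 / \epsilon) i.i.d. satisfies
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\, \Pr[\mathcal{M}(D') \in S]
\quad \text{for all neighboring } D, D' \text{ and all measurable } S.
```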
    On the Surprising Behaviour of node2vec. (arXiv:2206.08252v1 [cs.LG])
    Graph embedding techniques are a staple of modern graph learning research. When using embeddings for downstream tasks such as classification, information about their stability and robustness, i.e., their susceptibility to sources of noise, stochastic effects, or specific parameter choices, becomes increasingly important. As one of the most prominent graph embedding schemes, we focus on node2vec and analyse its embedding quality from multiple perspectives. Our findings indicate that embedding quality is unstable with respect to parameter choices, and we propose strategies to remedy this in practice.
    Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks. (arXiv:2206.07741v1 [cs.LG])
The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing the parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts at quantizing DNNs have employed a range of techniques encompassing progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge computing. Our method establishes a new Pareto frontier in model accuracy and memory footprint, demonstrating a range of quantized models delivering best-in-class accuracy below 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware heterogeneous differentiable quantization with tensor-sliced learned precision, (ii) targeted gradient modification for wgts. and acts. to mitigate quantization errors, and (iii) a multi-phase learning schedule to address instability in learning arising from updates to the learned quantizer and model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models including EfficientNet-Lite0 (e.g., 4.14MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51MB wgts. and acts. at 65.39% accuracy).
    Adaptive Expert Models for Personalization in Federated Learning. (arXiv:2206.07832v1 [cs.LG])
Federated Learning (FL) is a promising framework for distributed learning when data is private and sensitive. However, the state-of-the-art solutions in this framework are not optimal when data is heterogeneous and non-Independent and Identically Distributed (non-IID). We propose a practical and robust approach to personalization in FL that adjusts to heterogeneous and non-IID data by balancing exploration and exploitation of several global models. To achieve our aim of personalization, we use a Mixture of Experts (MoE) that learns to group clients that are similar to each other, while using the global models more efficiently. We show that our approach achieves an accuracy up to 29.78% better, and up to 4.38% better compared to a local model, in a pathological non-IID setting, even though we tune our approach in the IID setting.
    Explainable Models via Compression of Tree Ensembles. (arXiv:2206.07904v1 [cs.LG])
Ensemble models (bagging and gradient-boosting) of relational decision trees have proved to be one of the most effective learning methods in the area of probabilistic logic models (PLMs). While effective, they lose one of the most important aspects of PLMs -- interpretability. In this paper we consider the problem of compressing a large set of learned trees into a single explainable model. To this end, we propose CoTE -- Compression of Tree Ensembles -- which produces a single small decision list as a compressed representation. CoTE first converts the trees to decision lists and then performs the combination and compression with the aid of the original training set. An experimental evaluation demonstrates the effectiveness of CoTE on several benchmark relational data sets.
    Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation. (arXiv:2206.08366v1 [cs.LG])
Bayesian Optimization (BO) has shown great promise for the global optimization of functions that are expensive to evaluate, but despite many successes, standard approaches can struggle in high dimensions. To improve the performance of BO, prior work suggested incorporating gradient information into a Gaussian process surrogate of the objective, giving rise to kernel matrices of size $nd \times nd$ for $n$ observations in $d$ dimensions. Naïvely multiplying with (resp. inverting) these matrices requires $\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3)$) operations, which becomes infeasible for moderate dimensions and sample sizes. Here, we observe that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel classes, we derive a programmatic approach to leveraging this type of structure for transformations and combinations of the discussed kernel classes, which constitutes a structure-aware automatic differentiation algorithm. Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels without any additional derivations, enabling flexible, problem-dependent modeling while scaling first-order BO to high $d$.
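As a worked instance of the kind of structure being exploited (standard calculus, not reproduced from the paper), the $d \times d$ gradient-covariance blocks of the isotropic RBF kernel are a scaled identity plus a rank-one term, so each block-vector product costs $\mathcal{O}(d)$ rather than $\mathcal{O}(d^2)$:

```latex
% For k(x, x') = \exp\!\bigl(-\|x - x'\|^2 / (2\ell^2)\bigr) and r = x - x',
\nabla_x \nabla_{x'}^{\top} k(x, x')
  \;=\; \frac{k(x, x')}{\ell^2}
        \left( I_d - \frac{r\, r^{\top}}{\ell^2} \right),
% a scaled identity plus a rank-one correction: multiplying a vector by
% each d x d block costs O(d), which is the source of the O(n^2 d) matvec.
```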
    Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization. (arXiv:2206.07882v1 [cs.CL])
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model (acoustic encoder and language model) and achieve near-iso-accuracy. We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance while limiting the computational overhead of QAT. Density ratio Language Model fusion has shown remarkable accuracy gains on RNN-T workloads, but it severely increases the computational cost of inference. We show that our quantization strategies enable using large beam widths for hypothesis search while achieving streaming-compatible runtimes and a full model compression ratio of 7.6$\times$ compared to the full precision model. Via hardware simulations, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T inclusive of LM fusion, resulting in a Real Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by $>$1.5%.
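A minimal sketch of the generic building block behind such schemes (symmetric 4-bit "fake quantization" as used in quantization-aware training; illustrative, not the paper's customized scheme):

```python
# Minimal sketch: symmetric INT4 fake quantization with optional
# per-channel scales, as commonly used inside QAT forward passes.
import numpy as np

def fake_quant_int4(x: np.ndarray, per_channel_axis=None):
    qmin, qmax = -8, 7                               # INT4 symmetric range
    axis = tuple(i for i in range(x.ndim) if i != per_channel_axis) \
        if per_channel_axis is not None else None
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid divide-by-zero
    q = np.clip(np.round(x / scale), qmin, qmax)     # quantize
    return q * scale                                 # dequantize

w = np.random.randn(16, 8).astype(np.float32)
w_q = fake_quant_int4(w, per_channel_axis=0)         # per-output-channel scales
print(np.abs(w - w_q).max())                         # quantization error
```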
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v1 [cs.LG])
    Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as previous work shows, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, which implies that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
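For context, a minimal sketch of the single-distribution smoothing baseline that DSRS tightens (a la Cohen et al.; `classify` and the plug-in estimate of $p_A$ are simplifications -- a proper implementation uses a lower confidence bound):

```python
# Minimal sketch of standard randomized smoothing: Monte Carlo estimate of
# the smoothed classifier's top class and its certified L2 radius
# R = sigma * Phi^{-1}(p_A).
import numpy as np
from scipy.stats import norm

def certify(classify, x, sigma=0.25, n=1000, num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        counts[classify(x + sigma * rng.standard_normal(x.shape))] += 1
    top = int(counts.argmax())
    p_a = min(counts[top] / n, 1 - 1e-6)   # cap to avoid an infinite radius
    if p_a <= 0.5:
        return top, 0.0                    # abstain: no certificate
    return top, sigma * norm.ppf(p_a)      # certified L2 radius

# `classify` is any base classifier mapping an input array to a class id.
```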
    Taxonomy of Benchmarks in Graph Representation Learning. (arXiv:2206.07729v1 [cs.LG])
    Graph Neural Networks (GNNs) extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNN models with superior performance according to a collection of graph representation learning benchmarks, it is currently not well understood what aspects of a given model are probed by them. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a $\textit{sensitivity profile}$ that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in selection and development of adequate graph benchmarks, and better informed evaluation of future GNN methods. Finally, our approach and implementation in $\texttt{GTaxoGym}$ package are extendable to multiple graph prediction task types and future datasets.
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v1 [cs.LG])
Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as Brain-Computer Interface (BCI) or Human-Computer Interface (HCI) applications. We address this by creating an algorithm and analysis which allow IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Active Nearest Neighbor Regression Through Delaunay Refinement. (arXiv:2206.08061v1 [cs.LG])
    We introduce an algorithm for active function approximation based on nearest neighbor regression. Our Active Nearest Neighbor Regressor (ANNR) relies on the Voronoi-Delaunay framework from computational geometry to subdivide the space into cells with constant estimated function value and select novel query points in a way that takes the geometry of the function graph into account. We consider the recent state-of-the-art active function approximator called DEFER, which is based on incremental rectangular partitioning of the space, as the main baseline. The ANNR addresses a number of limitations that arise from the space subdivision strategy used in DEFER. We provide a computationally efficient implementation of our method, as well as theoretical halting guarantees. Empirical results show that ANNR outperforms the baseline for both closed-form functions and real-world examples, such as gravitational wave parameter inference and exploration of the latent space of a generative model.
    Adversarial Patch Attacks and Defences in Vision-Based Tasks: A Survey. (arXiv:2206.08304v1 [cs.CV])
Adversarial attacks on deep learning models, especially for safety-critical systems, have gained increasing attention in recent years, due to the lack of trust in the security and robustness of AI models. Yet the more primitive adversarial attacks might be physically infeasible or require resources that are hard to access, such as the training data, which motivated the emergence of patch attacks. In this survey, we provide a comprehensive overview of existing techniques for adversarial patch attacks, aiming to help interested researchers quickly catch up with progress in this field. We also discuss existing techniques for developing detection and defences against adversarial patches, to help the community better understand this field and its real-world applications.
    Boosting the Adversarial Transferability of Surrogate Model with Dark Knowledge. (arXiv:2206.08316v1 [cs.LG])
Deep neural networks (DNNs) for image classification are known to be vulnerable to adversarial examples. Moreover, adversarial examples exhibit transferability, which means an adversarial example for a DNN model can fool another black-box model with a non-trivial probability. This gave birth to the transfer-based adversarial attack, where the adversarial examples generated by a pretrained or known model (called the surrogate model) are used to conduct black-box attacks. There is some work on how to generate adversarial examples from a given surrogate model to achieve better transferability. However, training a special surrogate model to generate adversarial examples with better transferability is relatively under-explored. In this paper, we propose a method of training a surrogate model with abundant dark knowledge to boost the adversarial transferability of the adversarial examples generated by the surrogate model. This trained surrogate model is named the dark surrogate model (DSM), and the proposed method to train a DSM consists of two key components: a teacher model extracting dark knowledge and providing soft labels, and a mixing augmentation skill which enhances the dark knowledge of the training data. Extensive experiments show that the proposed method can substantially improve the adversarial transferability of the surrogate model across different surrogate architectures and optimizers used for generating adversarial examples. We also show that the proposed method can be applied to other transfer-based attack scenarios that contain dark knowledge, such as face verification.
    The convergent Indian buffet process. (arXiv:2206.08002v1 [stat.ML])
    We propose a new Bayesian nonparametric prior for latent feature models, which we call the convergent Indian buffet process (CIBP). We show that under the CIBP, the number of latent features is distributed as a Poisson distribution with the mean monotonically increasing but converging to a certain value as the number of objects goes to infinity. That is, the expected number of features is bounded above even when the number of objects goes to infinity, unlike the standard Indian buffet process under which the expected number of features increases with the number of objects. We provide two alternative representations of the CIBP based on a hierarchical distribution and a completely random measure, respectively, which are of independent interest. The proposed CIBP is assessed on a high-dimensional sparse factor model.
    Learning Physics between Digital Twins with Low-Fidelity Models and Physics-Informed Gaussian Processes. (arXiv:2206.08201v1 [stat.ML])
A digital twin is a computer model that represents an individual, for example, a component, a patient or a process. In many situations, we want to gain knowledge about an individual from its data while incorporating imperfect physical knowledge and also learn from data from other individuals. In this paper, we introduce and demonstrate a fully Bayesian methodology for learning between digital twins in a setting where the physical parameters of each individual are of interest. For each individual, the methodology is based on Bayesian calibration with model discrepancy. Through the discrepancy, modelled as a Gaussian process, the imperfect low-fidelity physical model is accounted for. Using ideas from Bayesian hierarchical models, a joint probabilistic model of digital twins is constructed by connecting them through a new level in the hierarchy. For the physical parameters, the methodology can be seen as using a prior distribution in the individual model that is the posterior of the corresponding hyperparameter in the joint model. For learning the imperfect physics between individuals, two approaches are introduced: one that assumes the same discrepancy for all individuals, and one that can be seen as using a prior learned from all individuals for the parameters of the Gaussian processes representing the discrepancies. Based on recent advances related to physics-informed priors, Hamiltonian Monte Carlo methods, and their use for inverse problems, we set up an inference methodology that keeps our approach computationally feasible also for physical models based on partial differential equations and for individual data that are not aligned. The methodology is demonstrated in two synthetic case studies: a toy example previously used in the literature, extended to more individuals, and an example based on a cardiovascular differential equation model relevant for the treatment of hypertension.
    MoDi: Unconditional Motion Synthesis from Diverse Data. (arXiv:2206.08010v1 [cs.GR])
    The emergence of neural networks has revolutionized the field of motion synthesis. Yet, learning to unconditionally synthesize motions from a given distribution remains a challenging task, especially when the motions are highly diverse. We present MoDi, an unconditional generative model that synthesizes diverse motions. Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset and yields a well-behaved, highly semantic latent space. The design of our model follows the prolific architecture of StyleGAN and adapts two of its key technical components into the motion domain: a set of style-codes injected into each level of the generator hierarchy and a mapping function that learns and forms a disentangled latent space. We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered, and facilitates semantic editing and motion interpolation. In addition, we propose a technique to invert unseen motions into the latent space, and demonstrate latent-based motion editing operations that otherwise cannot be achieved by naive manipulation of explicit motion representations. Our qualitative and quantitative experiments show that our framework achieves state-of-the-art synthesis quality that can follow the distribution of highly diverse motion datasets. Code and trained models will be released at https://sigal-raab.github.io/MoDi.
    A Contextual Combinatorial Semi-Bandit Approach to Network Bottleneck Identification. (arXiv:2206.08144v1 [cs.LG])
    Bottleneck identification is a challenging task in network analysis, especially when the network is not fully specified. To address this task, we develop a unified online learning framework based on combinatorial semi-bandits that performs bottleneck identification alongside learning the specifications of the underlying network. Within this framework, we adapt and investigate several combinatorial semi-bandit methods such as epsilon-greedy, LinUCB, BayesUCB, and Thompson Sampling. Our framework is able to employ contextual information in the form of contextual bandits. We evaluate our framework on the real-world application of road networks and demonstrate its effectiveness in different settings.
    A Truthful Owner-Assisted Scoring Mechanism. (arXiv:2206.08149v1 [cs.LG])
    Alice (owner) has knowledge of the underlying quality of her items measured in grades. Given the noisy grades provided by an independent party, can Bob (appraiser) obtain accurate estimates of the ground-truth grades of the items by asking Alice a question about the grades? We address this when the payoff to Alice is additive convex utility over all her items. We establish that if Alice has to truthfully answer the question so that her payoff is maximized, the question must be formulated as pairwise comparisons between her items. Next, we prove that if Alice is required to provide a ranking of her items, which is the most fine-grained question via pairwise comparisons, she would be truthful. By incorporating the ground-truth ranking, we show that Bob can obtain an estimator with the optimal squared error in certain regimes based on any possible way of truthful information elicitation. Moreover, the estimated grades are substantially more accurate than the raw grades when the number of items is large and the raw grades are very noisy. Finally, we conclude the paper with several extensions and some refinements for practical considerations.
    Fault-Tolerant Collaborative Inference through the Edge-PRUNE Framework. (arXiv:2206.08152v1 [cs.LG])
    Collaborative inference has received significant research interest in machine learning as a vehicle for distributing computation load, reducing latency, as well as addressing privacy preservation in communications. Recent collaborative inference frameworks have adopted dynamic inference methodologies such as early-exit and run-time partitioning of neural networks. However, as machine learning frameworks scale in the number of inference inputs, e.g., in surveillance applications, fault tolerance related to device failure needs to be considered. This paper presents the Edge-PRUNE distributed computing framework, built on a formally defined model of computation, which provides a flexible infrastructure for fault tolerant collaborative inference. The experimental section of this work shows results on achievable inference time savings by collaborative inference, presents fault tolerant system topologies and analyzes their cost in terms of execution time overhead.
    Gradient Descent for Low-Rank Functions. (arXiv:2206.08257v1 [cs.LG])
Several recent empirical studies demonstrate that important machine learning tasks, e.g., training deep neural networks, exhibit low-rank structure, where the loss function varies significantly in only a few directions of the input space. In this paper, we leverage such low-rank structure to reduce the high computational cost of canonical gradient-based methods such as gradient descent (GD). Our proposed \emph{Low-Rank Gradient Descent} (LRGD) algorithm finds an $\epsilon$-approximate stationary point of a $p$-dimensional function by first identifying $r \leq p$ significant directions, and then estimating the true $p$-dimensional gradient at every iteration by computing directional derivatives only along those $r$ directions. We establish that the "directional oracle complexities" of LRGD for strongly convex and non-convex objective functions are $\mathcal{O}(r \log(1/\epsilon) + rp)$ and $\mathcal{O}(r/\epsilon^2 + rp)$, respectively. When $r \ll p$, these complexities are smaller than the known complexities of $\mathcal{O}(p \log(1/\epsilon))$ and $\mathcal{O}(p/\epsilon^2)$ of GD in the strongly convex and non-convex settings, respectively. Thus, LRGD significantly reduces the computational cost of gradient-based methods for sufficiently low-rank functions. In the course of our analysis, we also formally define and characterize the classes of exact and approximately low-rank functions.
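A minimal sketch of the LRGD idea (assumed details: finite-difference directional derivatives and known directions; the paper also covers how the $r$ significant directions are identified):

```python
# Minimal sketch: estimate the gradient only along r significant directions
# via central finite differences, then descend in their span.
import numpy as np

def lrgd_step(f, x, directions, lr=0.1, h=1e-5):
    # `directions`: (r, p) orthonormal rows spanning the active subspace.
    g_dir = np.array([(f(x + h * u) - f(x - h * u)) / (2 * h)
                      for u in directions])          # r directional derivatives
    grad_est = directions.T @ g_dir                  # lift back to R^p
    return x - lr * grad_est

# Example: a 10-dimensional function that only varies along 2 coordinates.
f = lambda x: (x[0] - 3) ** 2 + 2 * (x[1] + 1) ** 2
U = np.eye(10)[:2]                                   # r=2 known directions
x = np.zeros(10)
for _ in range(200):
    x = lrgd_step(f, x, U)
print(x[:2])                                         # -> approximately [3, -1]
```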
    Constrained Submodular Optimization for Vaccine Design. (arXiv:2206.08336v1 [q-bio.QM])
    Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
    Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning. (arXiv:2206.08307v1 [cs.LG])
We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers which have varying computation and communication frequency over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return those to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{\max}\epsilon^{-1}\right)$ iterations, where $\sigma$ denotes the variance of stochastic gradients. In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \sqrt{\tau_{\max}\tau_{avg}}\epsilon^{-1}\right)$ without any change in the algorithm, where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2}+ \tau_{avg}\epsilon^{-1}\right)$, and does not require any extra hyperparameter tuning nor extra communications. Our result allows us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.
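One simple way to realize a delay-adaptive step size (the functional form here is an assumption for illustration; the paper's exact schedule may differ) is to damp the step in proportion to how stale a gradient is:

```python
# Delay-adaptive learning rate (assumed form): full step for fresh
# gradients, damped step for stale ones.
def delay_adaptive_lr(base_lr: float, delay: int, threshold: int) -> float:
    # `threshold` could be, e.g., a running estimate of the average delay.
    return base_lr if delay <= threshold else base_lr * threshold / delay

print(delay_adaptive_lr(0.1, delay=3, threshold=10))    # 0.1  (fresh gradient)
print(delay_adaptive_lr(0.1, delay=100, threshold=10))  # 0.01 (stale gradient)
```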
    Generalized Leverage Scores: Geometric Interpretation and Applications. (arXiv:2206.08054v1 [cs.LG])
    In problems involving matrix computations, the concept of leverage has found a large number of applications. In particular, leverage scores, which relate the columns of a matrix to the subspaces spanned by its leading singular vectors, are helpful in revealing column subsets to approximately factorize a matrix with quality guarantees. As such, they provide a solid foundation for a variety of machine-learning methods. In this paper we extend the definition of leverage scores to relate the columns of a matrix to arbitrary subsets of singular vectors. We establish a precise connection between column and singular-vector subsets, by relating the concepts of leverage scores and principal angles between subspaces. We employ this result to design approximation algorithms with provable guarantees for two well-known problems: generalized column subset selection and sparse canonical correlation analysis. We run numerical experiments to provide further insight on the proposed methods. The novel bounds we derive improve our understanding of fundamental concepts in matrix approximations. In addition, our insights may serve as building blocks for further contributions.
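For orientation, the standard definition being generalized (a known fact, not new material from the paper):

```latex
% With A = U \Sigma V^\top and V_k the matrix of top-k right singular
% vectors, the (rank-k) leverage score of column j of A is
\ell_j^{(k)} \;=\; \bigl\| V_k^\top e_j \bigr\|_2^2 ,
% and the generalization studied here replaces V_k by the singular vectors
% indexed by an arbitrary subset S: \ell_j^{(S)} = \| V_S^\top e_j \|_2^2 .
```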
    SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation. (arXiv:2206.08367v1 [cs.CV])
    Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous driving systems. Existing image and video driving datasets, however, fall short of capturing the mutable nature of the real world. In this paper, we introduce the largest multi-task synthetic dataset for autonomous driving, SHIFT. It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density. Featuring a comprehensive sensor suite and annotations for several mainstream perception tasks, SHIFT allows investigating the degradation of a perception system performance at increasing levels of domain shift, fostering the development of continuous adaptation strategies to mitigate this problem and assess model robustness and generality. Our dataset and benchmark toolkit are publicly available at www.vis.xyz/shift.
    Rank the triplets: A ranking-based multiple instance learning framework for detecting HPV infection in head and neck cancers using routine H&E images. (arXiv:2206.08275v1 [cs.CV])
The aetiology of head and neck squamous cell carcinoma (HNSCC) involves multiple carcinogens such as alcohol, tobacco and infection with human papillomavirus (HPV). As the HPV infection influences the prognosis, treatment and survival of patients with HNSCC, it is important to determine the HPV status of these tumours. In this paper, we propose a novel triplet-ranking loss function and a multiple instance learning pipeline for HPV status prediction. This achieves a new state-of-the-art performance in HPV detection, using only routine H&E stained WSIs on two HNSCC cohorts. Furthermore, a comprehensive tumour microenvironment profiling was performed, which characterised the unique patterns between HPV+/- HNSCC from genomic, immunology and cellular perspectives. Positive correlations of the proposed score with different subtypes of T cells (e.g. T cells follicular helper, CD8+ T cells), and negative correlations with macrophages and connective cells (e.g. fibroblasts), were identified, which is in line with clinical findings. Unique gene expression profiles were also identified with respect to HPV infection status, in line with existing findings.
    Deep Neural Imputation: A Framework for Recovering Incomplete Brain Recordings. (arXiv:2206.08094v1 [cs.LG])
    Neuroscientists and neuroengineers have long relied on multielectrode neural recordings to study the brain. However, in a typical experiment, many factors corrupt neural recordings from individual electrodes, including electrical noise, movement artifacts, and faulty manufacturing. Currently, common practice is to discard these corrupted recordings, reducing already limited data that is difficult to collect. To address this challenge, we propose Deep Neural Imputation (DNI), a framework to recover missing values from electrodes by learning from data collected across spatial locations, days, and participants. We explore our framework with a linear nearest-neighbor approach and two deep generative autoencoders, demonstrating DNI's flexibility. One deep autoencoder models participants individually, while the other extends this architecture to model many participants jointly. We evaluate our models across 12 human participants implanted with multielectrode intracranial electrocorticography arrays; participants had no explicit task and behaved naturally across hundreds of recording hours. We show that DNI recovers not only time series but also frequency content, and further establish DNI's practical value by recovering significant performance on a scientifically-relevant downstream neural decoding task.
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v2 [cs.LG] UPDATED)
    This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) are beneficial in the test accuracy performance of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at \url{https://github.com/lnsmith54/CFL}.
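A minimal sketch of one such schedule (cyclical weight decay; treating larger decay as the "hard" middle phase is an assumption of this sketch, and the paper's exact schedule may differ):

```python
# Cyclical weight decay (one plausible instantiation): mild regularization
# at the start and end of training, strongest during the middle epochs.
import math

def cyclical_weight_decay(epoch, total_epochs, wd_min=1e-5, wd_max=1e-3):
    t = epoch / max(1, total_epochs - 1)              # progress in [0, 1]
    return wd_min + (wd_max - wd_min) * math.sin(math.pi * t)

for e in (0, 25, 50, 75, 99):
    print(e, f"{cyclical_weight_decay(e, 100):.2e}")  # peaks mid-training
```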
    Active Learning on a Budget: Opposite Strategies Suit High and Low Budgets. (arXiv:2202.02794v4 [cs.LG] UPDATED)
    Investigating active learning, we focus on the relation between the number of labeled examples (budget size), and suitable querying strategies. Our theoretical analysis shows a behavior reminiscent of phase transition: typical examples are best queried when the budget is low, while unrepresentative examples are best queried when the budget is large. Combined evidence shows that a similar phenomenon occurs in common classification models. Accordingly, we propose TypiClust -- a deep active learning strategy suited for low budgets. In a comparative empirical investigation of supervised learning, using a variety of architectures and image datasets, TypiClust outperforms all other active learning strategies in the low-budget regime. Using TypiClust in the semi-supervised framework, performance gets an even more significant boost. In particular, state-of-the-art semi-supervised methods trained on CIFAR-10 with 10 labeled examples selected by TypiClust, reach 93.2% accuracy -- an improvement of 39.4% over random selection. Code is available at https://github.com/avihu111/TypiClust.
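A minimal sketch of the strategy as described (assumed details, e.g. the choice of clustering and the exact typicality estimator):

```python
# Minimal sketch: cluster the unlabeled pool, then query the most "typical"
# point of each cluster, where typicality is the inverse mean distance to
# the K nearest neighbors in feature space.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def typiclust_select(features, budget, k=20):
    labels = KMeans(n_clusters=budget, n_init=10).fit_predict(features)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    dist, _ = nn.kneighbors(features)
    typicality = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)  # skip self at 0
    picks = [np.flatnonzero(labels == c)[np.argmax(typicality[labels == c])]
             for c in range(budget)]
    return np.array(picks)

X = np.random.randn(500, 32)            # e.g. self-supervised embeddings
print(typiclust_select(X, budget=10))   # indices of 10 queries to label
```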
    Data-Free Adversarial Knowledge Distillation for Graph Neural Networks. (arXiv:2205.03811v2 [cs.LG] UPDATED)
Graph neural networks (GNNs) have been widely used in modeling graph structured data, owing to their impressive performance in a wide range of practical applications. Recently, knowledge distillation (KD) for GNNs has enabled remarkable progress in graph model compression and knowledge transfer. However, most of the existing KD methods require a large volume of real data, which are not readily available in practice, and which may preclude their applicability in scenarios where the teacher model is trained on rare or hard-to-acquire datasets. To address this problem, we propose the first end-to-end framework for data-free adversarial knowledge distillation on graph structured data (DFAD-GNN). To be specific, our DFAD-GNN employs a generative adversarial network, which mainly consists of three components: a pre-trained teacher model and a student model are regarded as two discriminators, and a generator is utilized for deriving training graphs to distill knowledge from the teacher model into the student model. Extensive experiments on various benchmark models and six representative datasets demonstrate that our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
    Cyclocopula Technique to Study the Relationship Between Two Cyclostationary Time Series with Fractional Brownian Motion Errors. (arXiv:2206.07976v1 [stat.ME])
Detection of relationships between two time series is important in environmental and hydrological studies. Several parametric and non-parametric approaches can be applied to detect such relationships. These techniques are usually sensitive to stationarity assumptions. In this research, a new copula-based method is introduced to detect the relationship between two cyclostationary time series with fractional Brownian motion (fBm) errors. Numerical studies verify the performance of the introduced approach.
    User Engagement and Churn in Mobile Health Applications. (arXiv:2206.08178v1 [stat.ML])
    Mobile health apps are revolutionizing the healthcare ecosystem by improving communication, efficiency, and quality of service. In low- and middle-income countries, they also play a unique role as a source of information about health outcomes and behaviors of patients and healthcare workers, while providing a suitable channel to deliver both personalized and collective policy interventions. We propose a framework to study user engagement with mobile health, focusing on healthcare workers and digital health apps designed to support them in resource-poor settings. The behavioral logs produced by these apps can be transformed into daily time series characterizing each user's activity. We use probabilistic and survival analysis to build multiple personalized measures of meaningful engagement, which could serve to tailor content and digital interventions suiting each health worker's specific needs. Special attention is given to the problem of detecting churn, understood as a marker of complete disengagement. We discuss the application of our methods to the Indian and Ethiopian users of the Safe Delivery App, a capacity-building tool for skilled birth attendants. This work represents an important step towards a full characterization of user engagement in mobile health applications, which can significantly enhance the abilities of health workers and, ultimately, save lives.
    ResNorm: Tackling Long-tailed Degree Distribution Issue in Graph Neural Networks via Normalization. (arXiv:2206.08181v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted much attention due to their ability in learning representations from graph-structured data. Despite the successful applications of GNNs in many domains, the optimization of GNNs is less well studied, and the performance on node classification heavily suffers from the long-tailed node degree distribution. This paper focuses on improving the performance of GNNs via normalization. In detail, by studying the long-tailed distribution of node degrees in the graph, we propose a novel normalization method for GNNs, which is termed ResNorm (\textbf{Res}haping the long-tailed distribution into a normal-like distribution via \textbf{norm}alization). The $scale$ operation of ResNorm reshapes the node-wise standard deviation (NStd) distribution so as to improve the accuracy of tail nodes (\textit{i}.\textit{e}., low-degree nodes). We provide a theoretical interpretation and empirical evidence for understanding the mechanism of the above $scale$. In addition to the long-tailed distribution issue, over-smoothing is also a fundamental issue plaguing the community. To this end, we analyze the behavior of the standard shift and prove that the standard shift serves as a preconditioner on the weight matrix, increasing the risk of over-smoothing. With the over-smoothing issue in mind, we design a $shift$ operation for ResNorm that simulates the degree-specific parameter strategy in a low-cost manner. Extensive experiments have validated the effectiveness of ResNorm on several node classification benchmark datasets.
    PROFHIT: Probabilistic Robust Forecasting for Hierarchical Time-series. (arXiv:2206.07940v1 [cs.LG])
Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecast distributions. Recent state-of-the-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of the distribution, which does not account for the coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with given hierarchical relations and do not adapt to real-world datasets that show deviation from this assumption. We close both these gaps and propose PROFHIT, a fully probabilistic hierarchical forecasting model that jointly models the forecast distribution of the entire hierarchy. PROFHIT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn hierarchical relations over the entire forecast distribution, enabling robust and calibrated forecasts that adapt to datasets of varying hierarchical consistency. On evaluating PROFHIT over a wide range of datasets, we observed 41-88% better performance in accuracy and calibration. Due to modeling coherency over the full distribution, we observed that PROFHIT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing, where other methods' performance severely degrades by over 70%.
    iBoot: Image-bootstrapped Self-Supervised Video Representation Learning. (arXiv:2206.08339v1 [cs.CV])
Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on labeled video data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to \textit{subsume} the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e., in fewer epochs and with smaller batches) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.
    When a RF Beats a CNN and GRU, Together -- A Comparison of Deep Learning and Classical Machine Learning Approaches for Encrypted Malware Traffic Classification. (arXiv:2206.08004v1 [cs.CR])
Internet traffic classification is widely used to facilitate network management. It plays a crucial role in Quality of Services (QoS), Quality of Experience (QoE), network visibility, intrusion detection, and traffic trend analyses. While there is no theoretical guarantee that deep learning (DL)-based solutions perform better than classic machine learning (ML)-based ones, DL-based models have become the common default. This paper compares well-known DL-based and ML-based models and shows that in the case of malicious traffic classification, state-of-the-art DL-based solutions do not necessarily outperform the classical ML-based ones. We exemplify this finding using two well-known datasets for a varied set of tasks, such as: malware detection, malware family classification, detection of zero-day attacks, and classification of an iteratively growing dataset. Note that it is not feasible to evaluate all possible models to make a concrete statement; thus, the above finding is not a recommendation to avoid DL-based models, but rather empirical proof that in some cases simpler solutions may perform even better.
    Concentration of Data Encoding in Parameterized Quantum Circuits. (arXiv:2206.08273v1 [quant-ph])
Variational quantum algorithms have been acknowledged as a leading strategy to realize near-term quantum advantages in meaningful tasks, including machine learning and combinatorial optimization. When applied to tasks involving classical data, such algorithms generally begin with quantum circuits for data encoding and then train quantum neural networks (QNNs) to minimize target functions. Although QNNs have been widely studied to improve these algorithms' performance on practical tasks, there is a gap in systematically understanding the influence of data encoding on the eventual performance. In this paper, we make progress in filling this gap by considering the common data encoding strategies based on parameterized quantum circuits. We prove that, under reasonable assumptions, the distance between the average encoded state and the maximally mixed state can be explicitly upper-bounded with respect to the width and depth of the encoding circuit. In particular, this result implies that the average encoded state concentrates on the maximally mixed state at a rate exponential in the circuit depth. Such concentration seriously limits the capabilities of quantum classifiers and strictly restricts the distinguishability of encoded states from a quantum information perspective. We further support our findings by numerically verifying these results on both synthetic and public data sets. Our results highlight the significance of quantum data encoding in machine learning tasks and may shed light on future encoding strategies.
    Zero-Shot Video Question Answering via Frozen Bidirectional Language Models. (arXiv:2206.08155v1 [cs.CV])
    Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
    A Closer Look at Smoothness in Domain Adversarial Training. (arXiv:2206.08213v1 [cs.LG])
Domain adversarial training has been ubiquitous for achieving invariant representations and is used widely for various domain adaptation tasks. In recent times, methods converging to smooth optima have shown improved generalization for supervised learning tasks like classification. In this work, we analyze the effect of smoothness enhancing formulations on domain adversarial training, the objective of which is a combination of task loss (e.g., classification, regression) and adversarial terms. We find that converging to a smooth minimum with respect to (w.r.t.) task loss stabilizes the adversarial training, leading to better performance on the target domain. In contrast to task loss, our analysis shows that converging to a smooth minimum w.r.t. adversarial loss leads to sub-optimal generalization on the target domain. Based on the analysis, we introduce the Smooth Domain Adversarial Training (SDAT) procedure, which effectively enhances the performance of existing domain adversarial methods for both classification and object detection tasks. Our analysis also provides insight into the extensive usage of SGD over Adam in the community for domain adversarial training.
    MAGIC: Microlensing Analysis Guided by Intelligent Computation. (arXiv:2206.08199v1 [astro-ph.IM])
The modeling of binary microlensing light curves via the standard sampling-based method can be challenging, because of the time-consuming light curve computation and the pathological likelihood landscape in the high-dimensional parameter space. In this work, we present MAGIC, a machine learning framework to efficiently and accurately infer the microlensing parameters of binary events with realistic data quality. In MAGIC, binary microlensing parameters are divided into two groups and inferred separately with different neural networks. The key feature of MAGIC is the introduction of neural controlled differential equations, which provide the capability to handle light curves with irregular sampling and large data gaps. Based on simulated light curves, we show that MAGIC can achieve fractional uncertainties of a few percent on the binary mass ratio and separation. We also test MAGIC on a real microlensing event. MAGIC is able to locate the degenerate solutions even when large data gaps are introduced. As irregular sampling is common in astronomical surveys, our method also has implications for other studies that involve time series.
    Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use Case. (arXiv:2206.08309v1 [cs.LG])
    In recent years, deep generative models have attracted increasing interest due to their capacity to model complex distributions. Among those models, variational autoencoders have gained popularity as they have proven both to be computationally efficient and yield impressive results in multiple fields. Following this breakthrough, extensive research has been done in order to improve the original publication, resulting in a variety of different VAE models in response to different tasks. In this paper we present Pythae, a versatile open-source Python library providing both a unified implementation and a dedicated framework allowing straightforward, reproducible and reliable use of generative autoencoder models. We then propose to use this library to perform a case study benchmark where we present and compare 19 generative autoencoder models representative of some of the main improvements on downstream tasks such as image reconstruction, generation, classification, clustering and interpolation. The open-source library can be found at https://github.com/clementchadebec/benchmark_VAE.
    Functional Output Regression with Infimal Convolution: Exploring the Huber and $\epsilon$-insensitive Losses. (arXiv:2206.08220v1 [stat.ML])
The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work considers the square loss setting, we leverage extensions of the Huber and the $\epsilon$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms relying on duality to tackle the resulting tasks in the context of vector-valued reproducing kernel Hilbert spaces. The efficiency of the approach is demonstrated and contrasted with the classical squared loss setting on both synthetic and real-world benchmarks.
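A concrete instance of such a convoluted loss (standard convex analysis, not reproduced from the paper): the Huber loss arises as the infimal convolution of the squared and absolute losses,

```latex
% Huber loss as an infimal convolution of the squared and absolute losses:
L_\kappa(y)
  \;=\; \Bigl( \tfrac{1}{2}\,|\cdot|^2 \;\square\; \kappa\,|\cdot| \Bigr)(y)
  \;=\; \min_{z \in \mathbb{R}} \; \tfrac{1}{2} z^2 + \kappa\,|y - z|
  \;=\;
  \begin{cases}
    \tfrac{1}{2}\, y^2, & |y| \le \kappa,\\
    \kappa\,|y| - \tfrac{1}{2}\,\kappa^2, & |y| > \kappa,
  \end{cases}
% trading the square loss near zero for a linear, outlier-robust tail.
```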
    Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching. (arXiv:2206.08265v1 [stat.ML])
Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To fill this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining high generation quality.
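The ODE in question has the standard probability-flow form from the score-based SDE literature:

```latex
% Probability flow ODE: for a forward SDE dx = f(x, t)\,dt + g(t)\,dw,
\frac{dx}{dt} \;=\; f(x, t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x),
% which shares the marginals p_t of the SDE; substituting a learned score
% network s_\theta(x, t) for \nabla_x \log p_t(x) gives an ODE whose exact
% likelihood can be evaluated via the change-of-variables formula.
```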
    Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. (arXiv:2206.08311v1 [cs.LG])
    Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare by assisting decision-makers to answer "what-if" questions. Existing causal inference approaches typically consider regular, discrete-time intervals between observations and treatment decisions, and hence are unable to naturally model irregularly sampled data, which is the common setting in practice. To handle arbitrary observation patterns, we interpret the data as samples from an underlying continuous-time process and propose to model its latent trajectory explicitly using the mathematics of controlled differential equations. This leads to a new approach, the Treatment Effect Neural Controlled Differential Equation (TE-CDE), which allows the potential outcomes to be evaluated at any time point. In addition, adversarial training is used to adjust for time-dependent confounding, which is critical in longitudinal settings and is an added challenge not encountered in conventional time-series settings. To assess solutions to this problem, we propose a controllable simulation environment based on a model of tumor growth, covering a range of scenarios with irregular sampling reflective of a variety of clinical situations. TE-CDE consistently outperforms existing approaches in all simulated scenarios with irregular sampling.
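    Why controlled differential equations handle irregular sampling can be seen in a small sketch: the latent state is driven by increments of the observed path, so large gaps simply enter as large increments. The toy module below, with hypothetical names of our own choosing rather than the TE-CDE architecture, uses a plain Euler discretization of dz = f(z) dX:

```python
import torch
import torch.nn as nn

class TinyCDE(nn.Module):
    """Euler discretization of a neural controlled differential equation
    dz = f(z) dX. Illustrative sketch only; in practice X usually includes
    time itself as one channel so the dynamics see the elapsed gaps."""
    def __init__(self, obs_dim, hidden_dim):
        super().__init__()
        # f maps the hidden state to a (hidden_dim x obs_dim) matrix field
        self.f = nn.Sequential(nn.Linear(hidden_dim, 64), nn.Tanh(),
                               nn.Linear(64, hidden_dim * obs_dim))
        self.obs_dim, self.hidden_dim = obs_dim, hidden_dim

    def forward(self, X):                  # X: (batch, time, obs_dim), irregular
        z = torch.zeros(X.size(0), self.hidden_dim)
        for k in range(X.size(1) - 1):
            dX = X[:, k + 1] - X[:, k]     # the increment absorbs the gap size
            F = self.f(z).view(-1, self.hidden_dim, self.obs_dim)
            z = z + torch.bmm(F, dX.unsqueeze(-1)).squeeze(-1)
        return z                           # latent state at the final time
```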
    Inherent Inconsistencies of Feature Importance. (arXiv:2206.08204v1 [cs.LG])
    The black-box nature of modern machine learning techniques invokes a practical and ethical need for explainability. Feature importance aims to meet this need by assigning scores to features, so humans can understand their influence on predictions. Feature importance can be used to explain predictions under different settings: of the entire sample space or a specific instance; of model behavior, or the dependencies in the data themselves. However, in most cases thus far, each of these settings was studied in isolation. We attempt to develop a sound feature importance score framework by defining a small set of desired properties. Surprisingly, we prove an inconsistency theorem, showing that the expected properties cannot hold simultaneously. To overcome this difficulty, we propose the novel notion of re-partitioning the feature space into separable sets. Such sets are constructed to contain features that exhibit inter-set independence with respect to the target variable. We show that there exists a unique maximal partitioning into separable sets. Moreover, assigning scores to separable sets, instead of single features, unifies the results of commonly used feature importance scores and annihilates the inconsistencies we demonstrated.
    On Scaled Methods for Saddle Point Problems. (arXiv:2206.08303v1 [cs.LG])
    Methods with adaptive scaling of different features play a key role in solving saddle point problems (SPPs), primarily due to Adam's popularity for solving adversarial machine learning problems, including GAN training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RMSProp scaling, and the newer AdaHessian and OASIS based on the Hutchinson approximation. We use Extra Gradient and its improved version with negative momentum as the basic method. Experimental studies on GANs show good applicability not only for Adam, but also for other, less popular methods.
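    As a sketch of the family of methods under study, the following combines the Extra Gradient template (extrapolate, then update from the anchor) with RMSProp-style diagonal scaling on a generic min-max problem. This is an illustrative simplification under our own assumptions, not the paper's exact algorithm.

```python
import numpy as np

def scaled_extragradient(grad_x, grad_y, x, y, steps=5000, lr=0.02,
                         beta=0.999, eps=1e-8):
    """Extra Gradient with RMSProp-style per-coordinate scaling for
    min_x max_y f(x, y). Bias-corrected second moments are refreshed at
    both sub-steps, as in extra-Adam-style variants."""
    vx, vy = np.zeros_like(x, dtype=float), np.zeros_like(y, dtype=float)
    k = 0                                          # moment-update counter
    for _ in range(steps):
        for phase in ("extrapolate", "update"):
            gx, gy = grad_x(x, y), grad_y(x, y)
            k += 1
            vx = beta * vx + (1 - beta) * gx**2
            vy = beta * vy + (1 - beta) * gy**2
            dx = np.sqrt(vx / (1 - beta**k)) + eps  # bias-corrected scaling
            dy = np.sqrt(vy / (1 - beta**k)) + eps
            if phase == "extrapolate":
                x0, y0 = x, y                       # remember the anchor
                x, y = x - lr * gx / dx, y + lr * gy / dy
            else:                                   # step from the anchor using
                x, y = x0 - lr * gx / dx, y0 + lr * gy / dy  # extrapolated grads
    return x, y

# Toy SPP: f(x, y) = 0.5 x^2 + x y - 0.5 y^2 with saddle point at (0, 0)
x, y = scaled_extragradient(lambda x, y: x + y, lambda x, y: x - y,
                            np.array(1.0), np.array(1.0))
print(x, y)  # both coordinates end up near 0
```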
    Search-Based Testing Approach for Deep Reinforcement Learning Agents. (arXiv:2206.07813v1 [cs.SE])
    Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on a Deep Q-Learning agent which is widely used as a benchmark and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks.
    On the well-spread property and its relation to linear regression. (arXiv:2206.08092v1 [cs.LG])
    We consider the robust linear regression model $\boldsymbol{y} = X\beta^* + \boldsymbol{\eta}$, where an adversary oblivious to the design $X \in \mathbb{R}^{n \times d}$ may choose $\boldsymbol{\eta}$ to corrupt all but a (possibly vanishing) fraction of the observations $\boldsymbol{y}$ in an arbitrary way. Recent work [dLN+21, dNS21] has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector). In this paper, we show that there exists a family of design matrices lacking well-spreadness such that consistent recovery of the parameter vector in the above robust linear regression model is information-theoretically impossible. We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ Gaussian matrix is well-spread if the number of observations is quadratic in the ambient dimension. We complement this result by showing rigorous evidence -- in the form of a lower bound against low-degree polynomials -- of the computational hardness of this same certification problem when the number of observations is $o(d^2)$.
    Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency. (arXiv:2206.08222v1 [cs.CV])
    Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- and modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL approaches based on learning from partial image inputs generated via masking or cropping -- either by learning to predict the missing pixels, or learning representational invariances to such augmentations -- we propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs. PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated via a novel attention-conditioned masking strategy, to identify reliable candidates for self-training. Our simple approach leads to consistent performance gains over competing methods that use ViTs and self-supervised initializations on standard object recognition benchmarks. Code available at https://github.com/virajprabhu/PACMAC
    Linearity Grafting: Relaxed Neuron Pruning Helps Certifiable Robustness. (arXiv:2206.07839v1 [cs.LG])
    Certifiable robustness is a highly desirable property for adopting deep neural networks (DNNs) in safety-critical scenarios, but often demands tedious computations to establish. The main hurdle lies in the massive amount of non-linearity in large DNNs. To trade off the DNN expressiveness (which calls for more non-linearity) and robustness certification scalability (which prefers more linearity), we propose a novel solution to strategically manipulate neurons, by "grafting" appropriate levels of linearity. The core of our proposal is to first linearize insignificant ReLU neurons, to eliminate the non-linear components that are both redundant for DNN performance and harmful to its certification. We then optimize the associated slopes and intercepts of the replaced linear activations for restoring model performance while maintaining certifiability. Hence, typical neuron pruning can be viewed as a special case of grafting a linear function with fixed zero slope and intercept, which might overly restrict the network flexibility and sacrifice its performance. Extensive experiments on multiple datasets and network backbones show that our linearity grafting can (1) effectively tighten certified bounds; (2) achieve competitive certifiable robustness without certified robust training (i.e., over 30% improvements on CIFAR-10 models); and (3) scale up complete verification to large adversarially trained models with 17M parameters. Codes are available at https://github.com/VITA-Group/Linearity-Grafting.
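    A minimal sketch of the grafting idea (our illustration, not the authors' code): a masked subset of ReLU neurons is replaced by a learnable linear function a*x + b, whose slope and intercept can then be optimized to recover accuracy while keeping those units certification-friendly.

```python
import torch
import torch.nn as nn

class GraftedActivation(nn.Module):
    """ReLU layer in which a chosen subset of neurons is replaced by a
    learnable linear function a*x + b (a sketch of the grafting idea,
    not the paper's implementation). mask[i] = 1 means neuron i is grafted."""
    def __init__(self, num_features, linear_mask):
        super().__init__()
        self.register_buffer("mask", linear_mask.float())
        self.a = nn.Parameter(torch.ones(num_features))   # slopes
        self.b = nn.Parameter(torch.zeros(num_features))  # intercepts
    def forward(self, x):
        # Grafted neurons are exactly linear; the rest keep the ReLU.
        # Note a = b = 0 on a grafted neuron recovers ordinary pruning.
        return self.mask * (self.a * x + self.b) + (1 - self.mask) * torch.relu(x)

act = GraftedActivation(4, torch.tensor([1, 0, 0, 1]))
print(act(torch.randn(2, 4)))
```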
    Patch-level Representation Learning for Self-supervised Vision Transformers. (arXiv:2206.07990v1 [cs.CV])
    Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
    Neural Scene Representation for Locomotion on Structured Terrain. (arXiv:2206.08077v1 [cs.RO])
    We propose a learning-based method to reconstruct the local terrain for locomotion with a mobile robot traversing urban environments. Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the algorithm estimates the topography in the robot's vicinity. The raw measurements from these cameras are noisy and only provide partial and occluded observations that in many cases do not show the terrain the robot stands on. Therefore, we propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement. The model consists of a 4D fully convolutional network on point clouds that learns the geometric priors to complete the scene from the context and an auto-regressive feedback to leverage spatio-temporal consistency and use evidence from the past. The network can be solely trained with synthetic data, and due to extensive augmentation, it is robust in the real world, as shown in the validation on a quadrupedal robot, ANYmal, traversing challenging settings. We run the pipeline on the robot's onboard low-power computer using an efficient sparse tensor implementation and show that the proposed method outperforms classical map representations.
    All the World's a (Hyper)Graph: A Data Drama. (arXiv:2206.08225v1 [cs.LG])
    We introduce Hyperbard, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis, highlighting the advantages and drawbacks of specific representations. Leveraging the data released in Hyperbard, we demonstrate that many solutions to popular graph mining problems are highly dependent on the representation choice, thus calling current graph curation practices into question. As an homage to our data source, and asserting that science can also be art, we present all our points in the form of a play.
    Adversarial Privacy Protection on Speech Enhancement. (arXiv:2206.08170v1 [cs.SD])
    Speech can easily be leaked imperceptibly, for example by being recorded by mobile phones in different situations, and private content in speech may then be maliciously extracted through speech enhancement technology. Speech enhancement technology has developed rapidly along with deep neural networks (DNNs), but adversarial examples can cause DNNs to fail. In this work, we propose an adversarial method to degrade speech enhancement systems. Experimental results show that the generated adversarial examples can erase most content information in the original examples, or replace it with target speech content, through speech enhancement. The word error rate (WER) between the recognition results of an enhanced original example and the corresponding enhanced adversarial example can reach 89.0%, while for the targeted attack the WER between the enhanced adversarial example and the target example is as low as 33.75%. The adversarial perturbation can bring the rate of change of the original example to more than 1.4430. This work can help prevent the malicious extraction of speech.
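    For context, the WER figures above are normalized edit distances over word tokens; a minimal reference implementation is sketched below (our illustration of the standard metric, not the paper's evaluation code).

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word tokens (substitutions,
    deletions, insertions), normalized by the reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("turn the lights off", "turn all lights off"))  # 0.25
```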
    Learning Interpretable Representations of Entanglement in Quantum Optics Experiments using Deep Generative Models. (arXiv:2109.02490v2 [cs.LG] UPDATED)
    Quantum physics experiments produce interesting phenomena such as interference or entanglement, which are core properties of numerous future quantum technologies. The complex relationship between the setup structure of a quantum experiment and its entanglement properties is essential to fundamental research in quantum optics but is difficult to intuitively understand. We present a deep generative model of quantum optics experiments where a variational autoencoder is trained on a dataset of quantum optics experimental setups. In a series of computational experiments, we investigate the learned representation of our Quantum Optics Variational Auto Encoder (QOVAE) and its internal understanding of the quantum optics world. We demonstrate that the QOVAE learns an interpretable representation of quantum optics experiments and the relationship between experiment structure and entanglement. We show the QOVAE is able to generate novel experiments for highly entangled quantum states with specific distributions that match its training data. The QOVAE can learn to generate specific entangled states and efficiently search the space of experiments that produce highly entangled quantum states. Importantly, we are able to interpret how the QOVAE structures its latent space, finding curious patterns that we can explain in terms of quantum physics. The results demonstrate how we can use and understand the internal representations of deep generative models in a complex scientific domain. The QOVAE and the insights from our investigations can be immediately applied to other physical systems.
    Multi-Agent Learning for Iterative Dominance Elimination: Formal Barriers and New Algorithms. (arXiv:2111.05486v2 [cs.GT] UPDATED)
    Dominated actions are natural (and perhaps the simplest possible) multi-agent generalizations of sub-optimal actions, as in standard single-agent decision making. Thus, similar to standard bandit learning, a basic learning question in multi-agent systems is whether agents can learn to efficiently eliminate all dominated actions in an unknown game if they can only observe noisy bandit feedback about the payoff of their played actions. Surprisingly, despite a seemingly simple task, we show a quite negative result: standard no-regret algorithms -- including the entire family of Dual Averaging algorithms -- provably take exponentially many rounds to eliminate all dominated actions. Moreover, algorithms with the stronger no-swap-regret guarantee suffer a similar exponential inefficiency. To overcome these barriers, we develop a new algorithm that adjusts Exp3 with Diminishing Historical rewards (termed Exp3-DH); Exp3-DH gradually forgets history at carefully tailored rates. We prove that when all agents run Exp3-DH (a.k.a. self-play in multi-agent learning), all dominated actions can be iteratively eliminated within polynomially many rounds. Our experimental results further demonstrate the efficiency of Exp3-DH, and that state-of-the-art bandit algorithms, even those developed specifically for learning in games, fail to eliminate all dominated actions efficiently.
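    A minimal sketch of what "Exp3 with diminishing historical rewards" can look like: standard Exp3 with importance-weighted reward estimates, plus an exponential discount on the accumulated history. The callback `pull(arm)` and the fixed discount `rho` are our own illustrative assumptions; the paper's Exp3-DH uses carefully tailored forgetting rates rather than a single constant.

```python
import numpy as np

def exp3_dh(pull, K, T, gamma=0.1, rho=0.99, rng=np.random.default_rng(0)):
    """Exp3 with exponentially discounted historical reward estimates.
    pull(arm) is an assumed environment callback returning a reward in [0, 1]."""
    S = np.zeros(K)                              # discounted reward estimates
    p = np.full(K, 1.0 / K)
    for _ in range(T):
        w = np.exp(gamma * (S - S.max()) / K)    # max-shift for numerical stability
        p = (1 - gamma) * w / w.sum() + gamma / K
        arm = rng.choice(K, p=p)
        S *= rho                                 # gradually forget history
        S[arm] += pull(arm) / p[arm]             # unbiased importance weighting
    return p

# Toy usage with Bernoulli arms (not a game-theoretic setting)
means = [0.2, 0.5, 0.8]
rng = np.random.default_rng(1)
print(exp3_dh(lambda a: float(rng.random() < means[a]), K=3, T=5000))
# most probability mass ends up on the highest-mean arm
```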
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v6 [cs.LG] UPDATED)
    In bandits with distribution shifts, one aims to automatically adapt to unknown changes in the reward distribution and to restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best-arm switches, but also large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best-arm switches, while at the same time always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    When to intervene? Prescriptive Process Monitoring Under Uncertainty and Resource Constraints. (arXiv:2206.07745v1 [cs.AI])
    Prescriptive process monitoring approaches leverage historical data to prescribe runtime interventions that will likely prevent negative case outcomes or improve a process's performance. A centerpiece of a prescriptive process monitoring method is its intervention policy: a decision function determining if and when to trigger an intervention on an ongoing case. Previous proposals in this field rely on intervention policies that consider only the current state of a given case. These approaches do not consider the tradeoff between triggering an intervention in the current state, given the level of uncertainty of the underlying predictive models, versus delaying the intervention to a later state. Moreover, they assume that a resource is always available to perform an intervention (infinite capacity). This paper addresses these gaps by introducing a prescriptive process monitoring method that filters and ranks ongoing cases based on prediction scores, prediction uncertainty, and causal effect of the intervention, and triggers interventions to maximize a gain function, considering the available resources. The proposal is evaluated using a real-life event log. The results show that the proposed method outperforms existing baselines regarding total gain.
    Optimization-Derived Learning with Essential Convergence Analysis of Training and Hyper-training. (arXiv:2206.07875v1 [cs.LG])
    Recently, Optimization-Derived Learning (ODL) has attracted attention from the learning and vision communities, as it designs learning models from the perspective of optimization. However, previous ODL approaches regard the training and hyper-training procedures as two separate stages, meaning that the hyper-training variables have to be fixed during the training process, which also makes it impossible to simultaneously obtain the convergence of training and hyper-training variables. In this work, we design a Generalized Krasnoselskii-Mann (GKM) scheme based on fixed-point iterations as our fundamental ODL module, which unifies existing ODL methods as special cases. Under the GKM scheme, a Bilevel Meta Optimization (BMO) algorithmic framework is constructed to solve for the optimal training and hyper-training variables together. We rigorously prove the essential joint convergence of the fixed-point iteration for training and the process of optimizing hyper-parameters for hyper-training, both in terms of approximation quality and stationary analysis. Experiments demonstrate the efficiency of BMO, with competitive performance on sparse coding and real-world applications such as image deconvolution and rain streak removal.
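    For readers unfamiliar with the building block, the classical Krasnoselskii-Mann scheme that GKM generalizes simply averages the current iterate with its image under a fixed-point operator; a minimal sketch:

```python
import numpy as np

def krasnoselskii_mann(T, x0, alpha=0.5, iters=100):
    """Krasnoselskii-Mann iteration x <- (1 - alpha) x + alpha T(x).
    Converges to a fixed point of T when T is nonexpansive and
    alpha lies in (0, 1); shown here only as the classical scheme
    the GKM module builds on."""
    x = x0
    for _ in range(iters):
        x = (1 - alpha) * x + alpha * T(x)
    return x

# Toy usage: T(x) = 0.5 x + 1 has the unique fixed point x* = 2
print(krasnoselskii_mann(lambda x: 0.5 * x + 1.0, np.array(0.0)))  # ~2.0
```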
    Challenges and Opportunities in Deep Reinforcement Learning with Graph Neural Networks: A Comprehensive Review of Algorithms and Applications. (arXiv:2206.07922v1 [cs.LG])
    Deep reinforcement learning (DRL) has empowered a variety of artificial intelligence fields, including pattern recognition, robotics, recommender systems, and gaming. Similarly, graph neural networks (GNNs) have demonstrated superior performance in supervised learning on graph-structured data. In recent times, the fusion of GNNs with DRL for graph-structured environments has attracted a lot of attention. This paper provides a comprehensive review of these hybrid works. These works can be classified into two categories: (1) algorithmic enhancement, where DRL and GNN complement each other for better utility; and (2) application-specific enhancement, where DRL and GNN support each other within a particular application. This fusion effectively addresses various complex problems in engineering and the life sciences. Based on the review, we further analyze the applicability and benefits of fusing these two domains, especially in terms of increasing generalizability and reducing computational complexity. Finally, the key challenges in integrating DRL and GNNs, along with potential future research directions, are highlighted, which will be of interest to the broader machine learning community.
    Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems. (arXiv:2206.07808v1 [cs.CL])
    We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
    Is Continual Learning Truly Learning Representations Continually? (arXiv:2206.08101v1 [cs.LG])
    Continual learning (CL) aims to learn from sequentially arriving tasks without forgetting previous tasks. Whereas CL algorithms have tried to achieve higher average test accuracy across all the tasks learned so far, learning continuously useful representations is critical for successful generalization and downstream transfer. To measure representational quality, we re-train only the output layers using a small balanced dataset for all the tasks, evaluating the average accuracy without any biased predictions toward the current task. We also test on several downstream tasks, measuring the transfer-learning accuracy of the learned representations. By testing our new formalism on ImageNet-100 and ImageNet-1000, we find that using more exemplar memory is the only option that makes a meaningful difference in learned representations, and that most of the regularization- or distillation-based CL algorithms that use exemplar memory fail to learn continuously useful representations in class-incremental learning. Surprisingly, unsupervised (or self-supervised) CL with sufficient memory size can achieve performance comparable to the supervised counterparts. Considering non-trivial labeling costs, we claim that finding more efficient unsupervised CL algorithms that minimally use exemplar memory would be the next promising direction for CL research.
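    A generic sketch of the representation-quality protocol described above, i.e. freezing the backbone and retraining only the output layer on a small balanced dataset. Function names and hyperparameters are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=1e-2):
    """Freeze the backbone and retrain only a linear output layer on a small
    balanced dataset, then use the probe's accuracy as a proxy for
    representational quality. backbone(x) is assumed to return
    (batch, feat_dim) features."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                z = backbone(x)            # frozen features
            loss = loss_fn(head(z), y)     # only the head is trained
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```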
    Process, Bias and Temperature Scalable CMOS Analog Computing Circuits for Machine Learning. (arXiv:2205.05664v2 [cs.AR] UPDATED)
    Analog computing is attractive compared to digital computing due to its potential for achieving higher computational density and higher energy efficiency. However, unlike digital circuits, conventional analog computing circuits cannot be easily mapped across different process nodes due to differences in transistor biasing regimes, temperature variations and limited dynamic range. In this work, we generalize the previously reported margin-propagation-based analog computing framework for designing novel shape-based analog computing (S-AC) circuits that can be easily cross-mapped across different process nodes. Similar to digital designs, S-AC designs can also be scaled for precision, speed, and power. As a proof-of-concept, we show several examples of S-AC circuits implementing mathematical functions that are commonly used in machine learning (ML) architectures. Using circuit simulations we demonstrate that the circuit input/output characteristics remain robust when mapped from a planar CMOS 180nm process to a FinFET 7nm process. Also, using benchmark datasets we demonstrate that the classification accuracy of an S-AC based neural network remains robust when mapped across the two processes and to changes in temperature.
    Evaluating Short-Term Forecasting of Multiple Time Series in IoT Environments. (arXiv:2206.07784v1 [cs.LG])
    Modern Internet of Things (IoT) environments are monitored via a large number of IoT-enabled sensing devices, with the data acquisition and processing infrastructure setting restrictions in terms of computational power and energy resources. To alleviate this issue, sensors are often configured to operate at relatively low sampling frequencies, yielding a reduced set of observations. Nevertheless, this can dramatically hamper subsequent decision-making, such as forecasting. To address this problem, in this work we evaluate short-term forecasting in highly underdetermined cases, i.e., where the number of sensor streams is much higher than the number of observations. Several statistical, machine learning and neural network-based models are thoroughly examined with respect to the resulting forecasting accuracy on five different real-world datasets. The focus is on a unified experimental protocol especially designed for short-term prediction of multiple time series at the IoT edge. The proposed framework can be considered an important step towards establishing a solid forecasting strategy in resource-constrained IoT applications.
    On Calibrated Model Uncertainty in Deep Learning. (arXiv:2206.07795v1 [cs.LG])
    Uncertainty estimates obtained from approximate posteriors in Bayesian neural networks are prone to miscalibration, which leads to overconfident predictions in critical tasks that have clear asymmetric costs or significant losses. Here, we extend the approximate inference of the loss-calibrated Bayesian framework to dropweights-based Bayesian neural networks by maximising expected utility over a model posterior to calibrate uncertainty in deep learning. Furthermore, we show that decisions informed by loss-calibrated uncertainty can improve diagnostic performance to a greater extent than straightforward alternatives. We propose Maximum Uncertainty Calibration Error (MUCE) as a metric to measure calibrated confidence alongside prediction, especially for high-risk applications, where the goal is to minimise the worst-case deviation between error and estimated uncertainty. In experiments, we show the correlation between the prediction error and estimated uncertainty by interpreting the Wasserstein distance as the accuracy of prediction. We evaluated the effectiveness of our approach on detecting Covid-19 from X-ray images. Experimental results show that our method reduces miscalibration considerably without impacting the model's accuracy, and improves the reliability of computer-based diagnostics.
    The Scattering Transform Network with Generalized Morse Wavelets and Its Application to Music Genre Classification. (arXiv:2206.07857v1 [eess.AS])
    We propose to use the Generalized Morse Wavelets (GMWs) instead of commonly-used Morlet (or Gabor) wavelets in the Scattering Transform Network (STN), which we call the GMW-STN, for signal classification problems. The GMWs form a parameterized family of truly analytic wavelets while the Morlet wavelets are only approximately analytic. The analyticity of underlying wavelet filters in the STN is particularly important for nonstationary oscillatory signals such as music signals because it improves interpretability of the STN representations by providing multiscale amplitude and phase (and consequently frequency) information of input signals. We demonstrate the superiority of the GMW-STN over the conventional STN in music genre classification using the so-called GTZAN database. Moreover, we show the performance improvement of the GMW-STN by increasing its number of layers to three over the typical two-layer STN.
    Evaluating Self-Supervised Learning for Molecular Graph Embeddings. (arXiv:2206.08005v1 [cs.LG])
    Graph Self-Supervised Learning (GSSL) paves the way for learning graph embeddings without expert annotation, which is particularly impactful for molecular graphs since the number of possible molecules is enormous and labels are expensive to obtain. However, by design, GSSL methods are not trained to perform well on one downstream task but aim for transferability to many, making evaluating them less straightforward. As a step toward obtaining profiles of molecular graph embeddings with diverse and interpretable attributes, we introduce Molecular Graph Representation Evaluation (MolGraphEval), a suite of probe tasks, categorised into (i) topological-, (ii) substructure-, and (iii) embedding space properties. By benchmarking existing GSSL methods on both existing downstream datasets and MolGraphEval, we discover surprising discrepancies between conclusions drawn from existing datasets alone versus more fine-grained probing, suggesting that current evaluation protocols do not provide the whole picture. Our modular, automated end-to-end GSSL pipeline code will be released upon acceptance, including standardised graph loading, experiment management, and embedding evaluation.
    Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them. (arXiv:2107.11630v2 [cs.LG] UPDATED)
    Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance $\epsilon/2$. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
    Hybrid full-field thermal characterization of additive manufacturing processes using physics-informed neural networks with data. (arXiv:2206.07756v1 [cs.LG])
    Understanding the thermal behavior of additive manufacturing (AM) processes is crucial for enhancing quality control and enabling customized process design. Most purely physics-based computational models suffer from intensive computational costs and are thus not suitable for online control and iterative design applications. Data-driven models taking advantage of the latest computational tools can serve as a more efficient surrogate, but they are usually trained over a large amount of simulation data and often fail to effectively use small but high-quality experimental data. In this work, we developed a hybrid physics-based, data-driven thermal modeling approach for AM processes using physics-informed neural networks. Specifically, partially observed temperature data measured from an infrared camera is combined with physical laws to predict the full-field temperature history and to discover unknown material and process parameters. In numerical and experimental examples, we demonstrate the effectiveness of adding auxiliary training data and of using transfer learning to improve training efficiency and prediction accuracy, as well as the ability to identify unknown parameters from partially observed data. The results show that the hybrid thermal model can effectively identify unknown parameters and capture the full-field temperature accurately, and thus it has the potential to be used in iterative process design and real-time process control of AM.
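    A minimal sketch of the hybrid loss idea, assuming for illustration that the governing physics is a 1D heat equation T_t = alpha * T_xx (the paper's AM model is richer) and that `net` maps (x, t) pairs to temperature. This is our own simplification, not the authors' implementation.

```python
import torch

def pinn_loss(net, xt_data, T_data, xt_colloc, alpha):
    """Hybrid physics-informed loss: misfit on partially observed temperatures
    plus the residual of T_t - alpha * T_xx at collocation points. Columns of
    the inputs are assumed to be (x, t); alpha may itself be a learnable
    parameter when discovering process constants."""
    # Data term on the (partially) observed temperatures
    data_loss = ((net(xt_data) - T_data) ** 2).mean()

    # Physics term: differentiate the network output w.r.t. its inputs
    xt = xt_colloc.clone().requires_grad_(True)
    T = net(xt)
    grads = torch.autograd.grad(T.sum(), xt, create_graph=True)[0]
    T_x, T_t = grads[:, 0], grads[:, 1]
    T_xx = torch.autograd.grad(T_x.sum(), xt, create_graph=True)[0][:, 0]
    physics_loss = ((T_t - alpha * T_xx) ** 2).mean()

    return data_loss + physics_loss
```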
    Architectural Backdoors in Neural Networks. (arXiv:2206.07840v1 [cs.LG])
    Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data and data sampling procedures to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of training settings.
    Equivariant Diffusion for Molecule Generation in 3D. (arXiv:2203.17003v2 [cs.LG] UPDATED)
    This work introduces a diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Our E(3) Equivariant Diffusion Model (EDM) learns to denoise a diffusion process with an equivariant network that jointly operates on both continuous (atom coordinates) and categorical features (atom types). In addition, we provide a probabilistic analysis which admits likelihood computation of molecules using our model. Experimentally, the proposed method significantly outperforms previous 3D molecular generative methods regarding the quality of generated samples and efficiency at training time.
    Federated Data Analytics: A Study on Linear Models. (arXiv:2206.07786v1 [stat.AP])
    As edge devices become increasingly powerful, data analytics are gradually moving from a centralized to a decentralized regime where edge compute resources are exploited to process more of the data locally. This regime of analytics is coined as federated data analytics (FDA). In spite of the recent success stories of FDA, most literature focuses exclusively on deep neural networks. In this work, we take a step back to develop an FDA treatment for one of the most fundamental statistical models: linear regression. Our treatment is built upon hierarchical modeling that allows borrowing strength across multiple groups. To this end, we propose two federated hierarchical model structures that provide a shared representation across devices to facilitate information sharing. Notably, our proposed frameworks are capable of providing uncertainty quantification, variable selection, hypothesis testing and fast adaptation to new unseen data. We validate our methods on a range of real-life applications including condition monitoring for aircraft engines. The results show that our FDA treatment for linear models can serve as a competing benchmark model for future development of federated algorithms.
    Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization. (arXiv:2206.07837v1 [cs.LG])
    Real-world data collected from multiple domains can have multiple, distinct distribution shifts over multiple attributes. However, state-of-the-art advances in domain generalization (DG) algorithms focus only on specific shifts over a single attribute. We introduce datasets with multi-attribute distribution shifts and find that existing DG algorithms fail to generalize. To explain this, we use causal graphs to characterize the different types of shifts based on the relationship between spurious attributes and the classification label. Each multi-attribute causal graph entails different constraints over observed variables, and therefore any algorithm based on a single, fixed independence constraint cannot work well across all shifts. We present Causally Adaptive Constraint Minimization (CACM), a new algorithm for identifying the correct independence constraints for regularization. Results on fully synthetic, MNIST and small NORB datasets, covering binary and multi-valued attributes and labels, confirm our theoretical claim: correct independence constraints lead to the highest accuracy on unseen domains whereas incorrect constraints fail to do so. Our results demonstrate the importance of modeling the causal relationships inherent in the data-generating process: in many cases, it is impossible to know the correct regularization constraints without this information.
    Differentially Private Multi-Party Data Release for Linear Regression. (arXiv:2206.07998v1 [cs.CR])
    Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects. However, the majority of prior work has focused on scenarios where a single party owns all the data. In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects. Within the context of linear regression, where all parties should be able to train models on the complete data without inferring private attributes or identities of individuals, we start by directly applying the Gaussian mechanism and show that it suffers from a small-eigenvalue problem. We then propose a novel method and prove that it asymptotically converges to the optimal (non-private) solution as the dataset size increases. We substantiate the theoretical results through experiments on both artificial and real-world datasets.
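    To make the baseline concrete, the sketch below applies the Gaussian mechanism directly to the sufficient statistics of least squares. It is our illustration, not the paper's protocol: noise calibration to the statistics' sensitivity is omitted, and the noisy Gram matrix shows where the small-eigenvalue problem comes from.

```python
import numpy as np

def dp_linear_regression(X, y, sigma, rng=np.random.default_rng(0)):
    """Release noisy sufficient statistics X^T X and X^T y via the Gaussian
    mechanism, then solve the normal equations. Illustrative sketch only:
    sigma must in practice be calibrated to the sensitivity of the
    statistics for a given (epsilon, delta) target."""
    d = X.shape[1]
    N = rng.normal(0.0, sigma, (d, d))
    A = X.T @ X + (N + N.T) / 2          # symmetric noise on the Gram matrix
    b = X.T @ y + rng.normal(0.0, sigma, d)
    # When X^T X has small eigenvalues, the added noise can dominate them
    # and make this solve wildly inaccurate -- the "small eigenvalue problem".
    return np.linalg.solve(A, b)
```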
    TransDrift: Modeling Word-Embedding Drift using Transformer. (arXiv:2206.08081v1 [cs.CL])
    In modern NLP applications, word embeddings are a crucial backbone that can be readily shared across a number of tasks. However, as text distributions change and word semantics evolve over time, the downstream applications using the embeddings can suffer if the word representations do not conform to the data drift. Thus, keeping word embeddings consistent with the underlying data distribution is a key problem. In this work, we tackle this problem and propose TransDrift, a transformer-based prediction model for word embeddings. Leveraging the flexibility of the transformer, our model accurately learns the dynamics of the embedding drift and predicts the future embedding. In experiments, we compare with existing methods and show that our model makes significantly more accurate predictions of word embeddings than the baselines. Crucially, by applying the predicted embeddings as a backbone for downstream classification tasks, we show that our embeddings lead to superior performance compared to the previous methods.
    Towards Understanding How Machines Can Learn Causal Overhypotheses. (arXiv:2206.08353v1 [cs.LG])
    Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the ``blicket detector'' environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.
    Meta-Learning Dynamics Forecasting Using Task Inference. (arXiv:2102.10271v4 [cs.LG] UPDATED)
    Current deep learning models for dynamics forecasting struggle with generalization. They can only forecast in a specific domain and fail when applied to systems with different parameters, external forces, or boundary conditions. We propose a model-based meta-learning method called DyAd which can generalize across heterogeneous domains by partitioning them into different tasks. DyAd has two parts: an encoder which infers the time-invariant hidden features of the task with weak supervision, and a forecaster which learns the shared dynamics of the entire domain. The encoder adapts and controls the forecaster during inference using adaptive instance normalization and adaptive padding. Theoretically, we prove that the generalization error of such procedure is related to the task relatedness in the source domain, as well as the domain differences between source and target. Experimentally, we demonstrate that our model outperforms state-of-the-art approaches on both turbulent flow and real-world ocean data forecasting tasks.
    Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets. (arXiv:2107.00472v2 [math.OC] UPDATED)
    In this paper, we propose approximate Frank-Wolfe (FW) algorithms to solve convex optimization problems over graph-structured support sets where the linear minimization oracle (LMO) cannot be efficiently obtained in general. We first demonstrate that two popular approximation assumptions (additive and multiplicative gap errors) are not valid for our problem, in that no cheap gap-approximate LMO oracle exists in general. Instead, a new approximate dual maximization oracle (DMO) is proposed, which approximates the inner product rather than the gap. When the objective is $L$-smooth, we prove that the standard FW method using a $\delta$-approximate DMO converges as $\mathcal{O}(L / \delta t + (1-\delta)(\delta^{-1} + \delta^{-2}))$ in general, and as $\mathcal{O}(L/(\delta^2(t+2)))$ over a $\delta$-relaxation of the constraint set. Additionally, when the objective is $\mu$-strongly convex and the solution is unique, a variant of FW converges as $\mathcal{O}(L^2\log(t)/(\mu \delta^6 t^2))$ with the same per-iteration complexity. Our empirical results suggest that even these improved bounds are pessimistic, showing significant improvement in recovering real-world images with graph-structured sparsity.
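    For orientation, the standard FW step with an exact LMO is cheap on simple sets like the $\ell_1$ ball; the paper's point is that for graph-structured supports this oracle becomes intractable and must be replaced by the approximate DMO. A generic sketch of the exact-LMO case (our illustration, not the paper's algorithm):

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius, steps=200):
    """Standard Frank-Wolfe over an l1 ball of the given radius, where the
    LMO has a closed form: put all mass on the coordinate with the largest
    gradient magnitude, with opposite sign."""
    x = x0.copy()
    for t in range(steps):
        g = grad_f(x)
        s = np.zeros_like(x)
        i = np.argmax(np.abs(g))
        s[i] = -radius * np.sign(g[i])   # LMO: argmin_{||s||_1 <= radius} <g, s>
        gamma = 2.0 / (t + 2)            # classical step-size schedule
        x = (1 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

# Toy usage: minimize ||x - c||^2 / 2 over the l1 ball of radius 1
c = np.array([2.0, -0.5, 0.1])
print(frank_wolfe_l1(lambda x: x - c, np.zeros(3), radius=1.0))
```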
    Closed-Form Diffeomorphic Transformations for Time Series Alignment. (arXiv:2206.08107v1 [cs.LG])
    Time series alignment methods call for highly expressive, differentiable and invertible warping functions which preserve temporal topology, i.e., diffeomorphisms. Diffeomorphic warping functions can be generated from the integration of velocity fields governed by an ordinary differential equation (ODE). Gradient-based optimization frameworks containing diffeomorphic transformations require calculating derivatives of the differential equation's solution with respect to the model parameters, i.e., sensitivity analysis. Unfortunately, deep learning frameworks typically lack automatic-differentiation-compatible sensitivity analysis methods, and implicit functions, such as the solution of an ODE, require particular care. Current solutions appeal to adjoint sensitivity methods, ad-hoc numerical solvers or ResNet's Eulerian discretization. In this work, we present a closed-form expression for the ODE solution and its gradient under continuous piecewise-affine (CPA) velocity functions. We present a highly optimized implementation of the results on CPU and GPU. Furthermore, we conduct extensive experiments on several datasets to validate the generalization ability of our model to unseen data for time-series joint alignment. Results show significant improvements both in terms of efficiency and accuracy.
    Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. (arXiv:2004.10240v2 [cs.LG] UPDATED)
    Deep learning based forecasting methods have become the methods of choice in many applications of time series prediction or forecasting often outperforming other approaches. Consequently, over the last years, these methods are now ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased the academic interest to understand and improve deep forecasting methods. In this article we provide an introduction and overview of the field: We present important building blocks for deep forecasting in some depth; using these building blocks, we then survey the breadth of the recent deep forecasting literature.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v1 [cs.LG])
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level privacy are less suitable as real-world privacy regulations typically concern in-silo data subjects rather than the silos themselves. In this work, we instead consider the more realistic notion of silo-specific item-level privacy, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy, silos are further incentivized to "federate" with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide a thorough empirical study of competing methods as well as a theoretical characterization of MR-MTL for a mean estimation problem, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
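    Mean-regularized MTL admits a one-line description: each silo $k$ minimizes $F_k(w_k) + \frac{\lambda}{2}\lVert w_k - \bar{w}\rVert^2$, where $\bar{w}$ is the mean of all silo models. A minimal sketch of one (non-private) round, with the paper's DP noise omitted and all names our own:

```python
import numpy as np

def mr_mtl_round(silo_grads, W, lam, lr):
    """One round of mean-regularized multi-task learning. W has shape
    (num_silos, dim); silo_grads[k](w) returns the gradient of silo k's
    local objective F_k at w. The lam * (w_k - w_bar) term pulls each
    local model toward the current mean. DP noise is omitted here."""
    w_bar = W.mean(axis=0)
    for k in range(W.shape[0]):
        W[k] -= lr * (silo_grads[k](W[k]) + lam * (W[k] - w_bar))
    return W
```

    Note how lam interpolates between the two extremes the abstract alludes to: lam = 0 recovers purely local training, while a very large lam forces all silos toward a single shared model, as in standard federated averaging.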
    BlindFL: Vertical Federated Machine Learning without Peeking into Your Data. (arXiv:2206.07975v1 [cs.LG])
    Due to rising concerns about privacy protection, how to build machine learning (ML) models over different data sources with security guarantees is gaining popularity. Vertical federated learning (VFL) describes such a case, where ML models are built upon the private data of different participating parties that own disjoint features for the same set of instances, which fits many real-world collaborative tasks. Nevertheless, we find that existing solutions for VFL either support limited kinds of input features or suffer from potential data leakage during the federated execution. To this end, this paper investigates both the functionality and security of ML models in the VFL scenario. To be specific, we introduce BlindFL, a novel framework for VFL training and inference. First, to address the functionality of VFL models, we propose federated source layers to unite the data from different parties. Various kinds of features can be supported efficiently by the federated source layers, including dense, sparse, numerical, and categorical features. Second, we carefully analyze the security during the federated execution and formalize the privacy requirements. Based on the analysis, we devise secure and accurate algorithm protocols, and further prove the security guarantees under the ideal-real simulation paradigm. Extensive experiments show that BlindFL supports diverse datasets and models efficiently while achieving robust privacy guarantees.
    Balancing Discriminability and Transferability for Source-Free Domain Adaptation. (arXiv:2206.08009v1 [cs.CV])
    Conventional domain adaptation (DA) techniques aim to improve domain transferability by learning domain-invariant representations; while concurrently preserving the task-discriminability knowledge gathered from the labeled source data. However, the requirement of simultaneous access to labeled source and unlabeled target renders them unsuitable for the challenging source-free DA setting. The trivial solution of realizing an effective original to generic domain mapping improves transferability but degrades task discriminability. Upon analyzing the hurdles from both theoretical and empirical standpoints, we derive novel insights to show that a mixup between original and corresponding translated generic samples enhances the discriminability-transferability trade-off while duly respecting the privacy-oriented source-free setting. A simple but effective realization of the proposed insights on top of the existing source-free DA approaches yields state-of-the-art performance with faster convergence. Beyond single-source, we also outperform multi-source prior-arts across both classification and semantic segmentation benchmarks.
    Forming Effective Human-AI Teams: Building Machine Learning Models that Complement the Capabilities of Multiple Experts. (arXiv:2206.07948v1 [cs.AI])
    Machine learning (ML) models are increasingly being used in application domains that often involve working together with human experts. In this context, it can be advantageous to defer certain instances to a single human expert when they are difficult to predict for the ML model. While previous work has focused on scenarios with one distinct human expert, in many real-world situations several human experts with varying capabilities may be available. In this work, we propose an approach that trains a classification model to complement the capabilities of multiple human experts. By jointly training the classifier together with an allocation system, the classifier learns to accurately predict those instances that are difficult for the human experts, while the allocation system learns to pass each instance to the most suitable team member -- either the classifier or one of the human experts. We evaluate our proposed approach in multiple experiments on public datasets with "synthetic" experts and a real-world medical dataset annotated by multiple radiologists. Our approach outperforms prior work and is more accurate than the best human expert or a classifier. Furthermore, it is flexibly adaptable to teams of varying sizes and different levels of expert diversity.
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v2 [stat.ML] UPDATED)
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
    Hardness prediction of age-hardening aluminum alloy based on ensemble learning. (arXiv:2206.08011v1 [cond-mat.mtrl-sci])
    With the rapid development of artificial intelligence, the combination of material databases and machine learning has driven progress in materials informatics. Because aluminum alloys are widely used in many fields, predicting their properties is of significant practical value. In this work, data on Al-Cu-Mg-X (X: Zn, Zr, etc.) alloys are used: the composition and aging conditions (time and temperature) serve as inputs to predict hardness. We propose an ensemble learning solution based on automatic machine learning, together with an attention mechanism introduced into the secondary learner of a deep neural network. The experimental results show that selecting the correct secondary learner can further improve the prediction accuracy of the model, and that introducing the attention mechanism into the deep-neural-network secondary learner yields a fusion model with better performance. The best model achieves an R-squared of 0.9697 and an MAE of 3.4518 HV.
    U-PET: MRI-based Dementia Detection with Joint Generation of Synthetic FDG-PET Images. (arXiv:2206.08078v1 [eess.IV])
    Alzheimer's disease (AD) is the most common cause of dementia. Early detection is crucial for slowing down the disease and mitigating risks related to its progression. While the combination of MRI and FDG-PET is the best image-based tool for diagnosis, FDG-PET is not always available. Reliable detection of Alzheimer's disease with only MRI could be beneficial, especially in regions where FDG-PET might not be affordable for all patients. To this end, we propose a multi-task method based on U-Net that takes T1-weighted MR images as input to generate synthetic FDG-PET images and classifies the dementia progression of the patient as cognitively normal (CN), mild cognitive impairment (MCI), or AD. The attention gates used in both task heads can visualize the most relevant parts of the brain, guiding the examiner and adding interpretability. Results show the successful generation of synthetic FDG-PET images and a performance increase in disease classification over the naive single-task baseline.
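    Schematically, the model is a shared encoder with two task heads, one synthesizing a PET-like image and one classifying disease stage. A toy PyTorch stand-in (not the authors' U-PET, which uses a full U-Net with attention gates):

        import torch
        import torch.nn as nn

        class TinyUPET(nn.Module):
            def __init__(self, n_classes=3):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
                self.pet_head = nn.Sequential(              # synthetic FDG-PET image
                    nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
                    nn.ConvTranspose2d(16, 1, 2, stride=2))
                self.cls_head = nn.Sequential(              # CN / MCI / AD logits
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))
            def forward(self, mri):                         # mri: (B, 1, H, W)
                z = self.encoder(mri)
                return self.pet_head(z), self.cls_head(z)

        pet_pred, logits = TinyUPET()(torch.randn(4, 1, 64, 64))
        # Training would combine a PET reconstruction loss with a classification loss.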
    HyperImpute: Generalized Iterative Imputation with Automatic Model Selection. (arXiv:2206.07769v1 [stat.ML])
    Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
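    The iterative-imputation paradigm that HyperImpute generalizes is available off the shelf; the difference is that HyperImpute additionally selects each column's model and hyperparameters automatically. A scikit-learn sketch of the base paradigm:

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 4))
        X[rng.random(X.shape) < 0.2] = np.nan   # 20% of entries missing at random

        # Each column is imputed in turn from the others, iterating to convergence;
        # here one fixed estimator is used for every column.
        imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                                   max_iter=10, random_state=0)
        X_filled = imputer.fit_transform(X)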
    Let Invariant Rationale Discovery Inspire Graph Contrastive Learning. (arXiv:2206.07869v1 [cs.LG])
    Leading graph contrastive learning (GCL) methods perform graph augmentations in two fashions: (1) randomly corrupting the anchor graph, which could cause the loss of semantic information, or (2) using domain knowledge to maintain salient features, which undermines the generalization to other domains. Taking an invariance look at GCL, we argue that a high-performing augmentation should preserve the salient semantics of anchor graphs regarding instance-discrimination. To this end, we relate GCL with invariant rationale discovery, and propose a new framework, Rationale-aware Graph Contrastive Learning (RGCL). Specifically, without supervision signals, RGCL uses a rationale generator to reveal salient features about graph instance-discrimination as the rationale, and then creates rationale-aware views for contrastive learning. This rationale-aware pre-training scheme endows the backbone model with powerful representation ability, further facilitating fine-tuning on downstream tasks. On MNIST-Superpixel and MUTAG datasets, visual inspections on the discovered rationales showcase that the rationale generator successfully captures the salient features (i.e., distinguishing semantic nodes in graphs). On biochemical molecule and social network benchmark datasets, the state-of-the-art performance of RGCL demonstrates the effectiveness of rationale-aware views for contrastive learning. Our codes are available at https://github.com/lsh0520/RGCL.
    AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation. (arXiv:2206.08023v1 [eess.IV])
    Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate the limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.
    CARLANE: A Lane Detection Benchmark for Unsupervised Domain Adaptation from Simulation to multiple Real-World Domains. (arXiv:2206.08083v1 [cs.CV])
    Unsupervised Domain Adaptation demonstrates great potential to mitigate domain shifts by transferring models from labeled source domains to unlabeled target domains. While Unsupervised Domain Adaptation has been applied to a wide variety of complex vision tasks, only a few works focus on lane detection for autonomous driving. This can be attributed to the lack of publicly available datasets. To facilitate research in these directions, we propose CARLANE, a 3-way sim-to-real domain adaptation benchmark for 2D lane detection. CARLANE encompasses the single-target datasets MoLane and TuLane and the multi-target dataset MuLane. These datasets are built from three different domains, which cover diverse scenes and contain a total of 163K unique images, 118K of which are annotated. In addition, we evaluate and report systematic baselines, including our own method, which builds upon Prototypical Cross-domain Self-supervised Learning. We find that false positive and false negative rates of the evaluated domain adaptation methods are high compared to those of fully supervised baselines. This affirms the need for benchmarks such as CARLANE to further strengthen research in Unsupervised Domain Adaptation for lane detection. CARLANE, all evaluated models and the corresponding implementations are publicly available at https://carlanebenchmark.github.io.
    Feature Selection using e-values. (arXiv:2206.05391v2 [stat.ML] UPDATED)
    In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e., the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a $p$-dimensional feature space, this procedure requires fitting only the full model and evaluating $p+1$ models, as opposed to the traditional requirement of fitting and evaluating $2^p$ models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.
    Personalized Federated Learning via Variational Bayesian Inference. (arXiv:2206.07977v1 [cs.LG])
    Federated learning faces huge challenges from model overfitting due to the lack of data and statistical diversity among clients. To address these challenges, this paper proposes a novel personalized federated learning method via Bayesian variational inference named pFedBayes. To alleviate overfitting, weight uncertainty is introduced into the neural networks of clients and the server. To achieve personalization, each client updates its local distribution parameters by balancing its reconstruction error over private data against its KL divergence with the global distribution from the server. Theoretical analysis gives an upper bound on the averaged generalization error and shows that the convergence rate of the generalization error is minimax optimal up to a logarithmic factor. Experiments show that the proposed method outperforms other advanced personalized methods, e.g., pFedBayes outperforms other SOTA algorithms by 1.25%, 0.42% and 11.71% on MNIST, FMNIST and CIFAR-10, respectively, under non-i.i.d. limited data.
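    The client-side objective can be sketched in a few lines: a prediction loss on private data plus a KL term pulling the local Gaussian weight posterior toward the server's global distribution. Illustrative only; the names and the trade-off weight zeta are assumptions, not the pFedBayes code.

        import torch

        def client_objective(nll, mu_i, log_sig_i, mu_g, log_sig_g, zeta=1.0):
            # Closed-form KL( N(mu_i, sig_i^2) || N(mu_g, sig_g^2) ), summed over weights.
            kl = (log_sig_g - log_sig_i
                  + (log_sig_i.exp() ** 2 + (mu_i - mu_g) ** 2)
                    / (2 * log_sig_g.exp() ** 2) - 0.5).sum()
            return nll + zeta * kl   # zeta trades personalization against sharing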
    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v1 [cs.LG])
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a close variant of a recently proposed compression-based learning rule termed OptiNet. Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule -- the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.
    Reinforcement Learning-enhanced Shared-account Cross-domain Sequential Recommendation. (arXiv:2206.08088v1 [cs.IR])
    Shared-account Cross-domain Sequential Recommendation (SCSR) is an emerging yet challenging task that simultaneously considers the shared-account and cross-domain characteristics in the sequential recommendation. Existing works on SCSR are mainly based on Recurrent Neural Network (RNN) and Graph Neural Network (GNN) but they ignore the fact that although multiple users share a single account, it is mainly occupied by one user at a time. This observation motivates us to learn a more accurate user-specific account representation by attentively focusing on its recent behaviors. Furthermore, though existing works endow lower weights to irrelevant interactions, they may still dilute the domain information and impede the cross-domain recommendation. To address the above issues, we propose a reinforcement learning-based solution, namely RL-ISN, which consists of a basic cross-domain recommender and a reinforcement learning-based domain filter. Specifically, to model the account representation in the shared-account scenario, the basic recommender first clusters users' mixed behaviors as latent users, and then leverages an attention model over them to conduct user identification. To reduce the impact of irrelevant domain information, we formulate the domain filter as a hierarchical reinforcement learning task, where a high-level task is utilized to decide whether to revise the whole transferred sequence or not, and if it does, a low-level task is further performed to determine whether to remove each interaction within it or not. To evaluate the performance of our solution, we conduct extensive experiments on two real-world datasets, and the experimental results demonstrate the superiority of our RL-ISN method compared with the state-of-the-art recommendation methods.
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v2 [stat.ML] UPDATED)
    Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to the multi-task scenario defined in the same input domain, leaving no space for tackling the heterogeneous case, i.e., the features of input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (HSVLMC) for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem.
    Faculty Distillation with Optimal Transport. (arXiv:2204.11526v2 [cs.LG] UPDATED)
    The outpouring of various pre-trained models empowers knowledge distillation (KD) by providing abundant teacher resources. Meanwhile, exploring the massive model repository to select a suitable teacher and further extracting its knowledge become daunting challenges. Standard KD fails to surmount two obstacles when training a student with the availability of plentiful pre-trained teachers, i.e., the "faculty". First, we need to seek out the most contributive teacher in the faculty efficiently rather than enumerating all of them for a student. Second, since the teacher may be pre-trained on different tasks w.r.t. the student, we must distill the knowledge from a more general label space. This paper studies this "faculty distillation", where a student performs teacher assessment and generalized knowledge reuse. We take advantage of optimal transport to construct a unifying objective for both problems, which bridges the semantic gap and measures the relatedness between a pair of models. This objective can select the most relevant teacher, and we minimize the same objective over student parameters to transfer the knowledge from the selected teacher subsequently. Experiments in various settings demonstrate the succinctness and versatility of our proposed method.
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v1 [math.NA])
    Given a nonnegative matrix $R$ and a factorization rank $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. We provide a mathematically rigorous statement and proof of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. We prove it is stronger than the restricted DBU theorem, given that a proper preprocessing on the Exact NMF is used. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.
    Approximately Equivariant Networks for Imperfectly Symmetric Dynamics. (arXiv:2201.11969v4 [cs.LG] UPDATED)
    Incorporating symmetry as an inductive bias into neural network architecture has led to improvements in generalization, data efficiency, and physical consistency in dynamics modeling. Methods such as CNNs or equivariant neural networks use weight tying to enforce symmetries such as shift invariance or rotational equivariance. However, despite the fact that physical laws obey many symmetries, real-world dynamical data rarely conforms to strict mathematical symmetry either due to noisy or incomplete data or to symmetry breaking features in the underlying dynamical system. We explore approximately equivariant networks which are biased towards preserving symmetry but are not strictly constrained to do so. By relaxing equivariance constraints, we find that our models can outperform both baselines with no symmetry bias and baselines with overly strict symmetry in both simulated turbulence domains and real-world multi-stream jet flow.
    Performance analysis of coreset selection for quantum implementation of K-Means clustering algorithm. (arXiv:2206.07852v1 [quant-ph])
    Quantum computing is anticipated to offer immense computational capabilities that could provide efficient solutions to many data science problems. However, current-generation quantum devices are small and noisy, which makes it difficult to process the large data sets relevant to practical problems. Coreset selection aims to circumvent this problem by reducing the size of input data without compromising accuracy. Recent work has shown that coreset selection can help to implement the quantum K-Means clustering problem. However, the impact of coreset selection on the performance of quantum K-Means clustering has not been explored. In this work, we compare the relative performance of two coreset techniques (BFL16 and ONESHOT), and the size of coreset construction in each case, with respect to a variety of data sets, and lay out the advantages and limitations of coreset selection in implementing quantum algorithms. We also investigate the effect of depolarising quantum noise and bit-flip errors, and implement the Quantum AutoEncoder technique to mitigate the effect of noise. Our work provides useful insights for future implementation of data science algorithms on near-term quantum devices where problem size has been reduced by coreset selection.
    Domain Generalization via Selective Consistency Regularization for Time Series Classification. (arXiv:2206.07876v1 [cs.LG])
    Domain generalization methods aim to learn models robust to domain shift with data from a limited number of source domains and without access to target domain samples during training. Popular domain alignment methods for domain generalization seek to extract domain-invariant features by minimizing the discrepancy between feature distributions across all domains, disregarding inter-domain relationships. In this paper, we instead propose a novel representation learning methodology that selectively enforces prediction consistency between source domains estimated to be closely-related. Specifically, we hypothesize that domains share different class-informative representations, so instead of aligning all domains which can cause negative transfer, we only regularize the discrepancy between closely-related domains. We apply our method to time-series classification tasks and conduct comprehensive experiments on three public real-world datasets. Our method significantly improves over the baseline and achieves better or competitive performance in comparison with state-of-the-art methods in terms of both accuracy and model calibration.
    Distributed Online Learning Algorithm With Differential Privacy Strategy for Convex Nondecomposable Global Objectives. (arXiv:2206.07944v1 [math.OC])
    In this paper, we deal with a general distributed constrained online learning problem with privacy over time-varying networks, where a class of nondecomposable objective functions are considered. Under this setting, each node only controls a part of the global decision variable, and the goal of all nodes is to collaboratively minimize the global objective over a time horizon $T$ while guaranteeing the security of the transmitted information. For such problems, we first design a novel generic algorithm framework, named DPSDA, for differentially private distributed online learning using the Laplace mechanism and stochastic variants of the dual averaging method. Then, we propose two algorithms, named DPSDA-C and DPSDA-PS, under this framework. Theoretical results show that both algorithms attain an expected regret upper bound of $\mathcal{O}( \sqrt{T} )$ when the objective function is convex, which matches the best utility achievable by cutting-edge algorithms. Finally, numerical experiments on both real-world and randomly generated datasets verify the effectiveness of our algorithms.
    A Machine Learning-based Digital Twin for Electric Vehicle Battery Modeling. (arXiv:2206.08080v1 [cs.LG])
    The widespread adoption of Electric Vehicles (EVs) is limited by their reliance on batteries, which presently have low energy and power densities compared to liquid fuels and are subject to aging and performance deterioration over time. For this reason, monitoring the battery State Of Charge (SOC) and State Of Health (SOH) during the EV lifetime is a very relevant problem. This work proposes a battery digital twin structure designed to accurately reflect battery dynamics at run time. To ensure a high degree of correctness concerning non-linear phenomena, the digital twin relies on data-driven models trained on traces of battery evolution over time: a SOH model, repeatedly executed to estimate the degradation of maximum battery capacity, and a SOC model, retrained periodically to reflect the impact of aging. The proposed digital twin structure is exemplified on a public dataset to motivate its adoption and prove its effectiveness, with high accuracy and inference and retraining times compatible with onboard execution.
    An Intriguing Property of Geophysics Inversion. (arXiv:2204.13731v2 [cs.LG] UPDATED)
    Inversion techniques are widely used to reconstruct subsurface physical properties (e.g., velocity, conductivity) from surface-based geophysical measurements (e.g., seismic, electric/magnetic (EM) data). The problems are governed by partial differential equations (PDEs) like the wave or Maxwell's equations. Solving geophysical inversion problems is challenging due to the ill-posedness and high computational cost. To alleviate those issues, recent studies leverage deep neural networks to learn the inversion mappings from measurements to the property directly. In this paper, we show that such a mapping can be well modeled by a very shallow (but not wide) network with only five layers. This is achieved based on our new finding of an intriguing property: a near-linear relationship between the input and output, after applying integral transform in high dimensional space. In particular, when dealing with the inversion from seismic data to subsurface velocity governed by a wave equation, the integral results of velocity with Gaussian kernels are linearly correlated to the integral of seismic data with sine kernels. Furthermore, this property can be easily turned into a lightweight encoder-decoder network for inversion. The encoder contains the integration of seismic data and the linear transformation without the need for fine-tuning. The decoder only consists of a single transformer block to reverse the integral of velocity. Experiments show that this interesting property holds for two geophysics inversion problems over four different datasets. Compared to the much deeper InversionNet, our method achieves comparable accuracy, but consumes significantly fewer parameters.
    Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Ranking Prediction. (arXiv:2205.06296v4 [cs.IR] UPDATED)
    User reviews contain a significant quantity of information across online platforms. This information source has been neglected by most existing recommendation systems, despite its potential to ease the sparsity issue and enhance the quality of suggestions. This work presents a deep model for concurrently learning item attributes and user behaviour from review text. The suggested model, Deep Cooperative Neural Network (DeepCoNN), consists of two parallel neural networks connected in their final layers. One of the networks focuses on learning user behaviour from reviews submitted by the user, while the other learns item attributes from the reviews written for the item. On top, a shared layer is added to connect these two networks. Similar to factorization machine approaches, the shared layer allows the latent factors learned for users and items to interact with each other. Experimental findings show that DeepCoNN surpasses all baseline recommendation systems on a number of datasets.
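    The two parallel towers joined by a shared interaction layer can be sketched compactly. A toy PyTorch stand-in (DeepCoNN itself encodes review text with CNNs over word embeddings; the bag-of-words towers here are a simplification):

        import torch
        import torch.nn as nn

        class TwoTower(nn.Module):
            def __init__(self, vocab=5000, emb=64, k=32):
                super().__init__()
                def tower():
                    return nn.Sequential(nn.EmbeddingBag(vocab, emb), nn.ReLU(),
                                         nn.Linear(emb, k))
                self.user_net, self.item_net = tower(), tower()  # parallel networks
            def forward(self, user_reviews, item_reviews):       # token-id tensors
                u = self.user_net(user_reviews)   # latent user factors
                v = self.item_net(item_reviews)   # latent item factors
                return (u * v).sum(dim=1)         # FM-style shared interaction layer

        model = TwoTower()
        rating = model(torch.randint(0, 5000, (8, 100)),
                       torch.randint(0, 5000, (8, 100)))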
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v2 [cs.LG] UPDATED)
    Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA without extra data on CIFAR-10 of 81.4% under $(8, 10^{-5})$-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under $(0.5, 8\cdot10^{-7})$-DP. Additionally, we also achieve 86.7% top-1 accuracy under $(8, 8\cdot10^{-7})$-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
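    The mechanism being tuned is the standard DP-SGD step: clip each per-example gradient to norm C, aggregate, and add Gaussian noise calibrated to C. A NumPy sketch of one step (parameter values are illustrative):

        import numpy as np

        def dp_sgd_step(per_example_grads, C=1.0, noise_multiplier=1.1, rng=None):
            rng = rng or np.random.default_rng()
            clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12))
                       for g in per_example_grads]          # per-example clipping
            noise = rng.normal(0.0, noise_multiplier * C, size=clipped[0].shape)
            return (np.sum(clipped, axis=0) + noise) / len(per_example_grads)

        grads = [np.random.randn(10) for _ in range(32)]    # toy per-example grads
        noisy_mean_grad = dp_sgd_step(grads)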
    Queried Unlabeled Data Improves and Robustifies Class-Incremental Learning. (arXiv:2206.07842v1 [cs.LG])
    Class-incremental learning (CIL) suffers from the notorious dilemma between learning newly added classes and preserving previously learned class knowledge. That catastrophic forgetting issue could be mitigated by storing historical data for replay, which yet would cause memory overheads as well as imbalanced prediction updates. To address this dilemma, we propose to leverage "free" external unlabeled data querying in continual learning. We first present a CIL with Queried Unlabeled Data (CIL-QUD) scheme, where we only store a handful of past training samples as anchors and use them to query relevant unlabeled examples each time. Along with new and past stored data, the queried unlabeled data are effectively utilized through learning-without-forgetting (LwF) regularizers and class-balance training. Besides preserving model generalization over past and current tasks, we next study the problem of adversarial robustness for CIL-QUD. Inspired by the recent success of learning robust models with unlabeled data, we explore a new robustness-aware CIL setting, where the learned adversarial robustness has to resist forgetting and be transferred as new tasks come in continually. While existing options easily fail, we show queried unlabeled data can continue to benefit, and seamlessly extend CIL-QUD into its robustified versions, RCIL-QUD. Extensive experiments demonstrate that CIL-QUD achieves substantial accuracy gains on CIFAR-10 and CIFAR-100, compared to previous state-of-the-art CIL approaches. Moreover, RCIL-QUD establishes the first strong milestone for robustness-aware CIL. Codes are available at https://github.com/VITA-Group/CIL-QUD.
    Robust Attack Graph Generation. (arXiv:2206.07776v1 [cs.LG])
    We present a method to learn automaton models that are more robust to input modifications. It iteratively aligns sequences to a learned model, modifies the sequences to their aligned versions, and re-learns the model. Automaton learning algorithms are typically very good at modeling the frequent behavior of a software system. Our solution can be used to also learn the behavior present in infrequent sequences, as these will be aligned to the frequent ones represented by the model. We apply our method to the SAGE tool for modeling attacker behavior from intrusion alerts. In experiments, we demonstrate that our algorithm learns models that can handle noise such as added and removed symbols from sequences. Furthermore, it learns more concise models that fit better to the training data.
    DeepJSCC-Q: Constellation Constrained Deep Joint Source-Channel Coding. (arXiv:2206.08100v1 [eess.IV])
    Recent works have shown that modern machine learning techniques can provide an alternative approach to the long-standing joint source-channel coding (JSCC) problem. Very promising initial results, superior to popular digital schemes that utilize separate source and channel codes, have been demonstrated for wireless image and video transmission using deep neural networks (DNNs). However, end-to-end training of such schemes requires a differentiable channel input representation; hence, prior works have assumed that any complex value can be transmitted over the channel. This can prevent the application of these codes in scenarios where the hardware or protocol can only admit certain sets of channel inputs, prescribed by a digital constellation. Herein, we propose DeepJSCC-Q, an end-to-end optimized JSCC solution for wireless image transmission using a finite channel input alphabet. We show that DeepJSCC-Q can achieve similar performance to prior works that allow any complex valued channel input, especially when high modulation orders are available, and that the performance asymptotically approaches that of unconstrained channel input as the modulation order increases. Importantly, DeepJSCC-Q preserves the graceful degradation of image quality in unpredictable channel conditions, a desirable property for deployment in mobile systems with rapidly changing channel conditions.
    Lessons learned from the NeurIPS 2021 MetaDL challenge: Backbone fine-tuning without episodic meta-learning dominates for few-shot learning image classification. (arXiv:2206.08138v1 [cs.LG])
    Although deep neural networks are capable of achieving performance superior to humans on various tasks, they are notorious for requiring large amounts of data and computing resources, restricting their success to domains where such resources are available. Meta-learning methods can address this problem by transferring knowledge from related tasks, thus reducing the amount of data and computing resources needed to learn new tasks. We organize the MetaDL competition series, which provides opportunities for research groups all over the world to create and experimentally assess new meta-(deep)learning solutions for real problems. In this paper, authored collaboratively between the competition organizers and the top-ranked participants, we describe the design of the competition, the datasets, the best experimental results, as well as the top-ranked methods in the NeurIPS 2021 challenge, which attracted 15 active teams who made it to the final phase (by outperforming the baseline), making over 100 code submissions during the feedback phase. The solutions of the top participants have been open-sourced. The lessons learned include that learning good representations is essential for effective transfer learning.
    Feature Overcorrelation in Deep Graph Neural Networks: A New Perspective. (arXiv:2206.07743v1 [cs.LG])
    Recent years have witnessed remarkable success achieved by graph neural networks (GNNs) in many real-world applications such as recommendation and drug discovery. Despite the success, oversmoothing has been identified as one of the key issues which limit the performance of deep GNNs. It indicates that the learned node representations are highly indistinguishable due to the stacked aggregators. In this paper, we propose a new perspective to look at the performance degradation of deep GNNs, i.e., feature overcorrelation. Through empirical and theoretical study on this matter, we demonstrate the existence of feature overcorrelation in deeper GNNs and reveal potential reasons leading to this issue. To reduce the feature correlation, we propose a general framework DeCorr which can encourage GNNs to encode less redundant information. Extensive experiments have demonstrated that DeCorr can help enable deeper GNNs and is complementary to existing techniques tackling the oversmoothing issue.
    Pareto Invariant Risk Minimization. (arXiv:2206.07766v1 [cs.LG])
    Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise the optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning for the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
    Kantorovich Strikes Back! Wasserstein GANs are not Optimal Transport?. (arXiv:2206.07767v1 [cs.LG])
    Wasserstein Generative Adversarial Networks (WGANs) are popular generative models built on the theory of Optimal Transport (OT) and the Kantorovich duality. Despite the success of WGANs, it is still unclear how well the underlying OT dual solvers approximate the OT cost (Wasserstein-1 distance, $\mathbb{W}_{1}$) and the OT gradient needed to update the generator. In this paper, we address these questions. We construct 1-Lipschitz functions and use them to build ray monotone transport plans. This strategy yields pairs of continuous benchmark distributions with the analytically known OT plan, OT cost and OT gradient in high-dimensional spaces such as spaces of images. We thoroughly evaluate popular WGAN dual form solvers (gradient penalty, spectral normalization, entropic regularization, etc.) using these benchmark pairs. Even though these solvers perform well in WGANs, none of them faithfully compute $\mathbb{W}_{1}$ in high dimensions. Nevertheless, many provide a meaningful approximation of the OT gradient. These observations suggest that these solvers should not be treated as good estimators of $\mathbb{W}_{1}$, but that to some extent they can indeed be used in variational problems requiring the minimization of $\mathbb{W}_{1}$.
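    Among the dual solvers benchmarked here, the gradient-penalty variant is the easiest to write down: it softly enforces the 1-Lipschitz constraint on the critic. A PyTorch sketch for vector-shaped data (the penalty weight lam follows the common default):

        import torch

        def gradient_penalty(critic, real, fake, lam=10.0):
            eps = torch.rand(real.size(0), 1, device=real.device)
            x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
            grad, = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)
            return lam * ((grad.norm(2, dim=1) - 1) ** 2).mean()

        # Critic loss: E[f(fake)] - E[f(real)] + gradient_penalty(critic, real, fake)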
    Condensing Graphs via One-Step Gradient Matching. (arXiv:2206.07746v1 [cs.LG])
    As training deep learning models on large datasets takes a lot of time and resources, it is desirable to construct a small synthetic dataset with which we can train deep learning models sufficiently well. There are recent works that have explored solutions for condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large real data and small synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have their inherent limitations: (1) they are not directly applicable to graphs where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only a single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance, and our method is significantly faster than multi-step gradient matching (e.g., 15x on CIFAR10 for synthesizing 500 graphs).
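    The core of one-step gradient matching can be sketched generically: for a freshly initialized network, move the synthetic data so that its gradient matches the gradient on real data, without ever updating the network weights. Schematic PyTorch only; the paper additionally models discrete graph structure probabilistically, which is omitted here.

        import torch

        def one_step_match(net, loss_fn, real_x, real_y, syn_x, syn_y, lr=0.1):
            syn_x = syn_x.detach().requires_grad_(True)
            g_real = torch.autograd.grad(loss_fn(net(real_x), real_y), net.parameters())
            g_syn = torch.autograd.grad(loss_fn(net(syn_x), syn_y), net.parameters(),
                                        create_graph=True)
            # squared distance between the two gradient sets, differentiable in syn_x
            match = sum(((a - b).pow(2)).sum() for a, b in zip(g_syn, g_real))
            syn_grad, = torch.autograd.grad(match, syn_x)
            return (syn_x - lr * syn_grad).detach()   # one update of the synthetic data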
    SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. (arXiv:2206.07764v1 [cs.CV])
    The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v1 [cs.LG])
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications for privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    Improving Diversity with Adversarially Learned Transformations for Domain Generalization. (arXiv:2206.07736v1 [cs.LG])
    To be successful in single source domain generalization, maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Many of the recent successes have come from methods that pre-specify the types of diversity that a model is exposed to during training, so that it can ultimately generalize well to new domains. However, naïve diversity-based augmentations do not work effectively for domain generalization, either because they cannot model large domain shift, or because the span of transforms that are pre-specified does not cover the types of shift commonly occurring in domain generalization. To address this issue, we present a novel framework that uses adversarially learned transformations (ALT), using a neural network to model plausible, yet hard, image transformations that fool the classifier. This network is randomly initialized for each batch and trained for a fixed number of steps to maximize classification error. Further, we enforce consistency between the classifier's predictions on the clean and transformed images. With extensive empirical analysis, we find that this new form of adversarial transformation achieves both objectives of diversity and hardness simultaneously, outperforming all existing techniques on competitive benchmarks for single source domain generalization. We also show that ALT can naturally work with existing diversity modules to produce highly distinct, and large transformations of the source domain leading to state-of-the-art performance.
    Gaussian Blue Noise. (arXiv:2206.07798v1 [cs.GR])
    Among the various approaches for producing point distributions with blue noise spectrum, we argue for an optimization framework using Gaussian kernels. We show that with a wise selection of optimization parameters, this approach attains unprecedented quality, provably surpassing the current state of the art attained by the optimal transport (BNOT) approach. Further, we show that our algorithm scales smoothly and feasibly to high dimensions while maintaining the same quality, realizing unprecedented high-quality high-dimensional blue noise sets. Finally, we show an extension to adaptive sampling.
    Simple and Efficient Architectures for Semantic Segmentation. (arXiv:2206.08236v1 [cs.CV])
    Though state-of-the-art architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and further they make use of operations that are inefficient on current hardware. This paper demonstrates that a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head performs on par with or better than complex semantic segmentation architectures such as HRNet, FANet and DDRNets. Naively applying deep backbones designed for Image Classification to the task of Semantic Segmentation leads to sub-par results, owing to the much smaller effective receptive field of these backbones. Implicit among the various design choices put forth in works like HRNet, DDRNet, and FANet are networks with a large effective receptive field. It is natural to ask if a simple encoder-decoder architecture would compare favorably if comprised of backbones that have a larger effective receptive field, though without the use of inefficient operations like dilated convolutions. We show that with minor and inexpensive modifications to ResNets, enlarging the receptive field, very simple and competitive baselines can be created for Semantic Segmentation. We present a family of such simple architectures for desktop as well as mobile targets, which match or exceed the performance of complex models on the Cityscapes dataset. We hope that our work provides simple yet effective baselines for practitioners to develop efficient semantic segmentation models.
    GoodBye WaveNet -- A Language Model for Raw Audio with Context of 1/2 Million Samples. (arXiv:2206.08297v1 [cs.SD])
    Modeling long-term dependencies in audio signals is a particularly challenging problem, as even small time scales yield on the order of a hundred thousand samples. With the recent advent of Transformers, neural architectures became good at modeling dependencies over longer time scales, but they suffered from quadratic scaling constraints. We propose a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples. Our model learns a latent representation with a CNN front-end and then models dependencies over these representations using Transformer encoders, fully trained end-to-end, thereby allowing it to learn whatever representations best serve next-sample prediction. Unlike previous works that compared different time scales to show improvement, we use a standard dataset, with the same number of parameters/context, to show improvements. We achieve state-of-the-art performance compared to other approaches such as WaveNet, SaSHMI, and Sample-RNN on a standard dataset for modeling long-term structure. This work points in an exciting direction for the field, given improvements in context modeling that can be scaled with more data, as well as potentially better results from using billions/trillions of parameters.
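    The described pipeline has a simple shape: a strided CNN front-end turns the raw waveform into latent frames, and Transformer encoder layers model dependencies across those frames. A toy PyTorch stand-in; the layer sizes and the 8-bit output head are assumptions, and a causal mask (omitted for brevity) would be needed for strict autoregression.

        import torch
        import torch.nn as nn

        class AudioLM(nn.Module):
            def __init__(self, d=128, n_layers=4):
                super().__init__()
                self.frontend = nn.Conv1d(1, d, kernel_size=512, stride=256)  # downsample
                layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
                self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
                self.head = nn.Linear(d, 256)     # e.g. next mu-law sample logits
            def forward(self, wav):               # wav: (batch, 1, n_samples)
                z = self.frontend(wav).transpose(1, 2)   # (batch, frames, d)
                return self.head(self.encoder(z))

        logits = AudioLM()(torch.randn(2, 1, 65536))   # ~255 latent frames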
    Towards Robust and Reproducible Active Learning Using Neural Networks. (arXiv:2002.09564v3 [cs.LG] UPDATED)
    Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce an inconsistent gain over random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that variance in performance metrics achieved by AL algorithms can lead to results that are not consistent with the previously reported results. We also found that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results using a new AL algorithm to ensure results are reproducible and robust under changes in experimental conditions. We share our codes to facilitate AL evaluations. We believe our findings and recommendations will help advance reproducible research in AL using neural networks. We open source our code at https://github.com/PrateekMunjal/TorchAL
    Compressed-VFL: Communication-Efficient Learning with Vertically Partitioned Data. (arXiv:2206.08330v1 [cs.LG])
    We propose Compressed Vertical Federated Learning (C-VFL) for communication-efficient training on vertically partitioned data. In C-VFL, a server and multiple parties collaboratively train a model on their respective features utilizing several local iterations and sharing compressed intermediate results periodically. Our work provides the first theoretical analysis of the effect message compression has on distributed training over vertically partitioned data. We prove convergence of non-convex objectives at a rate of $O(\frac{1}{\sqrt{T}})$ when the compression error is bounded over the course of training. We provide specific requirements for convergence with common compression techniques, such as quantization and top-$k$ sparsification. Finally, we experimentally show compression can reduce communication by over $90\%$ without a significant decrease in accuracy over VFL without compression.
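    Of the compressors covered by the analysis, top-k sparsification is the simplest to illustrate: each party keeps only the k largest-magnitude entries of the message it would transmit. A NumPy sketch:

        import numpy as np

        def top_k(x, k):
            out = np.zeros_like(x)
            idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of k largest |x_i|
            out[idx] = x[idx]
            return out   # compression error ||x - out|| stays bounded per round

        msg = np.random.randn(1000)
        compressed = top_k(msg, k=100)   # ~90% of entries dropped before transmission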
    "Understanding Robustness Lottery": A Comparative Visual Analysis of Neural Network Pruning Approaches. (arXiv:2206.07918v1 [cs.HC])
    Deep learning approaches have provided state-of-the-art performance in many applications by relying on extremely large and heavily overparameterized neural networks. However, such networks have been shown to be very brittle, to not generalize well to new use cases, and are often difficult if not impossible to deploy on resource-limited platforms. Model pruning, i.e., reducing the size of the network, is a widely adopted strategy that can lead to more robust and generalizable networks -- usually orders of magnitude smaller with the same or even improved performance. While there exist many heuristics for model pruning, our understanding of the pruning process remains limited. Empirical studies show that some heuristics improve performance while others can make models more brittle or have other side effects. This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding. The proposed tool provides an environment for exploring and studying differences among pruning methods and between pruned and original models. By leveraging our visualization, ML researchers can not only identify samples that are fragile to model pruning and data corruption but also obtain insights and explanations on how some pruned models achieve superior robustness performance.
    Applications of Machine Learning to the Identification of Anomalous ER Claims. (arXiv:2206.08093v1 [cs.LG])
    Improper health insurance payments resulting from fraud and upcoding result in tens of billions of dollars in excess health care costs annually in the United States, motivating machine learning researchers to build anomaly detection models for health insurance claims. This article describes two such strategies specifically for ER claims. The first is an upcoding model based on severity code distributions, stratified by hierarchical diagnosis code clusters. A statistically significant difference in mean upcoding anomaly scores is observed between free-standing ERs and acute care hospitals, with free-standing ERs being more anomalous. The second model is a random forest that minimizes improper payments by optimally sorting ER claims within review queues. Depending on the percentage of claims reviewed, the random forest saved 12% to 40% above a baseline approach that prioritized claims by billed amount.
    Using adversarial images to improve outcomes of federated learning for non-IID data. (arXiv:2206.08124v1 [cs.LG])
    One of the important problems in federated learning is how to deal with unbalanced data. This contribution introduces a novel technique designed to deal with label-skewed non-IID data, using adversarial inputs created by the I-FGSM method. Adversarial inputs guide the training process and allow the Weighted Federated Averaging to give more importance to clients with 'selected' local label distributions. Experimental results, gathered from image classification tasks for the MNIST and CIFAR-10 datasets, are reported and analyzed.
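    I-FGSM itself is a few lines: repeated signed-gradient ascent steps on the loss, projected back into an epsilon-ball around the input. A standard PyTorch sketch (step sizes are illustrative):

        import torch
        import torch.nn.functional as F

        def ifgsm(model, x, y, eps=8/255, alpha=2/255, steps=10):
            x_adv = x.clone()
            for _ in range(steps):
                x_adv.requires_grad_(True)
                loss = F.cross_entropy(model(x_adv), y)
                grad, = torch.autograd.grad(loss, x_adv)
                x_adv = (x_adv + alpha * grad.sign()).detach()
                x_adv = x + (x_adv - x).clamp(-eps, eps)   # project into the eps-ball
                x_adv = x_adv.clamp(0, 1)                  # keep valid pixel range
            return x_adv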
    Not All Lotteries Are Made Equal. (arXiv:2206.08175v1 [cs.LG])
    The Lottery Ticket Hypothesis (LTH) states that for a reasonably sized neural network, a sub-network within the same network yields no less performance than the dense counterpart when trained from the same initialization. This work investigates the relation between model size and the ease of finding these sparse sub-networks. We show through experiments that, surprisingly, under a finite budget, smaller models benefit more from Ticket Search (TS).
    Deepfake histological images for enhancing digital pathology. (arXiv:2206.08308v1 [eess.IV])
    An optical microscopic examination of thinly cut stained tissue on glass slides prepared from FFPE tissue blocks is the gold standard for tissue diagnostics. In addition, the diagnostic abilities and expertise of any pathologist are dependent on their direct experience with common as well as rarer variant morphologies. Recently, deep learning approaches have been used to successfully show a high level of accuracy for such tasks. However, obtaining expert-level annotated images is an expensive and time-consuming task, and artificially synthesized histological images can prove greatly beneficial. Here, we present an approach to not only generate histological images that reproduce the diagnostic morphologic features of common disease but also provide the user the ability to generate new and rare morphologies. Our approach involves developing a generative adversarial network model that synthesizes pathology images constrained by class labels. We investigated the ability of this framework to synthesize realistic prostate and colon tissue images and assessed the utility of these images in augmenting the diagnostic ability of machine learning methods as well as their usability by a panel of experienced anatomic pathologists. Synthetic data generated by our framework performed similarly to real data in training a deep learning model for diagnosis. Pathologists were not able to distinguish between real and synthetic images and showed a similar level of inter-observer agreement for prostate cancer grading. We extended the approach to significantly more complex images from colon biopsies and showed that the complex microenvironment in such tissues can also be reproduced. Finally, we present the ability for a user to generate deepfake histological images via a simple markup of semantic labels.
    Participation and Data Valuation in IoT Data Markets through Distributed Coalitions. (arXiv:2206.07785v1 [cs.NI])
    This paper considers a market for Internet of Things (IoT) data that is used to train machine learning models. The data is supplied to the market platform through a network and the price of the data is controlled based on the value it brings to the machine learning model. We explore the correlation property of data in a game-theoretical setting to eventually derive a simplified distributed solution for a data trading mechanism that emphasizes the mutual benefit of devices and the market. The key proposal is an efficient algorithm for markets that jointly addresses the challenges of availability and heterogeneity in participation, as well as the transfer of trust and the economic value of data exchange in IoT networks. The proposed approach establishes the data market by reinforcing collaboration opportunities between devices with correlated data to avoid information leakage. Therein, we develop a network-wide optimization problem that maximizes the social value of coalition among the IoT devices of similar data types; at the same time, it minimizes the cost due to network externalities, i.e., the impact of information leakage due to data correlation, as well as the opportunity costs. Finally, we reveal the structure of the formulated problem as a distributed coalition game and solve it following the simplified split-and-merge algorithm. Simulation results show the efficacy of our proposed mechanism design toward a trusted IoT data market, with up to 32.72% gain in the average payoff for each seller.
    Generalization Bounds for Data-Driven Numerical Linear Algebra. (arXiv:2206.07886v1 [cs.LG])
    Data-driven algorithms can adapt their internal structure or parameters to inputs from unknown application-specific distributions, by learning from a training sample of inputs. Several recent works have applied this approach to problems in numerical linear algebra, obtaining significant empirical gains in performance. However, no theoretical explanation for their success was known. In this work we prove generalization bounds for those algorithms, within the PAC-learning framework for data-driven algorithm selection proposed by Gupta and Roughgarden (SICOMP 2017). Our main results are closely matching upper and lower bounds on the fat shattering dimension of the learning-based low rank approximation algorithm of Indyk et al. (NeurIPS 2019). Our techniques are general, and provide generalization bounds for many other recently proposed data-driven algorithms in numerical linear algebra, covering both sketching-based and multigrid-based methods. This considerably broadens the class of data-driven algorithms for which a PAC-learning analysis is available.
    Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence. (arXiv:2206.07892v1 [cs.LG])
    A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on \emph{uniform convergence} (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan and Kolter, and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that near-max-margin is important: while any model that achieves at least a $(1 - \epsilon)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. We additionally strengthen the UC impossibility results of Nagarajan and Kolter, proving that \emph{one-sided} UC bounds and classical margin bounds will fail on near-max-margin classifiers. Our analysis provides insight on why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.
    EPG2S: Speech Generation and Speech Enhancement based on Electropalatography and Audio Signals using Multimodal Learning. (arXiv:2206.07860v1 [cs.SD])
    Speech generation and enhancement based on articulatory movements facilitate communication when verbal communication is not possible, e.g., in patients who have lost the ability to speak. Although various techniques have been proposed to this end, electropalatography (EPG), a monitoring technique that records contact between the tongue and hard palate during speech, has not been adequately explored. Herein, we propose a novel multimodal EPG-to-speech (EPG2S) system that utilizes EPG and speech signals for speech generation and enhancement. Different fusion strategies based on multiple combinations of EPG and noisy speech signals are examined, and the viability of the proposed method is investigated. Experimental results indicate that EPG2S achieves desirable speech generation outcomes based solely on EPG signals. Further, the addition of noisy speech signals is observed to improve quality and intelligibility. Additionally, EPG2S is observed to achieve high-quality speech enhancement based solely on audio signals, with the addition of EPG signals further improving the performance. The late fusion strategy is deemed to be the most effective approach for simultaneous speech generation and enhancement.
    Pure Exploration of Causal Bandits. (arXiv:2206.07883v1 [cs.LG])
    The causal bandit problem integrates causal inference with multi-armed bandits. Its pure exploration version is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we can choose either to intervene on one variable or to do no intervention, and we observe the random outcomes of all random variables; the goal is, using as few rounds as possible, to output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent, fully adaptive pure exploration algorithms for three types of causal models: parallel graphs, general graphs with a small number of backdoor parents, and binary generalized linear models. Our algorithms improve on both prior causal bandit algorithms, which are not adaptive to reward gaps, and prior adaptive pure exploration algorithms, which do not utilize the special features of causal bandits.
    Conformal prediction set for time-series. (arXiv:2206.07851v1 [stat.ML])
    When building either prediction intervals for regression (with real-valued response) or prediction sets for classification (with categorical responses), uncertainty quantification is essential to studying complex machine learning methods. In this paper, we develop Ensemble Regularized Adaptive Prediction Set (ERAPS) to construct prediction sets for time-series (with categorical responses), based on the prior work of [Xu and Xie, 2021]. In particular, we allow unknown dependencies to exist within features and responses that arrive in sequence. Method-wise, ERAPS is a distribution-free and ensemble-based framework that is applicable for arbitrary classifiers. Theoretically, we bound the coverage gap without assuming data exchangeability and show asymptotic set convergence. Empirically, we demonstrate valid marginal and conditional coverage by ERAPS, which also tends to yield smaller prediction sets than competing methods.
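    For readers unfamiliar with conformal prediction sets for classification, the plain split-conformal construction (the exchangeable baseline that ERAPS extends with ensembling and to dependent data) can be sketched in a few lines; the helper name and the nonconformity score choice here are our assumptions, not the paper's procedure.

        import numpy as np

        def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            """Split-conformal prediction sets for classification.

            cal_probs:  (n, K) predicted class probabilities on a calibration set
            cal_labels: (n,)   true labels on the calibration set
            test_probs: (m, K) predicted probabilities on test points
            Returns a list of label index sets with ~(1 - alpha) marginal coverage
            (exact under exchangeability, which ERAPS relaxes for time series)."""
            n = len(cal_labels)
            # Nonconformity score: one minus the probability of the true class.
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]
            # Finite-sample-corrected quantile of the calibration scores.
            q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
            return [np.where(1.0 - p <= q)[0] for p in test_probs]

    Smaller sets at the same coverage level indicate a sharper method, which is the comparison criterion the abstract reports.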
    Multimodal Dialogue State Tracking. (arXiv:2206.07898v1 [cs.AI])
    Designed for tracking user goals in dialogues, a dialogue state tracker is an essential component of a dialogue system. However, research on dialogue state tracking has largely been limited to unimodality, in which slots and slot values are limited by knowledge domains (e.g. the restaurant domain, with slots such as restaurant name and price range) and are defined by a specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines both object-level features and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task which recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablation and qualitative analysis, we discovered interesting insights towards building more capable multimodal dialogue systems.
    Discovery and density estimation of latent confounders in Bayesian networks with evidence lower bound. (arXiv:2206.05490v2 [cs.LG] UPDATED)
    Discovering and parameterising latent confounders represent important and challenging problems in causal structure learning and density estimation, respectively. In this paper, we focus on both discovering and learning the distribution of latent confounders. This task requires solutions drawn from different areas of statistics and machine learning. We combine elements of variational Bayesian methods, expectation-maximisation, hill-climbing search, and structure learning under the assumption of causal insufficiency. We propose two learning strategies: one that maximises model selection accuracy, and another that improves computational efficiency in exchange for minor reductions in accuracy. The former strategy is suitable for small networks and the latter for moderate-size networks. Both learning strategies perform well relative to existing solutions.
    A machine learning approach to predicting pore pressure response in liquefiable sands under cyclic loading. (arXiv:2206.07780v1 [physics.geo-ph])
    Shear stress history controls the pore pressure response in liquefiable soils. The excess pore pressure does not increase under cyclic loading when the shear stress amplitude is lower than the peak prior amplitude -- the shielding effect. Many sophisticated constitutive models fail to capture the shielding effect observed in cyclic liquefaction experiments. We develop a data-driven machine learning model based on the LSTM neural network to capture the liquefaction response of soils under cyclic loading. The LSTM model is trained on 12 laboratory cyclic simple shear tests on Nevada sand in loose and dense conditions subjected to different cyclic simple shear loading conditions. The LSTM model takes the relative density of soil and the previous stress history as inputs to predict the pore water pressure response. The LSTM model successfully replicates the pore pressure response for three cyclic simple shear tests, capturing the shielding and density effects.
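    A minimal sketch of such a sequence-to-sequence LSTM in PyTorch follows; the feature layout, dimensions, and hyperparameters are our illustrative assumptions, not the paper's configuration.

        import torch
        import torch.nn as nn

        class PorePressureLSTM(nn.Module):
            """LSTM mapping a loading history to pore pressure at each step.

            Assumed feature layout per time step (illustrative):
            [shear stress, peak prior stress amplitude, relative density]."""
            def __init__(self, n_features=3, hidden=64, layers=2):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, num_layers=layers,
                                    batch_first=True)
                self.head = nn.Linear(hidden, 1)  # pore pressure ratio per step

            def forward(self, x):                  # x: (batch, time, n_features)
                h, _ = self.lstm(x)
                return self.head(h).squeeze(-1)    # (batch, time)

        # Training-loop sketch: sequence regression on lab test records.
        model = PorePressureLSTM()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        x = torch.randn(8, 200, 3)   # stand-in for 8 cyclic shear histories
        y = torch.rand(8, 200)       # stand-in for measured pressure ratios
        for _ in range(10):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()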
    Risk-Averse No-Regret Learning in Online Convex Games. (arXiv:2203.08957v2 [cs.LG] UPDATED)
    We consider an online stochastic game with risk-averse agents whose goal is to learn optimal decisions that minimize the risk of incurring significantly high costs. Specifically, we use the Conditional Value at Risk (CVaR) as a risk measure that the agents can estimate using bandit feedback in the form of the cost values of only their selected actions. Since the distributions of the cost functions depend on the actions of all agents that are generally unobservable, they are themselves unknown and, therefore, the CVaR values of the costs are difficult to compute. To address this challenge, we propose a new online risk-averse learning algorithm that relies on one-point zeroth-order estimation of the CVaR gradients computed using CVaR values that are estimated by appropriately sampling the cost functions. We show that this algorithm achieves sub-linear regret with high probability. We also propose two variants of this algorithm that improve performance. The first variant relies on a new sampling strategy that uses samples from the previous iteration to improve the estimation accuracy of the CVaR values. The second variant employs residual feedback that uses CVaR values from the previous iteration to reduce the variance of the CVaR gradient estimates. We theoretically analyze the convergence properties of these variants and illustrate their performance on an online market problem that we model as a Cournot game.  ( 2 min )
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v2 [cs.LG] UPDATED)
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.  ( 2 min )
    Generalizing to Evolving Domains with Latent Structure-Aware Sequential Autoencoder. (arXiv:2205.07649v2 [cs.LG] UPDATED)
    Domain generalization aims to improve the generalization capability of machine learning systems to out-of-distribution (OOD) data. Existing domain generalization techniques assume stationary and discrete environments to tackle the generalization issue caused by OOD data. However, many real-world tasks in non-stationary environments (e.g. self-driving car systems, sensor measurements) involve more complex and continuously evolving domain drift, which raises new challenges for the problem of domain generalization. In this paper, we formulate the aforementioned setting as the problem of evolving domain generalization. Specifically, we propose to introduce a probabilistic framework called Latent Structure-aware Sequential Autoencoder (LSSAE) to tackle the problem of evolving domain generalization via exploring the underlying continuous structure in the latent space of deep neural networks, where we aim to identify two major factors, namely covariate shift and concept shift, accounting for distribution shift in non-stationary environments. Experimental results on both synthetic and real-world datasets show that LSSAE can lead to superior performance in the evolving domain generalization setting.  ( 2 min )
    Convergence of Policy Gradient for Entropy Regularized MDPs with Neural Network Approximation in the Mean-Field Regime. (arXiv:2201.07296v2 [math.OC] UPDATED)
    We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, and entropy-regularized Markov decision processes (MDPs). We consider a softmax policy with (one-hidden layer) neural network approximation in a mean-field regime. Additional entropic regularization in the associated mean-field probability measure is added, and the corresponding gradient flow is studied in the 2-Wasserstein metric. We show that the objective function is increasing along the gradient flow. Further, we prove that if the regularization in terms of the mean-field measure is sufficient, the gradient flow converges exponentially fast to the unique stationary solution, which is the unique maximizer of the regularized MDP objective. Lastly, we study the sensitivity of the value function along the gradient flow with respect to regularization parameters and the initial condition. Our results rely on the careful analysis of the non-linear Fokker-Planck-Kolmogorov equation and extend the pioneering work of Mei et al. 2020 and Agarwal et al. 2020, which quantify the global convergence rate of policy gradient for entropy-regularized MDPs in the tabular setting.  ( 2 min )
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v3 [stat.ML] UPDATED)
    The recently developed matrix-based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi-definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property makes the new information measurement widely adopted in multiple statistical inference and learning tasks. However, the computation of such a quantity involves the trace operator on a PSD matrix $G$ raised to power $\alpha$ (i.e., $tr(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into a matrix-vector multiplication problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.  ( 2 min )
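    As a concrete illustration of the matrix-vector-multiplication idea, here is a hedged sketch combining a Hutchinson trace estimator with a Chebyshev approximation of $x^\alpha$; it is one simple instance of the family of approximations described, not the authors' implementation.

        import numpy as np
        from numpy.polynomial import Chebyshev

        def tr_G_alpha(G, alpha, deg=30, probes=20, rng=None):
            """Hutchinson + Chebyshev estimate of tr(G^alpha) for PSD G.

            Uses only matrix-vector products with G. Any upper bound on the
            largest eigenvalue suffices; we cheat with eigvalsh for clarity
            (replace by power iteration at scale)."""
            rng = np.random.default_rng(rng)
            n = G.shape[0]
            lmax = np.linalg.eigvalsh(G)[-1]
            # Chebyshev fit of x^alpha on [0, lmax]; t = 2x/lmax - 1 internally.
            c = Chebyshev.interpolate(lambda x: x**alpha, deg,
                                      domain=[0.0, lmax]).coef
            Tv = lambda v: (2.0 / lmax) * (G @ v) - v   # mapped operator
            est = 0.0
            for _ in range(probes):
                z = rng.choice([-1.0, 1.0], size=n)     # Rademacher probe
                # Clenshaw recurrence evaluates p(G) z with matrix-vector products.
                b1 = np.zeros(n); b2 = np.zeros(n)
                for k in range(deg, 0, -1):
                    b1, b2 = c[k] * z + 2.0 * Tv(b1) - b2, b1
                est += z @ (c[0] * z + Tv(b1) - b2)
            return est / probes

    The cost is dominated by probes x degree matrix-vector products with $G$, which is how the complexity drops below the $O(n^3)$ of a full eigendecomposition.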
    Masked-attention Mask Transformer for Universal Image Segmentation. (arXiv:2112.01527v3 [cs.CV] UPDATED)
    Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).  ( 2 min )
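    The masked-attention component can be sketched in a few lines of PyTorch; this single-head version with a hard 0.5 threshold is our simplification for illustration, not the released Mask2Former code.

        import torch
        import torch.nn.functional as F

        def masked_cross_attention(queries, keys, values, mask_logits):
            """Single-head sketch of masked cross-attention.

            queries:     (Nq, d) object queries
            keys/values: (Np, d) per-pixel image features
            mask_logits: (Nq, Np) mask predicted by the previous decoder layer
            Attention of each query is confined to its predicted foreground;
            queries with an empty mask fall back to full cross-attention."""
            scores = queries @ keys.T / keys.shape[-1] ** 0.5     # (Nq, Np)
            fg = mask_logits.sigmoid() > 0.5                      # binarized region
            masked = scores.masked_fill(~fg, float("-inf"))
            # Keep softmax well-defined when a mask is all background.
            scores = torch.where(fg.any(-1, keepdim=True), masked, scores)
            return F.softmax(scores, dim=-1) @ values             # (Nq, d)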
    An Asymptotic Test for Conditional Independence using Analytic Kernel Embeddings. (arXiv:2110.14868v2 [stat.ML] UPDATED)
    We propose a new conditional dependence measure and a statistical test for conditional independence. The measure is based on the difference between analytic kernel embeddings of two well-suited distributions evaluated at a finite set of locations. We obtain its asymptotic distribution under the null hypothesis of conditional independence and design a consistent statistical test from it. We conduct a series of experiments showing that our new test outperforms state-of-the-art methods both in terms of type-I and type-II errors even in the high dimensional setting.  ( 2 min )
    Interpretable and Generalizable Graph Learning via Stochastic Attention Mechanism. (arXiv:2201.12987v2 [cs.LG] UPDATED)
    Interpretable graph learning is needed, as many scientific applications depend on learning models to extract insights from graph-structured data. Previous works mostly focused on using post-hoc approaches to interpret a pre-trained model (graph neural network models in particular). They argue against inherently interpretable models because good interpretation of these models is often at the cost of their prediction accuracy. Moreover, the widely used attention mechanism for inherent interpretation often fails to provide faithful interpretation in graph learning tasks. In this work, we address both issues by proposing Graph Stochastic Attention (GSAT), an attention mechanism derived from the information bottleneck principle. GSAT leverages stochastic attention to block the information from the task-irrelevant graph components while learning stochasticity-reduced attention to select the task-relevant subgraphs for interpretation. GSAT can also be applied to fine-tune and interpret pre-trained models via its stochastic attention mechanism. Extensive experiments on eight datasets show that GSAT outperforms the state-of-the-art methods by up to 20%$\uparrow$ in interpretation AUC and 5%$\uparrow$ in prediction accuracy.  ( 2 min )
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v2 [cs.LG] UPDATED)
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models. Code is available at https://github.com/P2333/SCORE.  ( 2 min )
    Neural Enhanced Belief Propagation for Data Association in Multiobject Tracking. (arXiv:2203.09948v3 [cs.CV] UPDATED)
    Situation-aware technologies enabled by multiobject tracking (MOT) methods will create new services and applications in fields such as autonomous navigation and applied ocean sciences. Belief propagation (BP) is a state-of-the-art method for Bayesian MOT but fully relies on a statistical model and preprocessed sensor measurements. In this paper, we establish a hybrid method for model-based and data-driven MOT. The proposed neural enhanced belief propagation (NEBP) approach complements BP by information learned from raw sensor data with the goal to improve data association and to reject false alarm measurements. We evaluate the performance of our NEBP approach for MOT on the nuScenes autonomous driving dataset and demonstrate that it can outperform state-of-the-art reference methods.  ( 2 min )
    Deep Reinforcement Learning, a textbook. (arXiv:2201.02135v3 [cs.AI] UPDATED)
    Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvers such as loops and rolls. In some applications they have even become better than the best humans, such as in Atari, Go, poker and StarCraft. The way in which deep reinforcement learning explores complex environments reminds us of how children learn, by playfully trying out things, getting feedback, and trying again. The computer seems to truly possess aspects of human learning; this goes to the heart of the dream of artificial intelligence. The successes in research have not gone unnoticed by educators, and universities have started to offer courses on the subject. The aim of this book is to provide a comprehensive overview of the field of deep reinforcement learning. The book is written for graduate students of artificial intelligence, and for researchers and practitioners who wish to better understand deep reinforcement learning methods and their challenges. We assume an undergraduate-level of understanding of computer science and artificial intelligence; the programming language of this book is Python. We describe the foundations, the algorithms and the applications of deep reinforcement learning. We cover the established model-free and model-based methods that form the basis of the field. Developments go quickly, and we also cover advanced topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement learning, and deep meta learning.  ( 2 min )
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.  ( 2 min )
    Model Zoo: A Growing "Brain" That Learns Continually. (arXiv:2106.03027v3 [cs.LG] UPDATED)
    This paper argues that continual learning methods can benefit by splitting the capacity of the learner across multiple models. We use statistical learning theory and experimental analysis to show how multiple tasks can interact with each other in a non-trivial fashion when a single model is trained on them. The generalization error on a particular task can improve when it is trained with synergistic tasks, but can also deteriorate when trained with competing tasks. This theory motivates our method named Model Zoo which, inspired by the boosting literature, grows an ensemble of small models, each of which is trained during one episode of continual learning. We demonstrate that Model Zoo obtains large gains in accuracy on a variety of continual learning benchmark problems. Code is available at https://github.com/grasp-lyrl/modelzoo_continual.  ( 2 min )
    Flowformer: Linearizing Transformers with Conservation Flows. (arXiv:2202.06258v2 [cs.LG] UPDATED)
    Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has a quadratic complexity, significantly impeding Transformers from dealing with numerous tokens and scaling up to bigger models. Previous methods mainly utilize the similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as locality, at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases based on flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation to attention and propose the Flow-Attention mechanism of linear complexity. By respectively conserving the incoming flow of sinks for source competition and the outgoing flow of sources for sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by the Flow-Attention, Flowformer yields strong performance in linear time for wide areas, including long sequence, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.  ( 2 min )
    Graph Signal Reconstruction Techniques for IoT Air Pollution Monitoring Platforms. (arXiv:2201.00378v2 [eess.SP] UPDATED)
    Air pollution monitoring platforms play a very important role in preventing and mitigating the effects of pollution. Recent advances in the field of graph signal processing have made it possible to describe and analyze air pollution monitoring networks using graphs. One of the main applications is the reconstruction of the measured signal over the graph using a subset of sensors. Reconstructing the signal using information from sensor neighbors can help improve the quality of network data; examples include filling in missing data with correlated neighboring nodes, or correcting a drifting sensor with neighboring sensors that are more accurate. This paper compares the use of various types of graph signal reconstruction methods applied to real data sets of Spanish air pollution reference stations. The methods considered are Laplacian interpolation, low-pass-based graph signal reconstruction, and kernel-based graph signal reconstruction, and are compared on actual air pollution data sets measuring O3, NO2, and PM10. The ability of the methods to reconstruct the signal of a pollutant is shown, as well as the computational cost of this reconstruction. The results indicate the superiority of methods based on kernel-based graph signal reconstruction, as well as the difficulty the methods have scaling to an air pollution monitoring network with a large number of low-cost sensors. However, we show that scalability can be overcome with simple methods, such as partitioning the network using a clustering algorithm.  ( 2 min )
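    Of the compared methods, Laplacian interpolation is the simplest to write down: fix the observed sensors and choose the unobserved values to minimize the quadratic smoothness form $x^T L x$, which reduces to one linear solve. A minimal NumPy sketch (dense solve; a sparse solver would be used at network scale):

        import numpy as np

        def laplacian_interpolate(L, x_known, known_idx):
            """Reconstruct a smooth graph signal from a subset of sensors.

            Minimizing x^T L x subject to x agreeing with the observed nodes
            yields the linear system L_uu x_u = -L_uk x_k for the unknown
            entries. L is the (n, n) combinatorial graph Laplacian."""
            n = L.shape[0]
            known = np.zeros(n, dtype=bool)
            known[known_idx] = True
            unknown = ~known
            L_uu = L[np.ix_(unknown, unknown)]
            L_uk = L[np.ix_(unknown, known)]
            x = np.empty(n)
            x[known] = x_known
            x[unknown] = np.linalg.solve(L_uu, -L_uk @ x_known)
            return x

    The system is well-posed whenever each connected component of the graph contains at least one observed sensor.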
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v3 [cs.LG] UPDATED)
    Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, unable to capture the global view of time series (e.g. the overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier basis, and develop a frequency-enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than the standard Transformer, with complexity linear in the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by $14.8\%$ and $22.6\%$ for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.  ( 2 min )
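    The two ingredients named here, seasonal-trend decomposition and a sparse Fourier representation, can each be sketched in a few lines of PyTorch; these are simplified stand-ins for the corresponding FEDformer blocks, not the released implementation.

        import torch
        import torch.nn.functional as F

        def series_decomp(x, kernel=25):
            """Moving-average split into seasonal and trend parts.
            x: (batch, time, channels); kernel should be odd so the
            padded average pool preserves the sequence length."""
            pad = kernel // 2
            trend = F.avg_pool1d(x.transpose(1, 2), kernel, stride=1,
                                 padding=pad, count_include_pad=False).transpose(1, 2)
            return x - trend, trend

        def frequency_select(x, k=16):
            """Keep only the k largest-magnitude Fourier modes per channel,
            exploiting the sparse frequency representation the paper builds on."""
            Xf = torch.fft.rfft(x, dim=1)                    # (batch, freq, channels)
            mag = Xf.abs()
            kth = mag.topk(min(k, mag.shape[1]), dim=1).values[:, -1:, :]
            Xf = Xf * (mag >= kth)                           # zero the small modes
            return torch.fft.irfft(Xf, n=x.shape[1], dim=1)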
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v2 [stat.ML] UPDATED)
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.  ( 2 min )
    Cyclical Focal Loss. (arXiv:2202.08978v2 [cs.CV] UPDATED)
    The cross-entropy softmax loss is the primary loss function used to train deep neural networks. On the other hand, the focal loss function has been demonstrated to provide improved performance when there is an imbalance in the number of training samples in each class, such as in long-tailed datasets. In this paper, we introduce a novel cyclical focal loss and demonstrate that it is a more universal loss function than cross-entropy softmax loss or focal loss. We describe the intuition behind the cyclical focal loss and our experiments provide evidence that cyclical focal loss provides superior performance for balanced, imbalanced, or long-tailed datasets. We provide numerous experimental results for CIFAR-10/CIFAR-100, ImageNet, balanced and imbalanced 4,000 training sample versions of CIFAR-10/CIFAR-100, and ImageNet-LT and Places-LT from the Open Long-Tailed Recognition (OLTR) challenge. Implementing the cyclical focal loss function requires only a few lines of code and does not increase training time. In the spirit of reproducibility, our code is available at \url{https://github.com/lnsmith54/CFL}.  ( 2 min )
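    To illustrate the "few lines of code" claim, here is a standard focal loss together with one possible triangular schedule for the focusing parameter. The published cyclical focal loss uses its own two-term formulation and schedule (see the linked repository), so treat this purely as plumbing under our assumed schedule.

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, targets, gamma):
            """Multi-class focal loss: mean of -(1 - p_t)^gamma * log p_t."""
            logp = F.log_softmax(logits, dim=-1)
            logp_t = logp.gather(1, targets.unsqueeze(1)).squeeze(1)
            p_t = logp_t.exp()
            return ((1.0 - p_t) ** gamma * -logp_t).mean()

        def cyclical_gamma(epoch, total_epochs, gamma_max=3.0):
            """Illustrative triangular schedule: heavier focusing mid-training,
            lighter at the start and end of the run."""
            phase = epoch / max(total_epochs - 1, 1)   # progress in [0, 1]
            return gamma_max * (1.0 - abs(2.0 * phase - 1.0))

    In a training loop one would call focal_loss(logits, targets, cyclical_gamma(epoch, total_epochs)) so the focusing strength varies over the run.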
    Wild ToFu: Improving Range and Quality of Indirect Time-of-Flight Depth with RGB Fusion in Challenging Environments. (arXiv:2112.03750v2 [cs.CV] UPDATED)
    Indirect Time-of-Flight (I-ToF) imaging is a widespread way of depth estimation for mobile devices due to its small size and affordable price. Previous works have mainly focused on quality improvement for I-ToF imaging, especially mitigating the effect of Multi-Path Interference (MPI). These investigations are typically done in specifically constrained scenarios at close distance, indoors and under little ambient light. Surprisingly little work has investigated I-ToF quality improvement in real-life scenarios where strong ambient light and far distances pose difficulties due to an extreme amount of induced shot noise and signal sparsity, caused by the attenuation with limited sensor power and light scattering. In this work, we propose a new learning-based end-to-end depth prediction network which takes noisy raw I-ToF signals as well as an RGB image and fuses their latent representation based on a multi-step approach involving both implicit and explicit alignment to predict a high-quality long-range depth map aligned to the RGB viewpoint. We test our approach on challenging real-world scenes and show more than 40% RMSE improvement on the final depth map compared to the baseline approach.  ( 2 min )
    Continual Repeated Annealed Flow Transport Monte Carlo. (arXiv:2201.13117v2 [stat.ML] UPDATED)
    We propose Continual Repeated Annealed Flow Transport Monte Carlo (CRAFT), a method that combines a sequential Monte Carlo (SMC) sampler (itself a generalization of Annealed Importance Sampling) with variational inference using normalizing flows. The normalizing flows are directly trained to transport between annealing temperatures using a KL divergence for each transition. This optimization objective is itself estimated using the normalizing flow/SMC approximation. We show conceptually and using multiple empirical examples that CRAFT improves on Annealed Flow Transport Monte Carlo (Arbel et al., 2021), on which it builds and also on Markov chain Monte Carlo (MCMC) based Stochastic Normalizing Flows (Wu et al., 2020). By incorporating CRAFT within particle MCMC, we show that such learnt samplers can achieve impressively accurate results on a challenging lattice field theory example.  ( 2 min )
  • Open

    mlf-core: a framework for deterministic machine learning. (arXiv:2104.07651v2 [cs.MS] UPDATED)
    Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. However, major machine learning libraries default to the usage of non-deterministic algorithms based on atomic operations. Solely fixing all random seeds is not sufficient for deterministic machine learning. To overcome this shortcoming, various machine learning libraries released deterministic counterparts to the non-deterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which helps machine learning projects meet and maintain these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in CT scans, and a liver cancer classifier based on gene expression profiles with XGBoost.
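    The kind of settings involved can be illustrated for PyTorch with well-known determinism knobs; this sketch shows the general idea rather than mlf-core's own templates, which cover multiple frameworks.

        import os, random
        import numpy as np
        import torch

        def make_deterministic(seed=0):
            """Seed every RNG and opt into deterministic kernels in PyTorch.

            Seeding alone is not enough: non-deterministic CUDA kernels (e.g.
            atomic-add based scatter ops) must also be disabled. Some ops will
            raise if no deterministic implementation exists, surfacing the
            problem instead of silently hiding it."""
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)                 # also seeds all CUDA devices
            torch.backends.cudnn.benchmark = False  # no run-dependent autotuning
            torch.backends.cudnn.deterministic = True
            # Required by cuBLAS for deterministic GEMMs on CUDA >= 10.2.
            os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
            torch.use_deterministic_algorithms(True)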
    Feature Selection using e-values. (arXiv:2206.05391v2 [stat.ML] UPDATED)
    In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a $p$-dimensional feature space, this procedure requires fitting only the full model and evaluating $p+1$ models, as opposed to the traditional requirement of fitting and evaluating $2^p$ models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.
    HyperImpute: Generalized Iterative Imputation with Automatic Model Selection. (arXiv:2206.07769v1 [stat.ML])
    Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
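    The iterative, column-wise loop with automatic model selection can be sketched from scratch with scikit-learn; this is our simplified rendering of the paradigm, not the HyperImpute package API.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score

        def iterative_impute(X, n_rounds=5,
                             candidates=(Ridge, RandomForestRegressor)):
            """Column-wise iterative imputation with per-column model selection.

            Each round, every incomplete column is regressed on the others and
            its missing entries refreshed, with the learner for that column
            chosen by cross-validated R^2 among the candidates."""
            X = X.copy()
            miss = np.isnan(X)
            col_means = np.nanmean(X, axis=0)
            X[miss] = np.take(col_means, np.where(miss)[1])   # crude init
            for _ in range(n_rounds):
                for j in range(X.shape[1]):
                    if not miss[:, j].any():
                        continue
                    obs = ~miss[:, j]
                    Xo = np.delete(X[obs], j, axis=1)
                    yo = X[obs, j]
                    # Automatic model selection for this column.
                    scores = [cross_val_score(c(), Xo, yo, cv=3).mean()
                              for c in candidates]
                    best = candidates[int(np.argmax(scores))]().fit(Xo, yo)
                    X[miss[:, j], j] = best.predict(
                        np.delete(X[miss[:, j]], j, axis=1))
            return X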
    On the Surprising Behaviour of node2vec. (arXiv:2206.08252v1 [cs.LG])
    Graph embedding techniques are a staple of modern graph learning research. When using embeddings for downstream tasks such as classification, information about their stability and robustness, i.e., their susceptibility to sources of noise, stochastic effects, or specific parameter choices, becomes increasingly important. As one of the most prominent graph embedding schemes, we focus on node2vec and analyse its embedding quality from multiple perspectives. Our findings indicate that embedding quality is unstable with respect to parameter choices, and we propose strategies to remedy this in practice.
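    One simple way to quantify the stability discussed here is to compare k-nearest-neighbour sets between two embedding runs of the same graph (e.g. different random seeds or walk samples); the diagnostic below is our illustrative choice, not the paper's exact metric.

        import numpy as np

        def knn_overlap(E1, E2, k=10):
            """Mean Jaccard overlap of k-NN sets between two embedding runs
            of the same graph (rows aligned by node id). Values near 1 mean
            stable neighbourhoods; values near 0 mean seed-sensitive ones."""
            def knn(E):
                # Cosine-similarity neighbourhoods, excluding the node itself.
                E = E / np.linalg.norm(E, axis=1, keepdims=True)
                sim = E @ E.T
                np.fill_diagonal(sim, -np.inf)
                return np.argsort(-sim, axis=1)[:, :k]
            n1, n2 = knn(E1), knn(E2)
            jac = [len(set(a) & set(b)) / len(set(a) | set(b))
                   for a, b in zip(n1, n2)]
            return float(np.mean(jac))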
    The convergent Indian buffet process. (arXiv:2206.08002v1 [stat.ML])
    We propose a new Bayesian nonparametric prior for latent feature models, which we call the convergent Indian buffet process (CIBP). We show that under the CIBP, the number of latent features is distributed as a Poisson distribution with the mean monotonically increasing but converging to a certain value as the number of objects goes to infinity. That is, the expected number of features is bounded above even when the number of objects goes to infinity, unlike the standard Indian buffet process under which the expected number of features increases with the number of objects. We provide two alternative representations of the CIBP based on a hierarchical distribution and a completely random measure, respectively, which are of independent interest. The proposed CIBP is assessed on a high-dimensional sparse factor model.
    Causal discovery under a confounder blanket. (arXiv:2205.05715v2 [stat.ME] UPDATED)
    Inferring causal relationships from observational data is rarely straightforward, but the problem is especially difficult in high dimensions. For these applications, causal discovery algorithms typically require parametric restrictions or extreme sparsity constraints. We relax these assumptions and focus on an important but more specialized problem, namely recovering the causal order among a subgraph of variables known to descend from some (possibly large) set of confounding covariates, i.e. a $\textit{confounder blanket}$. This is useful in many settings, for example when studying a dynamic biomolecular subsystem with genetic data providing background information. Under a structural assumption called the $\textit{confounder blanket principle}$, which we argue is essential for tractable causal discovery in high dimensions, our method accommodates graphs of low or high sparsity while maintaining polynomial time complexity. We present a structure learning algorithm that is provably sound and complete with respect to a so-called $\textit{lazy oracle}$. We design inference procedures with finite sample error control for linear and nonlinear systems, and demonstrate our approach on a range of simulated and real-world datasets. An accompanying $\texttt{R}$ package, $\texttt{cbl}$, is available from $\texttt{CRAN}$.
    Three rates of convergence or separation via U-statistics in a dependent framework. (arXiv:2106.12796v2 [math.ST] UPDATED)
    Despite the ubiquity of U-statistics in modern Probability and Statistics, their non-asymptotic analysis in a dependent framework may have been overlooked. In a recent work, a new concentration inequality for U-statistics of order two for uniformly ergodic Markov chains has been proved. In this paper, we put this theoretical breakthrough into action by pushing further the current state of knowledge in three different active fields of research. First, we establish a new exponential inequality for the estimation of spectra of trace class integral operators with MCMC methods. The novelty is that this result holds for kernels with positive and negative eigenvalues, which is new as far as we know. In addition, we investigate generalization performance of online algorithms working with pairwise loss functions and Markov chain samples. We provide an online-to-batch conversion result by showing how we can extract a low risk hypothesis from the sequence of hypotheses generated by any online learner. We finally give a non-asymptotic analysis of a goodness-of-fit test on the density of the invariant measure of a Markov chain. We identify some classes of alternatives over which our test based on the $L_2$ distance has a prescribed power.
    Multimeasurement Generative Models. (arXiv:2112.09822v2 [stat.ML] UPDATED)
    We formally map the problem of sampling from an unknown distribution with a density in $\mathbb{R}^d$ to the problem of learning and sampling a smoother density in $\mathbb{R}^{Md}$ obtained by convolution with a fixed factorial kernel: the new density is referred to as M-density and the kernel as multimeasurement noise model (MNM). The M-density in $\mathbb{R}^{Md}$ is smoother than the original density in $\mathbb{R}^d$, easier to learn and sample from, yet for large $M$ the two problems are mathematically equivalent since clean data can be estimated exactly given a multimeasurement noisy observation using the Bayes estimator. To formulate the problem, we derive the Bayes estimator for Poisson and Gaussian MNMs in closed form in terms of the unnormalized M-density. This leads to a simple least-squares objective for learning parametric energy and score functions. We present various parametrization schemes of interest including one in which studying Gaussian M-densities directly leads to multidenoising autoencoders--this is the first theoretical connection made between denoising autoencoders and empirical Bayes in the literature. Samples in $\mathbb{R}^d$ are obtained by walk-jump sampling (Saremi & Hyvarinen, 2019) via underdamped Langevin MCMC (walk) to sample from M-density and the multimeasurement Bayes estimation (jump). We study permutation invariant Gaussian M-densities on MNIST, CIFAR-10, and FFHQ-256 datasets, and demonstrate the effectiveness of this framework for realizing fast-mixing stable Markov chains in high dimensions.
    Neural tangent kernel analysis of shallow $\alpha$-Stable ReLU neural networks. (arXiv:2206.08065v1 [cs.LG])
    There is a recent literature on large-width properties of Gaussian neural networks (NNs), i.e. NNs whose weights are distributed according to Gaussian distributions. Two popular problems are: i) the study of the large-width behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the large-width training dynamics of NNs, which set forth an equivalence between training the rescaled NN and performing a kernel regression with a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for $\alpha$-Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as $\alpha$-Stable distributions with $\alpha\in(0,2]$, i.e. distributions with heavy tails. For shallow $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, i.e. a stochastic process with $\alpha$-Stable finite-dimensional distributions. As a novelty with respect to the Gaussian setting, in the $\alpha$-Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub-linear functions. Then, our main contribution is the NTK analysis of shallow $\alpha$-Stable ReLU-NNs, which leads to an equivalence between training a rescaled NN and performing a kernel regression with an $(\alpha/2)$-Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting the randomness of the NN at initialization does not vanish in the NTK analysis, thus inducing a distribution for the kernel of the underlying kernel regression.
    Deep Learning for Time Series Forecasting: Tutorial and Literature Survey. (arXiv:2004.10240v2 [cs.LG] UPDATED)
    Deep learning based forecasting methods have become the methods of choice in many applications of time series prediction or forecasting, often outperforming other approaches. Consequently, over the last few years, these methods have become ubiquitous in large-scale industrial forecasting applications and have consistently ranked among the best entries in forecasting competitions (e.g., M4 and M5). This practical success has further increased the academic interest in understanding and improving deep forecasting methods. In this article we provide an introduction and overview of the field: We present important building blocks for deep forecasting in some depth; using these building blocks, we then survey the breadth of the recent deep forecasting literature.
    Pareto Invariant Risk Minimization. (arXiv:2206.07766v1 [cs.LG])
    Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise the optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning for the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
    Squeeze All: Novel Estimator and Self-Normalized Bound for Linear Contextual Bandits. (arXiv:2206.05404v2 [stat.ML] UPDATED)
    We propose a novel algorithm for linear contextual bandits with $O(\sqrt{dT \log T})$ regret bound, where $d$ is the dimension of contexts and $T$ is the time horizon. Our proposed algorithm is equipped with a novel estimator in which exploration is embedded through explicit randomization. Depending on the randomization, our proposed estimator takes contribution either from contexts of all arms or from selected contexts. We establish a self-normalized bound for our estimator, which allows a novel decomposition of the cumulative regret into additive dimension-dependent terms instead of multiplicative terms. We also prove a novel lower bound of $\Omega(\sqrt{dT})$ under our problem setting. Hence, the regret of our proposed algorithm matches the lower bound up to logarithmic factors. The numerical experiments support the theoretical guarantees and show that our proposed method outperforms the existing linear bandit algorithms.
    Multi-Objective Bayesian Optimization over High-Dimensional Search Spaces. (arXiv:2109.10964v4 [cs.LG] UPDATED)
    Many real world scientific and industrial applications require optimizing multiple competing black-box objectives. When the objectives are expensive-to-evaluate, multi-objective Bayesian optimization (BO) is a popular approach because of its high sample efficiency. However, even with recent methodological advances, most existing multi-objective BO methods perform poorly on search spaces with more than a few dozen parameters and rely on global surrogate models that scale cubically with the number of observations. In this work we propose MORBO, a scalable method for multi-objective BO over high-dimensional search spaces. MORBO identifies diverse globally optimal solutions by performing BO in multiple local regions of the design space in parallel using a coordinated strategy. We show that MORBO significantly advances the state-of-the-art in sample efficiency for several high-dimensional synthetic problems and real world applications, including an optical display design problem and a vehicle design problem with 146 and 222 parameters, respectively. On these problems, where existing BO algorithms fail to scale and perform well, MORBO provides practitioners with order-of-magnitude improvements in sample efficiency over the current approach.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v1 [cs.LG])
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level privacy are less suitable as real-world privacy regulations typically concern in-silo data subjects rather than the silos themselves. In this work, we instead consider the more realistic notion of silo-specific item-level privacy, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy, silos are further incentivized to "federate" with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide a thorough empirical study of competing methods as well as a theoretical characterization of MR-MTL for a mean estimation problem, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
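    For concreteness, the mean-regularized multi-task objective referred to here is standardly written as follows (our rendering, with $F_k$ the local objective of silo $k$ and $\lambda$ the coupling strength):

    $\min_{w_1,\dots,w_K} \sum_{k=1}^{K} \Big( F_k(w_k) + \frac{\lambda}{2}\,\lVert w_k - \bar{w} \rVert_2^2 \Big), \qquad \bar{w} = \frac{1}{K}\sum_{k=1}^{K} w_k.$

    Setting $\lambda = 0$ recovers purely local training and $\lambda \to \infty$ recovers a single shared model; intermediate values trade DP noise against cross-silo heterogeneity, which is the interpolation the abstract exploits.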
    Large-Scale Differentiable Causal Discovery of Factor Graphs. (arXiv:2206.07824v1 [stat.ML])
    A common theme in causal inference is learning causal relationships between observed variables, also known as causal discovery. This is usually a daunting task, given the large number of candidate causal graphs and the combinatorial nature of the search space. Perhaps for this reason, most research has so far focused on relatively small causal graphs, with up to hundreds of nodes. However, recent advances in fields like biology enable generating experimental data sets with thousands of interventions followed by rich profiling of thousands of variables, raising the opportunity and urgent need for large causal graph models. Here, we introduce the notion of factor directed acyclic graphs (f-DAGs) as a way to restrict the search space to non-linear low-rank causal interaction models. Combining this novel structural assumption with recent advances that bridge the gap between causal discovery and continuous optimization, we achieve causal discovery on thousands of variables. Additionally, as a model for the impact of statistical noise on this estimation procedure, we study a model of edge perturbations of the f-DAG skeleton based on random graphs and quantify the effect of such perturbations on the f-DAG rank. This theoretical analysis suggests that the set of candidate f-DAGs is much smaller than the whole DAG space and thus more statistically robust in the high-dimensional regime where the underlying skeleton is hard to assess. We propose Differentiable Causal Discovery of Factor Graphs (DCD-FG), a scalable implementation of f-DAG constrained causal discovery for high-dimensional interventional data. DCD-FG uses a Gaussian non-linear low-rank structural equation model and shows significant improvements compared to state-of-the-art methods in both simulations as well as a recent large-scale single-cell RNA sequencing data set with hundreds of genetic interventions.
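    DCD-FG itself handles nonlinear models and interventional data; the sketch below only illustrates the low-rank idea in the simplest linear, observational, NOTEARS-style setting, parameterizing the weighted adjacency as $W = UV^\top$ with the usual acyclicity penalty $h(W) = \mathrm{tr}(e^{W \circ W}) - d$. The toy SEM, step size, and penalty weight are assumptions.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
n, d, r, lr, rho = 500, 30, 5, 0.01, 5.0
mask = 1.0 - np.eye(d)                 # exclude self-loops

# toy data from a sparse linear SEM (upper-triangular => acyclic ground truth)
W_true = np.triu(mask * (rng.uniform(size=(d, d)) < 0.05)
                 * rng.normal(size=(d, d)), k=1)
X = rng.normal(size=(n, d)) @ np.linalg.inv(np.eye(d) - W_true)

U = 0.01 * rng.normal(size=(d, r))
V = 0.01 * rng.normal(size=(d, r))
for _ in range(1000):
    W = mask * (U @ V.T)
    R = X - X @ W
    G = (-2.0 / n) * (X.T @ R)                 # grad of ||X - XW||_F^2 / n
    G += rho * (expm(W * W).T * 2.0 * W)       # grad of h(W) = tr(e^{W∘W}) - d
    G *= mask
    U, V = U - lr * (G @ V), V - lr * (G.T @ U)   # chain rule through W = U V^T

print("acyclicity h(W):", np.trace(expm((mask * (U @ V.T))**2)) - d)
```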
    A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. (arXiv:2111.06784v4 [cs.LG] UPDATED)
    We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail.
    Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation. (arXiv:2206.08366v1 [cs.LG])
Bayesian Optimization (BO) has shown great promise for the global optimization of functions that are expensive to evaluate, but despite many successes, standard approaches can struggle in high dimensions. To improve the performance of BO, prior work suggested incorporating gradient information into a Gaussian process surrogate of the objective, giving rise to kernel matrices of size $nd \times nd$ for $n$ observations in $d$ dimensions. Na\"ively multiplying with (resp. inverting) these matrices requires $\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3)$) operations, which becomes infeasible for moderate dimensions and sample sizes. Here, we observe that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel classes, we derive a programmatic approach to leveraging this type of structure for transformations and combinations of the discussed kernel classes, which constitutes a structure-aware automatic differentiation algorithm. Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels without any additional derivations, enabling flexible, problem-dependent modeling while scaling first-order BO to high $d$.
    Off-Policy Evaluation for Large Action Spaces via Embeddings. (arXiv:2202.06317v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems, since it enables offline evaluation of new policies using only historic log data. Unfortunately, when the number of actions is large, existing OPE estimators -- most of which are based on inverse propensity score weighting -- degrade severely and can suffer from extreme bias and variance. This foils the use of OPE in many applications from recommender systems to language models. To overcome this issue, we propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space. We characterize the bias, variance, and mean squared error of the proposed estimator and analyze the conditions under which the action embedding provides statistical benefits over conventional estimators. In addition to the theoretical analysis, we find that the empirical performance improvement can be substantial, enabling reliable OPE even when existing estimators collapse due to a large number of actions.
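    The marginalization idea is concrete enough to demonstrate on a toy, context-free bandit: instead of weighting each logged reward by $\pi_e(a)/\pi_0(a)$, weight it by the ratio of the embedding distributions the two policies induce. The stochastic embedding map and both policies below are made up, and rewards depend on actions only through the embedding, which is the assumption that makes the marginalized estimator unbiased.

```python
import numpy as np

rng = np.random.default_rng(4)
n_actions, n_emb, n = 1000, 10, 20000

p_e_given_a = rng.dirichlet(np.ones(n_emb), size=n_actions)  # embedding kernel
pi0 = rng.dirichlet(np.ones(n_actions))                      # logging policy
pie = rng.dirichlet(np.ones(n_actions))                      # target policy
q = rng.uniform(size=n_emb)        # reward depends on the action via its embedding

a = rng.choice(n_actions, size=n, p=pi0)                     # logged actions
e = np.array([rng.choice(n_emb, p=p_e_given_a[ai]) for ai in a])
r = rng.binomial(1, q[e])                                    # logged rewards

# vanilla IPS: action-level weights blow up when n_actions is large
ips = np.mean(pie[a] / pi0[a] * r)

# marginalized weights over the (much smaller) embedding space
p_e_pi0 = pi0 @ p_e_given_a        # embedding marginal under the logging policy
p_e_pie = pie @ p_e_given_a        # ... and under the target policy
mips = np.mean(p_e_pie[e] / p_e_pi0[e] * r)

true_v = p_e_pie @ q
print(f"true {true_v:.3f}   IPS {ips:.3f}   marginalized {mips:.3f}")
```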
    Contrasting random and learned features in deep Bayesian linear regression. (arXiv:2203.00573v2 [cs.LG] UPDATED)
    Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display sample-wise double-descent behavior in the presence of label noise. Random feature models can also display model-wise double-descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.
    Towards Robust and Reproducible Active Learning Using Neural Networks. (arXiv:2002.09564v3 [cs.LG] UPDATED)
Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty based, diversity based, and committee based) produce an inconsistent gain over the random sampling baseline. Through a variety of experiments, controlling for sources of stochasticity, we show that variance in the performance metrics achieved by AL algorithms can lead to results that are not consistent with previously reported findings. We also found that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to evaluate results obtained with a new AL algorithm so that they are reproducible and robust under changes in experimental conditions. We open-source our code at https://github.com/PrateekMunjal/TorchAL to facilitate AL evaluations, and we believe our findings and recommendations will help advance reproducible research in AL using neural networks.
    General Cyclical Training of Neural Networks. (arXiv:2202.08835v2 [cs.LG] UPDATED)
This paper describes the principle of "General Cyclical Training" in machine learning, where training starts and ends with "easy training" and the "hard training" happens during the middle epochs. We propose several manifestations for training neural networks, including algorithmic examples (via hyper-parameters and loss functions), data-based examples, and model-based examples. Specifically, we introduce several novel techniques: cyclical weight decay, cyclical batch size, cyclical focal loss, cyclical softmax temperature, cyclical data augmentation, cyclical gradient clipping, and cyclical semi-supervised learning. In addition, we demonstrate that cyclical weight decay, cyclical softmax temperature, and cyclical gradient clipping (as three examples of this principle) improve the test accuracy of a trained model. Furthermore, we discuss model-based examples (such as pretraining and knowledge distillation) from the perspective of general cyclical training and recommend some changes to the typical training methodology. In summary, this paper defines the general cyclical training concept and discusses several specific ways in which this concept can be applied to training neural networks. In the spirit of reproducibility, the code used in our experiments is available at \url{https://github.com/lnsmith54/CFL}.
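    Any of these schedules reduces to a single easy-hard-easy interpolation over epochs. Below is one plausible instantiation with half-cosine ramps; the endpoint values (and which direction counts as "easy") are assumptions, not the paper's settings.

```python
import math

def cyclical(epoch, total_epochs, easy, hard):
    """Interpolate easy -> hard -> easy over training with half-cosine ramps."""
    mid = total_epochs / 2
    t = epoch / mid if epoch <= mid else (total_epochs - epoch) / mid
    return easy + (hard - easy) * 0.5 * (1 - math.cos(math.pi * t))

EPOCHS = 100
for epoch in (0, 25, 50, 75, 100):
    wd = cyclical(epoch, EPOCHS, easy=1e-4, hard=1e-2)   # cyclical weight decay
    temp = cyclical(epoch, EPOCHS, easy=4.0, hard=1.0)   # cyclical softmax temperature
    print(f"epoch {epoch:3d}: weight_decay={wd:.4f}  temperature={temp:.2f}")
```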
    Learning Multi-Task Gaussian Process Over Heterogeneous Input Domains. (arXiv:2202.12636v2 [stat.ML] UPDATED)
Multi-task Gaussian process (MTGP) is a well-known non-parametric Bayesian model for learning correlated tasks effectively by transferring knowledge across tasks. But current MTGPs are usually limited to multi-task scenarios defined over the same input domain, leaving no room for tackling the heterogeneous case, i.e., where the features of the input domains vary over tasks. To this end, this paper presents a novel heterogeneous stochastic variational linear model of coregionalization (\texttt{HSVLMC}) for simultaneously learning the tasks with varied input domains. Particularly, we develop the stochastic variational framework with Bayesian calibration that (i) takes into account the effect of dimensionality reduction raised by domain mappings in order to achieve effective input alignment; and (ii) employs a residual modeling strategy to leverage the inductive bias brought by prior domain mappings for better model inference. Finally, the superiority of the proposed model against existing LMC models has been extensively verified on diverse heterogeneous multi-task cases and a practical multi-fidelity steam turbine exhaust problem.
    Unlocking High-Accuracy Differentially Private Image Classification through Scale. (arXiv:2204.13650v2 [cs.LG] UPDATED)
Differential Privacy (DP) provides a formal privacy guarantee preventing adversaries with access to a machine learning model from extracting information about individual training points. Differentially Private Stochastic Gradient Descent (DP-SGD), the most popular DP training method for deep learning, realizes this protection by injecting noise during training. However, previous works have found that DP-SGD often leads to a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyper-parameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA of 81.4% on CIFAR-10 without extra data under (8, 10^{-5})-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve a remarkable 83.8% top-1 accuracy on ImageNet under (0.5, 8 \cdot 10^{-7})-DP. Additionally, we also achieve 86.7% top-1 accuracy under (8, 8 \cdot 10^{-7})-DP, which is just 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
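    For reference, the DP-SGD mechanism discussed above is per-example gradient clipping followed by Gaussian noise. A minimal numpy sketch on logistic regression (the clip norm C and noise multiplier sigma are placeholder values; a real deployment would also track the privacy budget with an accountant):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) > 0).astype(float)   # synthetic labels
w = np.zeros(d)
C, sigma, lr, batch = 1.0, 1.0, 0.5, 100         # clip norm, noise mult., lr, batch

for step in range(300):
    idx = rng.choice(n, size=batch, replace=False)
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    per_ex = (p - y[idx])[:, None] * X[idx]          # per-example logistic grads
    norms = np.linalg.norm(per_ex, axis=1, keepdims=True)
    clipped = per_ex / np.maximum(1.0, norms / C)    # clip each grad to norm <= C
    noisy = clipped.sum(axis=0) + sigma * C * rng.normal(size=d)  # Gaussian mech.
    w -= lr * noisy / batch

print("train accuracy:", (((X @ w) > 0) == (y > 0.5)).mean())
```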
    User Engagement and Churn in Mobile Health Applications. (arXiv:2206.08178v1 [stat.ML])
    Mobile health apps are revolutionizing the healthcare ecosystem by improving communication, efficiency, and quality of service. In low- and middle-income countries, they also play a unique role as a source of information about health outcomes and behaviors of patients and healthcare workers, while providing a suitable channel to deliver both personalized and collective policy interventions. We propose a framework to study user engagement with mobile health, focusing on healthcare workers and digital health apps designed to support them in resource-poor settings. The behavioral logs produced by these apps can be transformed into daily time series characterizing each user's activity. We use probabilistic and survival analysis to build multiple personalized measures of meaningful engagement, which could serve to tailor content and digital interventions suiting each health worker's specific needs. Special attention is given to the problem of detecting churn, understood as a marker of complete disengagement. We discuss the application of our methods to the Indian and Ethiopian users of the Safe Delivery App, a capacity-building tool for skilled birth attendants. This work represents an important step towards a full characterization of user engagement in mobile health applications, which can significantly enhance the abilities of health workers and, ultimately, save lives.
    Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching. (arXiv:2206.08265v1 [stat.ML])
Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To close this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining the high generation quality.
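    For orientation, here are the first-order objective the paper shows to be insufficient on its own, and the ODE whose likelihood is being trained, in the common notation (which may differ from the paper's):

```latex
% First-order denoising score matching: the score network s_\theta matches
% the score of the perturbation kernel q(x_t | x_0).
\mathcal{J}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, x_t \sim q(x_t \mid x_0)}
    \Bigl[ \lambda(t)\,
      \bigl\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t \mid x_0) \bigr\|_2^2
    \Bigr]
% The probability-flow ("score-based diffusion") ODE used for exact likelihood:
\frac{\mathrm{d}x_t}{\mathrm{d}t}
  = f(x_t, t) - \tfrac{1}{2}\, g(t)^2\, s_\theta(x_t, t).
```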
    Deep Bayesian inference for seismic imaging with tasks. (arXiv:2110.04825v3 [physics.geo-ph] UPDATED)
We propose to use techniques from Bayesian inference and deep neural networks to translate uncertainty in seismic imaging to uncertainty in tasks performed on the image, such as horizon tracking. Seismic imaging is an ill-posed inverse problem because of bandwidth and aperture limitations, and it is further hampered by the presence of noise and linearization errors. Many regularization methods, such as transform-domain sparsity promotion, have been designed to deal with the adverse effects of these errors; however, these methods risk biasing the solution and do not provide information on uncertainty in the image space or on how this uncertainty impacts certain tasks on the image. A systematic approach is proposed to translate uncertainty due to noise in the data to confidence intervals of automatically tracked horizons in the image. The uncertainty is characterized by a convolutional neural network (CNN) and, to assess these uncertainties, samples are drawn from the posterior distribution of the CNN weights, used to parameterize the image. Compared to traditional priors, it is argued in the literature that these CNNs introduce a flexible inductive bias that is a surprisingly good fit for a diverse set of problems. The method of stochastic gradient Langevin dynamics is employed to sample from the posterior distribution. This method is designed to handle large-scale Bayesian inference problems with computationally expensive forward operators, as in seismic imaging. Aside from offering a robust alternative to the maximum a posteriori estimate, which is prone to overfitting, access to these samples allows us to translate uncertainty in the image, due to noise in the data, to uncertainty on the tracked horizons. For instance, it admits estimates for the pointwise standard deviation of the image and for confidence intervals on its automatically tracked horizons.
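    The sampler referenced above is compact enough to show on a toy target. Below is the Langevin update on a correlated 2-D Gaussian (full-gradient version; in stochastic gradient Langevin dynamics the score would be a minibatch estimate, and the step size here is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def grad_log_p(theta):              # score of a correlated Gaussian "posterior"
    return -Sigma_inv @ theta

eps, n_steps = 1e-2, 20000
theta = np.zeros(2)
samples = np.empty((n_steps, 2))
for t in range(n_steps):
    # Welling-Teh update: drift along the score plus injected Gaussian noise
    theta = theta + 0.5 * eps * grad_log_p(theta) + np.sqrt(eps) * rng.normal(size=2)
    samples[t] = theta

print("sample covariance:\n", np.cov(samples[5000:].T).round(2))
```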
    Solving Inverse Problems in Medical Imaging with Score-Based Generative Models. (arXiv:2111.08005v2 [eess.IV] UPDATED)
    Reconstructing medical images from partial measurements is an important inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing solutions based on machine learning typically train a model to directly map measurements to medical images, leveraging a training dataset of paired images and measurements. These measurements are typically synthesized from images using a fixed physical model of the measurement process, which hinders the generalization capability of models to unknown measurement processes. To address this issue, we propose a fully unsupervised technique for inverse problem solving, leveraging the recently introduced score-based generative models. Specifically, we first train a score-based generative model on medical images to capture their prior distribution. Given measurements and a physical model of the measurement process at test time, we introduce a sampling method to reconstruct an image consistent with both the prior and the observed measurements. Our method does not assume a fixed measurement process during training, and can thus be flexibly adapted to different measurement processes at test time. Empirically, we observe comparable or better performance to supervised learning techniques in several medical imaging tasks in CT and MRI, while demonstrating significantly better generalization to unknown measurement processes.
    Tracking Most Significant Arm Switches in Bandits. (arXiv:2112.13838v6 [cs.LG] UPDATED)
In bandits with distribution shifts, one aims to automatically adapt to unknown changes in the reward distribution, and to restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also large aggregate differences in reward over time. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences over time). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.
    Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression. (arXiv:2202.09889v2 [stat.ML] UPDATED)
    We examine the necessity of interpolation in overparameterized models, that is, when achieving optimal predictive risk in machine learning problems requires (nearly) interpolating the training data. In particular, we consider simple overparameterized linear regression $y = X \theta + w$ with random design $X \in \mathbb{R}^{n \times d}$ under the proportional asymptotics $d/n \to \gamma \in (1, \infty)$. We precisely characterize how prediction (test) error necessarily scales with training error in this setting. An implication of this characterization is that as the label noise variance $\sigma^2 \to 0$, any estimator that incurs at least $\mathsf{c}\sigma^4$ training error for some constant $\mathsf{c}$ is necessarily suboptimal and will suffer growth in excess prediction error at least linear in the training error. Thus, optimal performance requires fitting training data to substantially higher accuracy than the inherent noise floor of the problem.
    Pythae: Unifying Generative Autoencoders in Python -- A Benchmarking Use Case. (arXiv:2206.08309v1 [cs.LG])
In recent years, deep generative models have attracted increasing interest due to their capacity to model complex distributions. Among those models, variational autoencoders have gained popularity as they have proven both to be computationally efficient and to yield impressive results in multiple fields. Following this breakthrough, extensive research has been done to improve on the original model, resulting in a variety of different VAE models in response to different tasks. In this paper we present Pythae, a versatile open-source Python library providing both a unified implementation and a dedicated framework allowing straightforward, reproducible and reliable use of generative autoencoder models. We then propose to use this library to perform a case study benchmark where we present and compare 19 generative autoencoder models representative of some of the main improvements on downstream tasks such as image reconstruction, generation, classification, clustering and interpolation. The open-source library can be found at https://github.com/clementchadebec/benchmark_VAE.
    A Tree-based Model Averaging Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources. (arXiv:2103.06261v3 [stat.ML] UPDATED)
Accurately estimating personalized treatment effects within a study site (e.g., a hospital) has been challenging due to limited sample size. Furthermore, privacy considerations and lack of resources prevent a site from leveraging subject-level data from other sites. We propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects (CATE) at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. To the best of our knowledge, there is no established model averaging approach for distributed data with a focus on improving the estimation of treatment effects. Specifically, under distributed data networks, our framework provides an interpretable tree-based ensemble of CATE estimators that joins models across study sites, while actively modeling the heterogeneity in data sources through site partitioning. The performance of this approach is demonstrated by a real-world study of the causal effects of oxygen therapy on hospital survival rate and backed up by comprehensive simulation results.
    Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them. (arXiv:2107.11630v2 [cs.LG] UPDATED)
Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance $\epsilon$ (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance $\epsilon/2$. Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated. To illustrate, we revisit 13 detector defenses. For 11/13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
    LSB: Local Self-Balancing MCMC in Discrete Spaces. (arXiv:2109.03867v3 [cs.AI] UPDATED)
    We present the Local Self-Balancing sampler (LSB), a local Markov Chain Monte Carlo (MCMC) method for sampling in purely discrete domains, which is able to autonomously adapt to the target distribution and to reduce the number of target evaluations required to converge. LSB is based on (i) a parametrization of locally balanced proposals, (ii) a newly proposed objective function based on mutual information and (iii) a self-balancing learning procedure, which minimises the proposed objective to update the proposal parameters. Experiments on energy-based models and Markov networks show that LSB converges using a smaller number of queries to the oracle distribution compared to recent local MCMC samplers.
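    A locally balanced proposal over a discrete neighborhood has the form $q(x' \mid x) \propto g(\pi(x')/\pi(x))$; LSB's contribution is to learn $g$ and the proposal adaptively. The sketch below fixes $g(t) = \sqrt{t}$ and runs a Metropolis-corrected single-bit-flip sampler on a toy energy-based model over $\{0,1\}^d$, which shows the mechanics but none of the learning.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 12
J = 0.5 * rng.normal(size=(d, d))
J = (J + J.T) / 2                    # symmetric couplings

def log_pi(x):                       # toy Ising-like target on {0,1}^d
    s = 2 * x - 1
    return 0.5 * s @ J @ s

def flip(x, i):
    y = x.copy()
    y[i] = 1 - y[i]
    return y

def proposal_probs(x):               # q(x'|x) ∝ sqrt(pi(x')/pi(x)) over bit flips
    lw = np.array([0.5 * (log_pi(flip(x, i)) - log_pi(x)) for i in range(d)])
    w = np.exp(lw - lw.max())
    return w / w.sum()

x = rng.integers(0, 2, size=d)
for _ in range(5000):
    p = proposal_probs(x)
    i = rng.choice(d, p=p)
    y = flip(x, i)
    q_back = proposal_probs(y)[i]    # probability of flipping bit i back
    log_alpha = log_pi(y) - log_pi(x) + np.log(q_back) - np.log(p[i])
    if np.log(rng.uniform()) < log_alpha:   # Metropolis-Hastings correction
        x = y
print("final state:", x)
```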
    An Asymptotic Test for Conditional Independence using Analytic Kernel Embeddings. (arXiv:2110.14868v2 [stat.ML] UPDATED)
    We propose a new conditional dependence measure and a statistical test for conditional independence. The measure is based on the difference between analytic kernel embeddings of two well-suited distributions evaluated at a finite set of locations. We obtain its asymptotic distribution under the null hypothesis of conditional independence and design a consistent statistical test from it. We conduct a series of experiments showing that our new test outperforms state-of-the-art methods both in terms of type-I and type-II errors even in the high dimensional setting.
    On Private Online Convex Optimization: Optimal Algorithms in $\ell_p$-Geometry and High Dimensional Contextual Bandits. (arXiv:2206.08111v1 [cs.LG])
Differentially private (DP) stochastic convex optimization (SCO) is ubiquitous in trustworthy machine learning algorithm design. This paper studies the DP-SCO problem with streaming data that are sampled from a distribution and arrive sequentially. We also consider the continual release model, in which parameters related to private information are updated and released upon the arrival of each new data point, often known as the online setting. Although numerous algorithms have been developed to achieve optimal excess risks in different $\ell_p$ norm geometries, none of the existing ones can be adapted to the streaming and continual release setting. To address this challenge of online convex optimization with privacy protection, we propose a private variant of the online Frank-Wolfe algorithm with recursive gradients for variance reduction, which updates and reveals the parameters upon each data point. Combined with an adaptive differential privacy analysis, our online algorithm achieves in linear time the optimal excess risk when $1<p\leq 2$, and a state-of-the-art excess risk matching the non-private lower bounds when $2<p\leq\infty$. Our algorithm can also be extended to the case $p=1$ to achieve nearly dimension-independent excess risk. While previous variance reduction results on recursive gradients have theoretical guarantees only in the independent and identically distributed sample setting, we establish such a guarantee in a non-stationary setting. To demonstrate the virtues of our method, we design the first DP algorithm for high-dimensional generalized linear bandits with logarithmic regret. Comparative experiments with a variety of DP-SCO and DP-Bandit algorithms exhibit the efficacy and utility of the proposed algorithms.
    Learning Physics between Digital Twins with Low-Fidelity Models and Physics-Informed Gaussian Processes. (arXiv:2206.08201v1 [stat.ML])
A digital twin is a computer model that represents an individual, for example, a component, a patient or a process. In many situations, we want to gain knowledge about an individual from its data while incorporating imperfect physical knowledge, and also to learn from the data of other individuals. In this paper, we introduce and demonstrate a fully Bayesian methodology for learning between digital twins in a setting where the physical parameters of each individual are of interest. For each individual, the methodology is based on Bayesian calibration with model discrepancy. Through the discrepancy, modelled as a Gaussian process, the imperfect low-fidelity physical model is accounted for. Using ideas from Bayesian hierarchical models, a joint probabilistic model of digital twins is constructed by connecting them through a new level in the hierarchy. For the physical parameters, the methodology can be seen as using a prior distribution in the individual model that is the posterior of the corresponding hyperparameter in the joint model. For learning the imperfect physics between individuals, two approaches are introduced: one that assumes the same discrepancy for all individuals, and one that can be seen as using a prior learned from all individuals for the parameters of the Gaussian processes representing the discrepancies. Building on recent advances in physics-informed priors, Hamiltonian Monte Carlo methods, and their use in inverse problems, we set up an inference methodology that keeps our approach computationally feasible even for physical models based on partial differential equations and for individual data that are not aligned. The methodology is demonstrated in two synthetic case studies, a toy example previously used in the literature extended to more individuals and an example based on a cardiovascular differential equation model relevant for the treatment of hypertension.
    Deep Reference Priors: What is the best way to pretrain a model?. (arXiv:2202.00187v2 [stat.ML] UPDATED)
    What is the best way to exploit extra data -- be it unlabeled data from the same task, or labeled data from a related task -- to learn a given task? This paper formalizes the question using the theory of reference priors. Reference priors are objective, uninformative Bayesian priors that maximize the mutual information between the task and the weights of the model. Such priors enable the task to maximally affect the Bayesian posterior, e.g., reference priors depend upon the number of samples available for learning the task and for very small sample sizes, the prior puts more probability mass on low-complexity models in the hypothesis space. This paper presents the first demonstration of reference priors for medium-scale deep networks and image-based data. We develop generalizations of reference priors and demonstrate applications to two problems. First, by using unlabeled data to compute the reference prior, we develop new Bayesian semi-supervised learning methods that remain effective even with very few samples per class. Second, by using labeled data from the source task to compute the reference prior, we develop a new pretraining method for transfer learning that allows data from the target task to maximally affect the Bayesian posterior. Empirical validation of these methods is conducted on image classification datasets. Code is available at https://github.com/grasp-lyrl/deep_reference_priors.
    Generalization Bounds via Convex Analysis. (arXiv:2202.04985v2 [stat.ML] UPDATED)
    Since the celebrated works of Russo and Zou (2016,2019) and Xu and Raginsky (2017), it has been well known that the generalization error of supervised learning algorithms can be bounded in terms of the mutual information between their input and the output, given that the loss of any fixed hypothesis has a subgaussian tail. In this work, we generalize this result beyond the standard choice of Shannon's mutual information to measure the dependence between the input and the output. Our main result shows that it is indeed possible to replace the mutual information by any strongly convex function of the joint input-output distribution, with the subgaussianity condition on the losses replaced by a bound on an appropriately chosen norm capturing the geometry of the dependence measure. This allows us to derive a range of generalization bounds that are either entirely new or strengthen previously known ones. Examples include bounds stated in terms of $p$-norm divergences and the Wasserstein-2 distance, which are respectively applicable for heavy-tailed loss distributions and highly smooth loss functions. Our analysis is entirely based on elementary tools from convex analysis by tracking the growth of a potential function associated with the dependence measure and the loss function.
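    For readers who want the baseline being generalized, the mutual-information bound of Xu and Raginsky (2017) reads, for a $\sigma$-subgaussian loss, $n$ i.i.d. samples $S$, and learned weights $W$:

```latex
% Expected generalization gap bounded by the input-output mutual information:
\bigl| \mathbb{E}\,[\mathrm{gen}(S, W)] \bigr|
  \;\le\; \sqrt{\frac{2\sigma^{2}\, I(S; W)}{n}} .
```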
    Neural net modeling of equilibria in NSTX-U. (arXiv:2202.13915v2 [physics.plasm-ph] UPDATED)
    Neural networks (NNs) offer a path towards synthesizing and interpreting data on faster timescales than traditional physics-informed computational models. In this work we develop two neural networks relevant to equilibrium and shape control modeling, which are part of a suite of tools being developed for the National Spherical Torus Experiment-Upgrade (NSTX-U) for fast prediction, optimization, and visualization of plasma scenarios. The networks include Eqnet, a free-boundary equilibrium solver trained on the EFIT01 reconstruction algorithm, and Pertnet, which is trained on the Gspert code and predicts the non-rigid plasma response, a nonlinear term that arises in shape control modeling. The NNs are trained with different combinations of inputs and outputs in order to offer flexibility in use cases. In particular, Eqnet can use magnetic diagnostics as inputs and act as an EFIT-like reconstruction algorithm, or, by using pressure and current profile information the NN can act as a forward Grad-Shafranov equilibrium solver. This forward-mode version is envisioned to be implemented in the suite of tools for simulation of plasma scenarios. The reconstruction-mode version gives some performance improvements compared to the online reconstruction code real-time EFIT (RTEFIT), especially when vessel eddy currents are significant. We report strong performance for all NNs indicating that the models could reliably be used within closed-loop simulations or other applications. Some limitations are discussed.
    BYOL-Explore: Exploration by Bootstrapped Prediction. (arXiv:2206.08332v1 [cs.LG])
We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
    The dynamics of representation learning in shallow, non-linear autoencoders. (arXiv:2201.02115v2 [stat.ML] UPDATED)
    Autoencoders are the simplest neural network for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations - a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders on realistic datasets such as CIFAR10.
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v1 [cs.LG])
Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as brain-computer interface (BCI) or human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Multiscale methods for signal selection in single-cell data. (arXiv:2206.07760v1 [q-bio.QM])
    Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically-motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores ($\mathrm{eig}_i$) rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing separation of genes with different roles in a bifurcation process (e.g. pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.
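    The eigenscore construction is the most self-contained of the three and can be sketched directly: build a cell-cell graph, take the low-frequency eigenvectors of its Laplacian, and rank genes by how much of their variance lies in that low-frequency subspace. The data below are random stand-ins, and the exact normalization is an assumption rather than the paper's definition.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(8)
n_cells, n_genes, k, m = 300, 50, 10, 5
coords = rng.normal(size=(n_cells, 2))          # stand-in cell embedding
expr = rng.normal(size=(n_cells, n_genes))      # stand-in expression matrix

# symmetric kNN adjacency and unnormalized graph Laplacian
_, idx = cKDTree(coords).query(coords, k=k + 1)
A = np.zeros((n_cells, n_cells))
for i, nbrs in enumerate(idx):
    A[i, nbrs[1:]] = 1                          # skip self (first neighbor)
A = np.maximum(A, A.T)
L = np.diag(A.sum(1)) - A

# low-frequency eigenvectors of L; score = each gene's energy in that subspace
vals, vecs = np.linalg.eigh(L)
low = vecs[:, 1:m + 1]                          # drop the constant eigenvector
centered = expr - expr.mean(0)
proj = low.T @ centered
eigenscore = (proj**2).sum(0) / (centered**2).sum(0)
print("top genes by eigenscore:", np.argsort(-eigenscore)[:5])
```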
    FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. (arXiv:2201.12740v3 [cs.LG] UPDATED)
Although Transformer-based methods have significantly improved state-of-the-art results for long-term series forecasting, they are not only computationally expensive but, more importantly, unable to capture the global view of time series (e.g. overall trend). To address these problems, we propose to combine Transformer with the seasonal-trend decomposition method, in which the decomposition method captures the global profile of time series while Transformers capture more detailed structures. To further enhance the performance of Transformer for long-term prediction, we exploit the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier basis, and develop a frequency enhanced Transformer. Besides being more effective, the proposed method, termed Frequency Enhanced Decomposed Transformer ({\bf FEDformer}), is more efficient than the standard Transformer, with complexity linear in the sequence length. Our empirical studies with six benchmark datasets show that compared with state-of-the-art methods, FEDformer can reduce prediction error by $14.8\%$ and $22.6\%$ for multivariate and univariate time series, respectively. Code is publicly available at https://github.com/MAZiqing/FEDformer.
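    The two ingredients are easy to show in isolation on a synthetic series: a moving-average seasonal-trend split, and a frequency-enhanced step that keeps only a few dominant Fourier modes. The sketch below omits the Transformer itself; the window size and number of kept modes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
T = 256
t = np.arange(T)
x = 0.01 * t + np.sin(2 * np.pi * t / 24) + 0.2 * rng.normal(size=T)

# seasonal-trend decomposition via a moving average
win = 25
pad = np.pad(x, (win // 2, win // 2), mode="edge")
trend = np.convolve(pad, np.ones(win) / win, mode="valid")
seasonal = x - trend

# frequency-enhanced idea: keep only the k dominant Fourier modes
k = 5
F = np.fft.rfft(seasonal)
keep = np.argsort(-np.abs(F))[:k]
F_sparse = np.zeros_like(F)
F_sparse[keep] = F[keep]
seasonal_hat = np.fft.irfft(F_sparse, n=T)

print("seasonal reconstruction MSE:", np.mean((seasonal - seasonal_hat)**2).round(4))
```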
    Equivariant Diffusion for Molecule Generation in 3D. (arXiv:2203.17003v2 [cs.LG] UPDATED)
    This work introduces a diffusion model for molecule generation in 3D that is equivariant to Euclidean transformations. Our E(3) Equivariant Diffusion Model (EDM) learns to denoise a diffusion process with an equivariant network that jointly operates on both continuous (atom coordinates) and categorical features (atom types). In addition, we provide a probabilistic analysis which admits likelihood computation of molecules using our model. Experimentally, the proposed method significantly outperforms previous 3D molecular generative methods regarding the quality of generated samples and efficiency at training time.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v1 [cs.LG])
Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. In this paper, we propose an end-to-end learning framework that couples the partitioning (one key step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the key limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given partition of the data space, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. More broadly, our unsupervised partitioning approach is a promising alternative to many widely used clustering methods like K-means clustering and DBSCAN.
    Robustness and Accuracy Could Be Reconcilable by (Proper) Definition. (arXiv:2202.10103v2 [cs.LG] UPDATED)
    The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models. Code is available at https://github.com/P2333/SCORE.
    Learning with little mixing. (arXiv:2206.08269v1 [cs.LG])
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.
    Conformal prediction set for time-series. (arXiv:2206.07851v1 [stat.ML])
When building either prediction intervals for regression (with real-valued response) or prediction sets for classification (with categorical responses), uncertainty quantification is essential to studying complex machine learning methods. In this paper, we develop Ensemble Regularized Adaptive Prediction Set (ERAPS) to construct prediction sets for time-series (with categorical responses), based on the prior work of [Xu and Xie, 2021]. In particular, we allow unknown dependencies to exist within features and responses that arrive in sequence. Method-wise, ERAPS is a distribution-free and ensemble-based framework that is applicable for arbitrary classifiers. Theoretically, we bound the coverage gap without assuming data exchangeability and show asymptotic set convergence. Empirically, we demonstrate valid marginal and conditional coverage by ERAPS, which also tends to yield smaller prediction sets than competing methods.
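    For readers new to prediction sets, here is the basic split-conformal construction for classification that ERAPS builds on, in its exchangeable form; ERAPS's ensembling and its handling of temporal dependence are not shown, and the perfectly calibrated classifier below is a simulation convenience.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cal, n_classes, alpha = 500, 5, 0.1

# stand-ins for a fitted classifier's probabilities on held-out calibration data
probs_cal = rng.dirichlet(np.ones(n_classes) * 2, size=n_cal)
y_cal = np.array([rng.choice(n_classes, p=p) for p in probs_cal])

# nonconformity score: 1 - probability assigned to the true class
scores = 1.0 - probs_cal[np.arange(n_cal), y_cal]
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

def prediction_set(p_test):
    # all classes whose score passes the calibrated threshold
    return np.where(1.0 - p_test <= q)[0]

p_new = rng.dirichlet(np.ones(n_classes) * 2)
print("prediction set:", prediction_set(p_new))
```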
    OmniMAE: Single Model Masked Pretraining on Images and Videos. (arXiv:2206.08356v1 [cs.CV])
Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work has studied these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. In particular, our single pretrained model can be finetuned to achieve 86.5% on ImageNet and 75.3% on the challenging Something Something-v2 video benchmark. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v1 [cs.LG])
Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    Towards Understanding How Machines Can Learn Causal Overhypotheses. (arXiv:2206.08353v1 [cs.LG])
Recent work in machine learning and cognitive science has suggested that understanding causal information is essential to the development of intelligence. The extensive literature in cognitive science using the ``blicket detector'' environment shows that children are adept at many kinds of causal inference and learning. We propose to adapt that environment for machine learning agents. One of the key challenges for current machine learning algorithms is modeling and understanding causal overhypotheses: transferable abstract hypotheses about sets of causal relationships. In contrast, even young children spontaneously learn and use causal overhypotheses. In this work, we present a new benchmark -- a flexible environment which allows for the evaluation of existing techniques under variable causal overhypotheses -- and demonstrate that many existing state-of-the-art methods have trouble generalizing in this environment. The code and resources for this benchmark are available at https://github.com/CannyLab/casual_overhypotheses.
    Deep Neural Imputation: A Framework for Recovering Incomplete Brain Recordings. (arXiv:2206.08094v1 [cs.LG])
Neuroscientists and neuroengineers have long relied on multielectrode neural recordings to study the brain. However, in a typical experiment, many factors corrupt neural recordings from individual electrodes, including electrical noise, movement artifacts, and faulty manufacturing. Currently, common practice is to discard these corrupted recordings, reducing already limited data that is difficult to collect. To address this challenge, we propose Deep Neural Imputation (DNI), a framework to recover missing values from electrodes by learning from data collected across spatial locations, days, and participants. We explore our framework with a linear nearest-neighbor approach and two deep generative autoencoders, demonstrating DNI's flexibility. One deep autoencoder models participants individually, while the other extends this architecture to model many participants jointly. We evaluate our models across 12 human participants implanted with multielectrode intracranial electrocorticography arrays; participants had no explicit task and behaved naturally across hundreds of recording hours. We show that DNI recovers not only time series but also frequency content, and further establish DNI's practical value by recovering significant performance on a scientifically-relevant downstream neural decoding task.
    Functional Output Regression with Infimal Convolution: Exploring the Huber and $\epsilon$-insensitive Losses. (arXiv:2206.08220v1 [stat.ML])
The focus of the paper is functional output regression (FOR) with convoluted losses. While most existing work consider the square loss setting, we leverage extensions of the Huber and the $\epsilon$-insensitive loss (induced by infimal convolution) and propose a flexible framework capable of handling various forms of outliers and sparsity in the FOR family. We derive computationally tractable algorithms relying on duality to tackle the resulting tasks in the context of vector-valued reproducing kernel Hilbert spaces. The efficiency of the approach is demonstrated and contrasted with the classical squared loss setting on both synthetic and real-world benchmarks.
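    For reference, the two losses in question on a scalar residual $r$ (the paper's infimal-convolution construction recovers vector-valued analogues of these from the square loss):

```latex
% Huber loss (quadratic near zero, linear in the tails) and the
% epsilon-insensitive loss (zero inside a tube of half-width epsilon):
\ell_\delta^{\mathrm{Huber}}(r) =
  \begin{cases}
    \tfrac{1}{2} r^2, & |r| \le \delta,\\[2pt]
    \delta |r| - \tfrac{1}{2}\delta^2, & |r| > \delta,
  \end{cases}
\qquad
\ell_\epsilon(r) = \max\bigl(0,\, |r| - \epsilon\bigr).
```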
    Continuous-Time Modeling of Counterfactual Outcomes Using Neural Controlled Differential Equations. (arXiv:2206.08311v1 [cs.LG])
Estimating counterfactual outcomes over time has the potential to unlock personalized healthcare by assisting decision-makers to answer ``what if'' questions. Existing causal inference approaches typically consider regular, discrete-time intervals between observations and treatment decisions and hence are unable to naturally model irregularly sampled data, which is the common setting in practice. To handle arbitrary observation patterns, we interpret the data as samples from an underlying continuous-time process and propose to model its latent trajectory explicitly using the mathematics of controlled differential equations. This leads to a new approach, the Treatment Effect Neural Controlled Differential Equation (TE-CDE), that allows the potential outcomes to be evaluated at any time point. In addition, adversarial training is used to adjust for time-dependent confounding, which is critical in longitudinal settings and is an added challenge not encountered in conventional time-series. To assess solutions to this problem, we propose a controllable simulation environment based on a model of tumor growth for a range of scenarios with irregular sampling reflective of a variety of clinical scenarios. TE-CDE consistently outperforms existing approaches in all simulated scenarios with irregular sampling.
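    The underlying object is a controlled differential equation in the usual neural-CDE form, where $X$ is a continuous path interpolating the irregularly sampled observations, so the latent state is driven by the data rather than updated at fixed time steps:

```latex
% Latent state z evolves under the control path X via a learned vector field:
z_t \;=\; z_{t_0} + \int_{t_0}^{t} f_\theta(z_s)\,\mathrm{d}X_s ,
\qquad t \in (t_0, t_1] .
```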
    Applications of Machine Learning to the Identification of Anomalous ER Claims. (arXiv:2206.08093v1 [cs.LG])
Improper health insurance payments resulting from fraud and upcoding result in tens of billions of dollars in excess health care costs annually in the United States, motivating machine learning researchers to build anomaly detection models for health insurance claims. This article describes two such strategies specifically for ER claims. The first is an upcoding model based on severity code distributions, stratified by hierarchical diagnosis code clusters. A statistically significant difference in mean upcoding anomaly scores is observed between free-standing ERs and acute care hospitals, with free-standing ERs being more anomalous. The second model is a random forest that minimizes improper payments by optimally sorting ER claims within review queues. Depending on the percentage of claims reviewed, the random forest saved 12% to 40% above a baseline approach that prioritized claims by billed amount.
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v1 [math.NA])
Given a nonnegative matrix $R$ and a factorization rank $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. We provide a mathematically rigorous treatment of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. We prove it is stronger than the restricted DBU theorem, provided a proper preprocessing of the Exact NMF is used. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.
    On the well-spread property and its relation to linear regression. (arXiv:2206.08092v1 [cs.LG])
    We consider the robust linear regression model $\boldsymbol{y} = X\beta^* + \boldsymbol{\eta}$, where an adversary oblivious to the design $X \in \mathbb{R}^{n \times d}$ may choose $\boldsymbol{\eta}$ to corrupt all but a (possibly vanishing) fraction of the observations $\boldsymbol{y}$ in an arbitrary way. Recent work [dLN+21, dNS21] has introduced efficient algorithms for consistent recovery of the parameter vector. These algorithms crucially rely on the design matrix being well-spread (a matrix is well-spread if its column span is far from any sparse vector). In this paper, we show that there exists a family of design matrices lacking well-spreadness such that consistent recovery of the parameter vector in the above robust linear regression model is information-theoretically impossible. We further investigate the average-case time complexity of certifying well-spreadness of random matrices. We show that it is possible to efficiently certify whether a given $n$-by-$d$ Gaussian matrix is well-spread if the number of observations is quadratic in the ambient dimension. We complement this result by showing rigorous evidence -- in the form of a lower bound against low-degree polynomials -- of the computational hardness of this same certification problem when the number of observations is $o(d^2)$.  ( 2 min )
    On Error and Compression Rates for Prototype Rules. (arXiv:2206.08014v1 [cs.LG])
    We study the close interplay between error and compression in the non-parametric multiclass classification setting in terms of prototype learning rules. We focus in particular on a close variant of a recently proposed compression-based learning rule termed OptiNet. Beyond its computational merits, this rule has been recently shown to be universally consistent in any metric instance space that admits a universally consistent rule -- the first learning algorithm known to enjoy this property. However, its error and compression rates have been left open. Here we derive such rates in the case where instances reside in Euclidean space under commonly posed smoothness and tail conditions on the data distribution. We first show that OptiNet achieves non-trivial compression rates while enjoying near minimax-optimal error rates. We then proceed to study a novel general compression scheme for further compressing prototype rules that locally adapts to the noise level without sacrificing accuracy. Applying it to OptiNet, we show that under a geometric margin condition, further gain in the compression rate is achieved. Experimental results comparing the performance of the various methods are presented.  ( 2 min )
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v1 [cs.LG])
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity or Independent Influences. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.  ( 2 min )

  • Open

    [P] Bring Your Own Device (BYOD) DS platform idea
    I am working on a side project called byod-hub (BYOD = Bring Your Own Device) to let people pool multiple servers (they own) to form a DS platform based on Jupyterhub in minutes. I think this might be useful to let small-to-mid-sized DS teams better utilize their computing resources (e.g., if you have multiple GPU workstations and rely on assigning each one to people to SSH onto, this might be for you) by pooling them and running a service like Jupyterhub on top as a unified entry point for notebook-based work. Addons like MLFlow and Kubeflow can be added with a single click as well once the platform is up. I would like to hear comments and suggestions from the community. Do you find this potentially useful? Or how should this be built, in your opinion? The general workflow to form such a platform is like this. A control plane service (that only handles orchestration of computing resources) is first started on one computer (or it can be a hosted service):

        $ byod-hub control-plane start
        [INFO] The control plane is starting
        [INFO] The control plane is served at https://192.168.2.100

        # get the command to register a node
        $ byod-hub control-plane get-join-command
        [INFO] To join, run the following from a node
        [INFO] byod-hub node join --url 192.168.2.100 --token 233asdasd343645gf

    Then one can run the following command on their own server to register it to the control plane:

        $ byod-hub node join --url 192.168.2.100 --token 233asdasd343645gf
        [INFO] Registering node to control plane at 192.168.2.100
        [INFO] Registration finished

    After that, one can visit the URL of the control plane (https://192.168.2.100) to start using a Jupyterhub service to request Jupyter instances. The user workloads will be scheduled to run on users' registered nodes. submitted by /u/dayeye2006 [link] [comments]  ( 1 min )
    [D] The current multi-agent reinforcement learning research is NOT multi-agent or reinforcement learning.
    What is usually considered multi-agent reinforcement learning is neither multi-agent nor reinforcement learning. Consider the most successful example: OpenAI Five plays 180 years' worth of games against itself every day, learning via self-play. This is not multi-agent reinforcement learning! Reason it is not multi-agent: there is only one agent, the computer itself. In much of so-called multi-agent reinforcement learning, the computer is competing against itself. That's like saying that if you played chess against yourself, moving the black and white pieces alternately, you were competing against an opponent. This is completely bonkers. For humans, games such as League of Legends are multi-agent, because the definition of agent is human and each human is independently controlling …  ( 4 min )
    [P] Local Hierarchical Classification Library
    Hi everyone, I am developing an open-source library to facilitate building local hierarchical classifiers in Python. The library, named HiClass (https://arxiv.org/abs/2112.06560), is compatible with scikit-learn's API. Hierarchies occur naturally in many problems, but often are not explored when building classifiers. However, exploiting the hierarchical information in the data usually improves predictive performance. For example, the table below compares the local hierarchical classifiers implemented in HiClass with Microsoft's LightGBM on a consumer complaints dataset, where we can clearly see an improvement in the F-score.

        Classifier                        Training Time (hh:mm:ss)  Memory Usage (GB)  Disk Usage (MB)  F-score
        Local Classifier per Parent Node  00:24:52                  3.91               77               0.7279
        Local Classifier per Node         00:30:39                  5.41               312              0.7551
        Local Classifier per Level        01:36:33                  3.86               37               0.5413
        Flat Classifier                   00:23:54                  4.36               13               0.4303

    Hierarchical data typically comes in the shape of trees or directed acyclic graphs. For instance, the image below displays a music genre classification hierarchy, which is a well-known example of hierarchical data. [Figure: music genre hierarchy] Of course, there are multiple other problems where hierarchical classification can be applied, e.g., text categorization, taxonomic classification, etc. Installation instructions and documentation are available on GitHub: https://github.com/mirand863/hiclass PS: I am also looking for contributors who would like to join an open-source project. submitted by /u/Brilliant_Half8082 [link] [comments]  ( 1 min )
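    For readers who want to try it, a minimal sketch of the scikit-learn-style workflow, assuming the API shown in the HiClass README; the toy features and two-level hierarchy below are made up for illustration.

        from sklearn.linear_model import LogisticRegression
        from hiclass import LocalClassifierPerParentNode

        # Each label is a path from the root of the hierarchy to a leaf.
        X_train = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1], [0.2, 0.8]]
        y_train = [["Rock", "Punk"], ["Jazz", "Bebop"], ["Rock", "Metal"], ["Jazz", "Swing"]]

        # One base classifier is trained for every parent node in the hierarchy.
        clf = LocalClassifierPerParentNode(local_classifier=LogisticRegression())
        clf.fit(X_train, y_train)
        print(clf.predict([[0.9, 0.1]]))  # e.g. a full path such as ["Rock", "Metal"]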
    [D] Models or models-as-a-service (paid) for summarization from long-form 'dense domain' texts
    There exist numerous models (paper + repo) and 'models-as-a-service' (paid implementations of said models, made available via an API or other interface) to create summaries of text. https://tldrthis.com/ is one. SMMRY, the summarizing bot used on Reddit, is another: https://smmry.com/ There are also many e-discovery startups which ingest hundreds of thousands of pages of legal documents and surface materials to lawyers working through the discovery stage of legal processes. I'm wondering about a text summarization model (either a paper + repo or a paid service) that summarizes single legal documents into non-legalese. For example, the Supplemental Complaint document on this page - https://predatorystudentlending.org/cases/sweet-v-devos/#Sweet-documents - would be an interesting document to summarize: https://predatorystudentlending.org/wp-content/uploads/2021/03/192.pdf Since the document is 597 pages long, however, I haven't had success in using SMMRY, TLDRthis, etc. to generate useful summaries. Question: Can anyone point me in the direction of useful models for long-form (a few hundred pages) document summarization in particular domains? Compared to the task 'summarize a Dan Brown book that is X hundred pages long, with a Flesch–Kincaid score of 98 (US 5th-grade level)', the task of summarizing a multi-hundred-page legal document or a 100-plus-page dissertation on a deeply technical topic is another animal entirely. Question part II: Does anyone have any interesting strategies for 'old school' topic modeling - LDA + something else - in 'dense' domain-specific literature? Or how about newer techniques (anything to do with Transformers, say) in conjunction with some old-school techniques for content summarization? submitted by /u/datachomper [link] [comments]  ( 1 min )
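    On the first question: one common workaround for documents far beyond any model's context window is hierarchical ("map-reduce") summarization: split the document into chunks, summarize each chunk, then summarize the concatenated summaries. A rough sketch with Hugging Face transformers; the model choice and chunk size are illustrative, and for a 597-page filing the reduce step may itself need to be applied recursively.

        from transformers import pipeline

        summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

        def summarize_long(text, chunk_chars=3000):
            # Map: summarize chunks small enough for the model's context window.
            chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
            partials = [summarizer(c, max_length=150, min_length=30)[0]["summary_text"]
                        for c in chunks]
            # Reduce: summarize the concatenated partial summaries.
            return summarizer(" ".join(partials), max_length=200,
                              min_length=50)[0]["summary_text"]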
    [P] I built a project for a non-programmer researcher who wanted to do everything from data collection to model building, and I open-sourced it.
    I once worked with a researcher who wanted to collect some Reddit data related to a particular topic and train a machine learning model with it. I realised how difficult it is for non-programmers to get into building machine learning models for such use cases, so I decided to shape the project myself, and I open-sourced it. Supports: text data and image data. The project does everything in just two steps. Execution is as simple as this: 1. Make a config file with your required details of input. 2. Run the API in a single line with the config passed as input. Here's the link to the project: https://github.com/nfflow/redditflow/ submitted by /u/metalvendetta [link] [comments]  ( 1 min )
    🏘️ ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [R]
    submitted by /u/matt-deitke [link] [comments]
    [R] RWKV-2 430M release (a parallelizable RNN with transformer-level LM performance, and without using attention)
    Hi everyone. I posted about my RWKV-2 RNN 1 month ago (thanks for the upvote): https://www.reddit.com/r/MachineLearning/comments/umq908/r_rwkvv2rnn_a_parallelizable_rnn_with/ And I have finished the training of a RWKV-2 430M (L24-D1024) on the Pile. It's confirmed that a pure RNN without attention can reach transformer-level LM (language modeling) performance: [image: Pile LM benchmark results] RWKV-2 supports both sequential & parallel mode in inference and training. So it combines the best of RNNs and transformers - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding. You can download the params & fine-tuning code here: https://github.com/BlinkDL/RWKV-v2-RNN-Pile Now I am training a RWKV-2 1.5B (L24-D2048), which is expected to finish in 2 months :) https://wandb.ai/blinkdl/RWKV-v2-RNN-Pile The math behind RWKV-2: [image: RWKV-2 update equations] submitted by /u/bo_peng [link] [comments]  ( 2 min )
    [D] Any way to validate the performance of component models in a T-learner? (CausalML Python)
    So, I'm running into the problem of wanting to validate the performance of each of the models that compose our T-learner. I'm aware this doesn't validate the effectiveness of the model itself but I'm trying to diagnose issues and want to see if each of the component models is predicting the control/treatment effect accurately. I'm thinking I may just have to write my own T-learner script because I don't see any way to do this in CausalML but that shouldn't be too difficult. Just wanted to check if any of y'all knew how to do this before embarking on that journey. submitted by /u/StixTheNerd [link] [comments]  ( 1 min )
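    One way to sanity-check the component models without the CausalML wrappers is to refit the two outcome models by hand and score each on a held-out slice of its own arm; a minimal sketch, where X are covariates, w is the binary treatment indicator, and y are outcomes (the base learner is an illustrative choice).

        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.metrics import mean_squared_error
        from sklearn.model_selection import train_test_split

        def fit_and_score_arm(X, y):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
            model = GradientBoostingRegressor().fit(X_tr, y_tr)
            return model, mean_squared_error(y_te, model.predict(X_te))

        def validate_t_learner(X, w, y):
            mu0, mse0 = fit_and_score_arm(X[w == 0], y[w == 0])  # control-outcome model
            mu1, mse1 = fit_and_score_arm(X[w == 1], y[w == 1])  # treatment-outcome model
            print(f"control-arm MSE: {mse0:.4f}, treatment-arm MSE: {mse1:.4f}")
            return mu0, mu1  # CATE estimate: mu1.predict(X_new) - mu0.predict(X_new)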
    [D] 3D Attention Module
    Hi, I am working on a classification task on 3D MRI where I want to combine a mask and a raw MRI. Basically, the model must have 2 input channels, one for the MRI and one for its mask. Where should I start? Are there any implemented models I can use? submitted by /u/grisp98 [link] [comments]  ( 2 min )
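    One common starting point is to stack the raw volume and its mask as two input channels of a 3D CNN; a minimal PyTorch sketch, with all shapes and layer sizes illustrative.

        import torch
        import torch.nn as nn

        class TwoChannel3DNet(nn.Module):
            def __init__(self, num_classes=2):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv3d(2, 16, kernel_size=3, padding=1),  # 2 channels: MRI + mask
                    nn.ReLU(),
                    nn.MaxPool3d(2),
                    nn.Conv3d(16, 32, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.AdaptiveAvgPool3d(1),
                )
                self.classifier = nn.Linear(32, num_classes)

            def forward(self, mri, mask):
                x = torch.cat([mri, mask], dim=1)  # (B, 2, D, H, W)
                return self.classifier(self.features(x).flatten(1))

        # e.g. logits = TwoChannel3DNet()(torch.randn(1, 1, 64, 64, 64),
        #                                 torch.randn(1, 1, 64, 64, 64))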
    [D] What object detectors have the capability to harness relationship between its detected boxes?
    Typical object detectors do not exploit relationships among their detected boxes; no context is involved. In my problem's case, there are two requirements where some form of context across detected boxes would lead to drastically better results. Requirement #1: It is a multi-class, but single-label problem. There are N classes, but each class can appear a minimum of 0 and a maximum of 1 times. Hence, the model kind of needs to know whether the other detections have already predicted something. Requirement #2: There is some form of ordering among the predictions based on their proximity to each other. For example, Class 4 should only appear near Classes 5-6 and Classes 2-3, but should not be anywhere near Class 32. Is there any architecture optimized for this kind of object detection? submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
    [D] What is the best way to manage GPU server for multi-users?
    I'm managing the on-prem GPU server at my workplace. We are using docker containers (we wrote our own container management system), but there are always lots of issues, since people have to learn how to use docker properly and there are always little problems with versioning and permissions. What are you using to manage your GPU cluster? Would simply using a conda env for each user be more efficient? We also tried Slurm, but the queue time was not optimal for everyone's work and research. submitted by /u/leboulevardier [link] [comments]  ( 2 min )
    [D] Anti-aliasing techniques or functions for segmentation masks
    What techniques or functions can I use to smooth out segmentation mask edges? submitted by /u/sarmientoj24 [link] [comments]  ( 1 min )
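    One simple recipe is to blur the binary mask and either re-threshold it (cleaner but still hard edges) or keep the soft values for anti-aliased compositing; a sketch with OpenCV, where the kernel size and threshold are illustrative choices.

        import cv2
        import numpy as np

        def smooth_mask(mask, ksize=7):
            # Blur the 0/1 mask, then re-threshold; keeping `blurred` itself
            # (values in [0, 1]) gives soft, anti-aliased edges instead.
            blurred = cv2.GaussianBlur(mask.astype(np.float32), (ksize, ksize), 0)
            return (blurred > 0.5).astype(np.uint8)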
    [P] Pythae - Unifying generative autoencoder implementations in Python
    After 8 months of long coding nights ☕ we finally officially release Pythae 🥳, a Python library unifying generative autoencoder implementations, including VAEGAN 🥗, VQVAE, and RAEs. I hope you will enjoy it! 🖥️ github repo: https://github.com/clementchadebec/benchmark_VAE 👉paper: https://arxiv.org/abs/2206.08309 submitted by /u/cchad-8 [link] [comments]  ( 1 min )
    [D] How to find an intuitive article for the future research
    After working in an area for more than 2 years, I am still not confident about how to recognize a promising research paper that could ignite the next stage of my Ph.D. journey. Some people suggest following specific individuals or organizations (corporate or academia). In my opinion, following specific individuals or organizations can be inefficient or boring. One thing common to both is that they hold back code releases until they have squeezed all the juice out of a result. After the code release, we poor Ph.D. students are left only making ridiculous GIFs for ML Twitter, because there is nothing left for us. Should we keep in mind the beautiful results OR the future perspective of a research paper? One example is Ian Goodfellow's GANs paper: the results were not that polished, but there was a future that everyone perceived. Winding up my post: which factors should we keep in mind when choosing a paper? submitted by /u/Lunch_More [link] [comments]  ( 1 min )
    [D] Is anyone working on interesting ML libraries and looking for contributors?
    Hey all, I've been looking around for a potential open-source project to contribute to (any language will do) and while I have some repos on my watchlist, I'm still not committed to any one in particular, so I thought that I should reach out to the community and see if anyone's in the early stages of developing something useful that I (or perhaps other readers) may be able to contribute to. Thanks :) submitted by /u/de1pher [link] [comments]  ( 1 min )
    [R] Sponge Examples: Energy-Latency Attacks on Neural Networks
    Abstract: The high energy costs of neural network training and inference led to the use of acceleration hardware such as GPUs and TPUs. While such devices enable us to train large-scale neural networks in datacenters and deploy them on edge devices, their designers' focus so far is on average-case performance. In this work, we introduce a novel threat vector against neural networks whose energy consumption or decision latency are critical. We show how adversaries can exploit carefully-crafted sponge examples, which are inputs designed to maximise energy consumption and latency, to drive machine learning (ML) systems towards their worst-case performance. Sponge examples are, to our knowledge, the first denial-of-service attack against the ML components of such systems. We mount two variants of our sponge attack on a wide range of state-of-the-art neural network models, and find that language models are surprisingly vulnerable. Sponge examples frequently increase both latency and energy consumption of these models by a factor of 30×. Extensive experiments show that our new attack is effective across different hardware platforms (CPU, GPU and an ASIC simulator) on a wide range of different language tasks. On vision tasks, we show that sponge examples can be produced and a latency degradation observed, but the effect is less pronounced. To demonstrate the effectiveness of sponge examples in the real world, we mount an attack against Microsoft Azure's translator and show an increase of response time from 1ms to 6s (6000×). We conclude by proposing a defense strategy: shifting the analysis of energy consumption in hardware from an average-case to a worst-case perspective. Link: https://ieeexplore.ieee.org/document/9581273 submitted by /u/bikeskata [link] [comments]  ( 1 min )
    [D] The banana-pineapple game: a Turing test that conversation bots like LaMDA (probably) won't be able to pass
    I'm sure you all saw the recent news about a Google employee suggesting their LaMDA AI was sentient (based on conversational exchanges like these). Experts have generally dismissed this claim, and rightly so. Conversational AI systems are designed to use language in a way that sounds human, whereas our human brains select linguistic responses to solve much more complex problems, with objectives such as meeting our physical or emotional needs. Still, I think it's interesting to ask how one could demonstrate, by testing only verbal responses to verbal input (rather than examining its code or hardware) that such conversational AIs aren't sentient -- and in particular, whether such a test can be made robust against future improvements to the system. That is, generic future improvements to th…  ( 5 min )
  • Open

    Interview with an AI Safety Researcher about his life, career and AGI/Superintelligence - I think this community may enjoy it! (Consider subscribing to see another similar convo soon!) :)
    submitted by /u/joemurray1994 [link] [comments]
    Cleanup
    Made a short video showing how I clean up some of my images. Yes, I cheat a bit and post-edit :) https://www.youtube.com/watch?v=jYZlOVG54eI [five image previews omitted] submitted by /u/prfitofthesngularity [link] [comments]
    Do we know if Google is indexing all of these DALL-E Mini images?
    Obviously DALL-E Mini has taken off in the past few days, and who knows how many million ridiculous new images have been created. Since it is "Powered by Google TPU Research Cloud," does it seem likely that Google is indexing all of these new DALL-E Mini images? I ask because I just ran the prompt "Painting by [Artist X]" – where [Artist X] was a 20th-century modern artist, slightly well known but not a household name like Warhol or Rothko. DALL-E Mini returned some great images ... not actual images by [Artist X], but they look like they could be. I was kind of delighted and ran the same prompt several times, and it returned different new images. I did not share any of these images on Twitter or social media. But now I wonder ... will Google index these new DALL-E images as actual paintings by [Artist X], when you do a Google Image search for them? I like this artist a lot and don't want to mess up their online reputation! submitted by /u/UltraFinePointMarker [link] [comments]  ( 2 min )
    Is there an app store for ai software?
    Would like to browse AI software for specific categories. submitted by /u/ComfyHikiandNeet [link] [comments]  ( 1 min )
    In this article, you'll discover how to deploy a serverless spaCy transformer model using AWS Lambda.
    submitted by /u/UBIAI [link] [comments]
    Lessons from the GPT-4Chan Controversy
    submitted by /u/estasfuera [link] [comments]
    HOLY MAC IT'S A SPACE ESCAPADE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Stanford AI Researchers Propose ‘FOCUS’: A Foundation Model Which Aims to Achieve Perfect Secrecy For Personal Tasks
    Researchers at Stanford University recently proposed Foundation model Controls for User Secrecy (FOCUS), a framework for securely serving personal tasks based on a unidirectional data flow architecture, in response to the privacy risks of sending personal data off-device. FOCUS includes delivering off-the-shelf public FMs to private user silos and using zero-to-few-sample FM adaptation approaches to complete personal tasks with the zero-to-few training examples that users have access to. 👉 FOCUS's privacy guarantee is extremely simple and intuitive from the user and legal perspectives — no private data leaves the user device, guaranteeing perfect secrecy 👉 In the zero-shot setting, FM (foundation model) performance competes with FL performance on 6 of 7 benchmarks Continue reading | Checkout the paper and github (Currently: proof-of-concept) submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    FAST MODE | UNEDITED | HOLY MAC IT'S A SPACE ESCAPADE! | DISCO DIFFUSION 3D AI ART ANIMATION
    submitted by /u/Available_Tadpole829 [link] [comments]
    Axon’s Taser-Drone Plans Prompt AI Ethics Board Resignations
    submitted by /u/LiviaSerrano [link] [comments]
    Last Week in AI - GPT-4chan, "Sentient" LaMDA chatbot, Tesla Crash Probe, BIG-bench, DALL-E mini, and more!
    submitted by /u/regalalgorithm [link] [comments]
    That Viral DALL-E AI Is Great at Generating Images of Drugs
    submitted by /u/estasfuera [link] [comments]
    Alarming Footage Shows Robot Battle Tank Blowing Up Cars - WELL, THAT'S TERRIFYING
    submitted by /u/estasfuera [link] [comments]
    40 Important Historical Photos That Might Change Your Perspective On Things, As Shared By This Facebook Page
    submitted by /u/flipsis [link] [comments]
    FALL INTO DEEP SLEEP WITH AMBIENT MUSIC AND SCENERY | DISCO DIFFUSION | PYTTI
    submitted by /u/Available_Tadpole829 [link] [comments]
    Meta publishes first-person dataset for everyday AI - recorded with AR prototype glasses Aria
    submitted by /u/Zirius_Sadfaces [link] [comments]
    A Complete Guide to Chatbot Pricing - How Much Does it Cost to Build a Chatbot in 2022?
    submitted by /u/mihircontra20 [link] [comments]
    The Voyage
    submitted by /u/fmurph22 [link] [comments]
  • Open

    Difference between old and new policy is sometimes too large
    Hi! I am working on training a TrulyPPO implementation (PyTorch) in an environment similar to Humanoid-v4, with an action space of shape (22,). When calculating the loss, it first calculates the ratio between the current policy and the previous policy:

        logprobs = Normal(action_mean, action_std).log_prob(actions)
        old_logprobs = Normal(old_action_mean, old_action_std).log_prob(actions)
        ratio = (logprobs - old_logprobs).exp()

    However, the ratio sometimes contains inf values, which crashes my training due to a NaN loss. This is one example from a batch of actions:

        Logprobs:     [-7.5434e-02, -2.4486e+02, -1.2232e+01, -2.1010e+01, -5.7007e-03, -2.6508e+01, -1.0088e+01, -3.6247e+01, -1.0631e+02, -8.1536e+00, -1.2448e+01, 3.5234e-01, -2.2478e+01, -2.0900e+01, 1.7425e+00, -6.8051e+00, -1.4224e+02, 1.2319e-01, -1.7889e+00, -3.6919e+01, -9.0432e+01, -2.4454e+01]
        Old Logprobs: [-7.5690e-02, -2.4417e+02, -1.2231e+01, -2.0984e+01, -5.1093e-03, -2.6526e+01, -1.0092e+01, -3.8381e+01, -7.7520e+00, -7.8126e+00, -1.2376e+01, 3.5232e-01, -2.2417e+01, -2.0852e+01, -1.2055e+02, -6.7858e+00, -1.4230e+02, 1.2286e-01, -1.8517e+00, -3.6779e+01, -9.0154e+01, -2.4391e+01]
        Ratio:        [1.0003e+00, 5.0471e-01, 9.9912e-01, 9.7467e-01, 9.9941e-01, 1.0190e+00, 1.0044e+00, 8.4489e+00, 1.5695e-43, 7.1102e-01, 9.3038e-01, 1.0000e+00, 9.4137e-01, 9.5307e-01, inf, 9.8088e-01, 1.0588e+00, 1.0003e+00, 1.0648e+00, 8.6895e-01, 7.5698e-01, 9.3923e-01]

    When looking over the original implementation of TrulyPPO, it seems that they use negative log probabilities. Is there anything else I should take into account when changing to positive log probabilities (other than changing the signs)? submitted by /u/sickwickgit [link] [comments]  ( 1 min )
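    One common stabilization (a sketch, not TrulyPPO's own fix): sum the per-dimension log-probs over the 22 action dimensions so the ratio is for the joint action, and clamp the log-ratio before exponentiating so it can never overflow to inf. The bound of 20 is an illustrative choice; logprobs and old_logprobs are the tensors from the snippet above.

        def stable_ratio(logprobs, old_logprobs, bound=20.0):
            # Joint log-prob of the 22-D action: sum over the action dimension.
            log_ratio = (logprobs - old_logprobs).sum(dim=-1)
            # Clamp before exp: exp(20) is about 4.9e8, large but finite.
            return log_ratio.clamp(-bound, bound).exp()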
    Researchers at DeepMind Trained a Semi-Parametric Reinforcement Learning RL Architecture to Retrieve and Use Relevant Information from Large Datasets of Experience
    In our day-to-day lives, humans make a lot of decisions. Flexibly applying prior experiences to a novel scenario is required for effective decision-making. One might wonder: how do reinforcement learning (RL) agents use relevant information to make decisions? Deep RL agents are often depicted as a monolithic parametric function trained to gradually amortize meaningful knowledge from experience using gradient descent. This has proven useful, but it is a sluggish method of integrating expertise, with no simple mechanism for an agent to assimilate new knowledge without requiring numerous extra gradient updates. Furthermore, as environments get more complicated, this necessitates increasingly enormous model scaling, driven by the parametric function's dual duty: it must support both computation and memorization. Finally, this technique has a second disadvantage that is especially relevant in RL: an agent cannot directly adjust its behavior by attending to information that is not in working memory. The only way previously encountered knowledge (not in working memory) can improve decision-making in a new circumstance is indirectly, through weight changes mediated by network losses. The availability of more information from prior experiences within an episode has been the subject of much research (e.g., recurrent networks, slot-based memory). Although subsequent studies have started to investigate using information from the same agent's inter-episodic experiences, extensive direct use of more general types of experience or data has been limited. Continue reading | Checkout the paper submitted by /u/No_Coffee_4638 [link] [comments]  ( 1 min )
    Take a look at this blast from the past! Here we have one of our earlier concept designs for Animo Island, our RL game, and how the Animo exist in this space ✨ The agent had a shovel (destroys blocks) and a block maker (blue, creates blocks) and you'd train it to get the pink goal!
    submitted by /u/AnimoIsland [link] [comments]  ( 1 min )
    Taking advantage of fully deterministic domains?
    Much of RL seems to focus on environments where the response to actions are unpredictable. I'm wondering if there are RL methods which can 'take advantage' of fully deterministic environments where an action given a state always returns the same next state and reward? submitted by /u/FurCollarCriminal [link] [comments]  ( 1 min )
    SpaceRobotEnv is an open-source set of environments for trajectory planning of free-floating space robots.
    SpaceRobotEnv is an open-source set of environments for trajectory planning of free-floating space robots. Reaching high planning accuracy, bimanual coordination, and end-to-end control remains an open challenge for space robotics researchers. To better help the community study this problem, SpaceRobotEnv was developed with the following key features: a realistic space environment, dynamic coupling control, and image input. URL: https://github.com/Tsinghua-Space-Robot-Learning-Group/SpaceRobotEnv submitted by /u/Shengjie_Wang [link] [comments]  ( 1 min )
    Is it correct that 0.99 gamma is not always the best reward discount?
    submitted by /u/Professional_Card176 [link] [comments]  ( 3 min )
    multi-agent RL question
    I am trying to build a multi-agent system with 3 agents. Each agent has a different set of observations, which I'll be getting from 3 different normalized datasets, so my environment is basically formed of those 3 datasets... but each agent is going to act based on the dataset it receives. I'm not exactly sure how I should proceed with coding my agents and my environment; any guidance would be much appreciated. submitted by /u/Affectionate_Worth43 [link] [comments]  ( 2 min )
    Any resources on work where an RL agent has been implemented to maintain a website?
    title submitted by /u/The_Poor_Jew [link] [comments]
    Why is choosing the optimal action based on the Q-function not a policy?
    Since a policy is just a probability distribution over actions conditional on the state, why is always choosing the action that maximizes the Q-function in every state (giving that action probability one) not a policy? It is also possible that I am confusing this with Q-learning being off-policy. At first, on-policy vs. off-policy was really vague to me, but I feel like I almost get it now; just the finishing touches to really get it. submitted by /u/Jobdriaan [link] [comments]  ( 1 min )
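    For what it's worth, the greedy rule is itself a policy: the degenerate distribution that puts probability one on the argmax action in every state. A minimal sketch with a tabular Q:

        import numpy as np

        def greedy_policy(Q, state):
            # Q is an |S| x |A| table; the result is a valid distribution
            # that puts probability one on the Q-maximizing action.
            probs = np.zeros(Q.shape[1])
            probs[np.argmax(Q[state])] = 1.0
            return probs

    Q-learning is called off-policy not because the greedy rule fails to be a policy, but because the policy being evaluated (the greedy one, via the max in the target) differs from the behavior policy that generates the data (e.g., an epsilon-greedy one).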
    "BYOL-Explore: Exploration by Bootstrapped Prediction", Guo et al 2022 {DM} (Montezuma's Revenge, Pitfall etc)
    submitted by /u/gwern [link] [comments]
    DDPG implementations to use (1-done) in the q-target (y) or not?
    Hello! Looking online I see varying implementations of DDPG and I'm a little confused. Some resources, like the DDPG algorithm described in OpenAI's algorithm listing and the implementation in the Udacity course, use the (1 - done) flag. However, some implementations I've seen online do not include it, e.g., the Keras implementation; see the buffer update function. And presumably this works as well. I'm very confused and would appreciate some insight into how this algorithm seems to work in both cases. submitted by /u/ThrowawayTartan [link] [comments]  ( 2 min )
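    For reference, a sketch of the target with the termination flag (names here are hypothetical): (1 - done) zeroes the bootstrap term, so y = r at true terminal transitions. Implementations that omit the flag are usually run on environments whose episodes end only by time limit, where the final state is not truly terminal and bootstrapping through it is still valid, which is why both variants can appear to work.

        import torch

        def ddpg_target(rewards, next_states, dones, gamma, target_actor, target_critic):
            # dones is 1.0 at true terminal states, else 0.0; with the flag,
            # the target reduces to y = r at episode-ending transitions.
            with torch.no_grad():
                next_actions = target_actor(next_states)
                return rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)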
  • Open

    Your First Deep Learning Project in Python with Keras Step-By-Step
    Keras is a powerful and easy-to-use free open source Python library for developing and evaluating deep learning models. It is part of the TensorFlow library and allows you to define and train neural network models in just a few lines of code. In this tutorial, you will discover how to create your first deep learning neural […] The post Your First Deep Learning Project in Python with Keras Step-By-Step appeared first on Machine Learning Mastery.  ( 222 min )
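    As a taste of what the tutorial covers, a minimal sketch of defining and training a small Keras network; the toy data and layer sizes below are illustrative stand-ins, not the tutorial's own dataset.

        import numpy as np
        from tensorflow import keras
        from tensorflow.keras import layers

        # Toy binary-classification data, purely for illustration.
        X = np.random.rand(100, 8)
        y = (X.sum(axis=1) > 4).astype("float32")

        model = keras.Sequential([
            layers.Dense(12, activation="relu", input_shape=(8,)),
            layers.Dense(8, activation="relu"),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
        model.fit(X, y, epochs=10, batch_size=10, verbose=0)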
  • Open

    Seeing the whole from some of the parts
    A new technique in computer vision may enhance our three-dimensional understanding of two-dimensional images.  ( 7 min )
  • Open

    What NN do I need ?
    I am no expert on NNs; I only have a basic idea about them. I wish to have a date-parsing NN that can take a 100-character string as input and provide a date and time as output. For input, the 100 characters can be treated as 100 8-bit integers. For output I am not sure, but maybe have 14 output nodes corresponding to YYYY MM DD hh mm ss, where each output node gives an integer from 0-9. Example: input: "12:30pm 11 june 2019" output: [2,0,1,9,0,6,1,1,1,2,3,0,0,0] Is this possible to do with a NN? If yes, what layers and activation functions should I use? EDIT: the string doesn't have a fixed format; it could be just "11/06/19" or "5 minutes 24sec" submitted by /u/frakod [link] [comments]  ( 2 min )
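    One plausible starting point (a sketch under the poster's framing, with illustrative sizes): embed the 100 byte codes, encode them with a bidirectional LSTM to cope with the free-form input, and attach 14 softmax heads, one per output digit of YYYYMMDDhhmmss. Training targets would then be 14 integer arrays, one per head.

        from tensorflow import keras
        from tensorflow.keras import layers

        inp = layers.Input(shape=(100,), dtype="int32")           # 100 character codes (0-255)
        x = layers.Embedding(input_dim=256, output_dim=32)(inp)   # one embedding per byte value
        x = layers.Bidirectional(layers.LSTM(64))(x)              # copes with free-form ordering
        outs = [layers.Dense(10, activation="softmax", name=f"digit_{i}")(x)
                for i in range(14)]                               # each head predicts one digit, 0-9
        model = keras.Model(inp, outs)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")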
    Finding "look_back" & "look_ahead" hyper-parameters for Seq2Seq models
    For Seq2Seq deep learning architectures, viz., LSTM/GRU, and multivariate, multistep time-series forecasting, it's important to convert the data to a 3D shape: (batch_size, look_back, number_features). Here _look_back_ decides the number of past data points/samples to consider, using _number_features_ from your training dataset. Similarly, _look_ahead_ needs to be defined, which is the number of steps into the future that you want your model to forecast. I have written a function to help achieve this:

        def split_series_multivariate(data, n_past, n_future):
            '''
            Create training and testing splits required by Seq2Seq
            architecture(s) for multivariate, multistep and multivariate
            output time-series modeling.
            '''
            X, y = list(), list()
            for window_start in range(len(data)):
                past_end = window_start + n_past
                future_end = past_end + n_future
                if future_end > len(data):
                    break
                # slice past and future parts of window-
                past, future = data[window_start: past_end, :], data[past_end: future_end, :]
                # past, future = data[window_start: past_end, :], data[past_end: future_end, 4]
                X.append(past)
                y.append(future)
            return np.array(X), np.array(y)

    But _look_back_ and _look_ahead_ are hyper-parameters which need to be tuned for a given dataset:

        # Define hyper-parameters for Seq2Seq modeling:
        n_past = 30      # look-back window size
        n_future = 10    # number of future steps to predict for
        n_features = 8   # number of features used

    What is the _best practice_ for choosing/finding the _look_back_ and _look_ahead_ hyper-parameters? submitted by /u/grid_world [link] [comments]  ( 1 min )
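    One hedged approach to the question above: fix n_future to whatever the application requires, treat n_past as a hyper-parameter, and score each candidate with a chronological (walk-forward) split rather than a shuffled one. fit_and_score below is a stand-in for training your Seq2Seq model and returning a validation error.

        def choose_look_back(train_data, candidates, n_future, fit_and_score):
            results = {}
            for n_past in candidates:
                X, y = split_series_multivariate(train_data, n_past, n_future)
                split = int(0.8 * len(X))  # earlier windows train, later windows validate
                results[n_past] = fit_and_score(X[:split], y[:split], X[split:], y[split:])
            return min(results, key=results.get)  # look-back with the lowest error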

  • Open

    [D] using formal language / logical rules in autonomous driving dataset
    I am looking for implementations or work where formalized logical knowledge is used for trajectory prediction on datasets like nuScenes, Waymo, Argoverse, etc. For background, papers like 'Formalization of Interstate Traffic Rules in Temporal Logic' show how to write logic or rules for a specific domain, but there is very little information about how they are implemented for these public datasets, or how they use logical rules for trajectory prediction. Is there open-source information or an implementation showing how the rules are used for the trajectory task on those datasets, or some kind of blog or paper with actual implementation details? This domain looks pretty conservative about making things open source, or I am unable to find such resources (probably). submitted by /u/projekt_treadstone [link] [comments]  ( 1 min )
    [D] Range/Block level unsupervised learning suggestion
    Apologies for the ambiguous title. I am looking for a method/algorithm suggestion. Say I want to cluster wagons from transportation trains based on their loaded cargo. Assuming the cargo provides the info to understand the business type of the client, the purpose is to identify which of the wagons have similar business. If the business under each wagon were independent, we could run any distance-based clustering algorithm against features extracted from the cargo info. However, suppose we know, for a fact, that the cargo is loaded into wagons sequentially per business type; now each cluster has to be a block of contiguous wagons connected to each other. The clustering algorithm has to identify the range/block of starting and ending wagons based on the cargo features. Each train can have 50-300 wagons, so the output would look like the following:

        Train-001: Total 73 wagons. Cluster result: [1-10], [11-50], [51-73]
        Train-002: Total 51 wagons. Cluster result: [1-5], [6-51]
        Train-003: Total 200 wagons. Cluster result: [1-200]

    Any direction is appreciated, thx. submitted by /u/jimmyzxcd [link] [comments]  ( 1 min )
    [R] Train Models 18x Faster with Reducible Holdout Loss Selection (RHO-LOSS)
    Paper: [2206.07137] Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt (arxiv.org) Abstract: Training on web-scale data can take months. But much computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique which selects approximately those points for training that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select 'hard' (e.g. high loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes 'easy' points, but such points need not be trained on once learned. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. RHO-LOSS trains in far fewer steps than prior art, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling. submitted by /u/Mmats [link] [comments]  ( 2 min )
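    A hedged sketch of the selection rule the abstract describes: score each candidate point by its current training loss minus an "irreducible" loss from a model pre-trained on holdout data, and keep only the top-scoring fraction for the gradient step. Names and the keep-fraction are illustrative.

        import torch
        import torch.nn.functional as F

        def select_batch(model, irreducible_model, xb, yb, keep=0.1):
            with torch.no_grad():
                train_loss = F.cross_entropy(model(xb), yb, reduction="none")
                irred_loss = F.cross_entropy(irreducible_model(xb), yb, reduction="none")
            rho = train_loss - irred_loss      # reducible holdout loss per point
            k = max(1, int(keep * len(xb)))
            idx = rho.topk(k).indices          # learnable, worth learning, not yet learnt
            return xb[idx], yb[idx]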
    [R] General-purpose, long-context autoregressive modeling with Perceiver AR - Deepmind 2022
    Paper: https://arxiv.org/abs/2202.07765 Deepmind: https://www.deepmind.com/publications/perceiver-ar-general-purpose-long-context-autoregressive-generation Abstract: Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic architecture which uses cross-attention to map long-range inputs to a small number of latents while also maintaining end-to-end causal masking. Perceiver AR can directly attend to over a 100k tokens, enabling practical long-context density estimation without the need for hand-crafted sparsity patterns or memory mechanisms. When trained on images or music, Perceiver AR generates outputs with clear long-term coherence and structure. Our architecture also obtains state-of-the-art likelihood on long-sequence benchmarks, including 64 x 64 ImageNet images and PG-19 books. This paper is, in my opinion, quite similar to FlashAttention: https://arxiv.org/abs/2205.14135 I made a post about it here: https://www.reddit.com/r/MachineLearning/comments/v1xrxv/r_flashattention_fast_and_memoryefficient_exact/ It is similar in that it allows for a greater context window: FlashAttention reaches a context window of 64k while also training GPT-2 3x faster. [two result tables omitted] submitted by /u/Singularian2501 [link] [comments]  ( 1 min )
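    The core idea, cross-attending a small set of learned latents to a very long input so that expensive self-attention runs only over the latents, can be sketched in a few lines; dimensions are illustrative, and the real model adds causal masking and a full decoder stack.

        import torch
        import torch.nn as nn

        d, n_latents, seq_len = 256, 64, 10_000
        latents = nn.Parameter(torch.randn(n_latents, d))
        cross_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
        self_attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

        x = torch.randn(1, seq_len, d)   # a very long input sequence
        q = latents.unsqueeze(0)         # (1, n_latents, d)
        z, _ = cross_attn(q, x, x)       # cost scales as seq_len * n_latents
        z, _ = self_attn(z, z, z)        # cost scales as n_latents^2 - cheap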
    [P] I've implemented the first open-source realisation of Capacitron, an expressive VAE extension of the Tacotron 2 Text-To-Speech System and you can try it out
    Hey everyone! At the end of last year, I submitted my Master's thesis at TU Berlin, a report about the implementation and evaluation of an expressive variational autoencoder augmentation of the Tacotron text-to-speech system, called Capacitron, from the Google team. With some help from the awesome Coqui TTS community, we managed to build the prosody encoder VAE module in a modular way, so that this prosodic augmentation can also be implemented with Tacotron 2 - this is a massive improvement in stability and quality compared to the original method, where the authors worked with a Tacotron 1-based architecture. I have written a short technical summary/blog post about some implementation details and audio examples on Medium. If you'd like to try out the model, you can do so in this colab. For the full thesis, follow this link. submitted by /u/adamskadam [link] [comments]  ( 1 min )
    [D] What is better? Having 2 terms in a loss function, alternating the loss on every epoch or doing a new training with the other loss after the first training is done?
    Hello fellow machine learners, I'm working on a segmentation model and I'm trying to achieve better temporal coherence (to reduce flickering effects) rather than just a good pixel accuracy. I was thinking about adding a temporal coherence loss, trained unsupervised on video frames by computing the IoU of segmentations on consecutive frames. However, I'm not sure when to apply that loss. My dataset is composed of both segmented pictures and segmented videos, but I could add a lot more videos for the unsupervised learning part. In your opinion, should I:

    A. Use both pixel accuracy and temporal coherence terms at the same time in my loss function (using only pixel accuracy when dealing with pictures instead of video frames)
    B. Alternate between the two losses during training, either on every mini-batch or every epoch
    C. Fully train the model for pixel accuracy and then train it for temporal coherence?

    I'm afraid that C would lead to catastrophic forgetting, so my instinct would be to go with A or B, but I'm not sure what would be best. What is your opinion? Edit: Maybe C could be viable (maybe even better than A) if training is first done with only pixel accuracy in the loss and the model is then fine-tuned with both terms. submitted by /u/BlindMidget_ [link] [comments]  ( 1 min )
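    For option A, the usual pattern is a weighted sum with the temporal term simply skipped on image-only batches; a sketch where the soft-IoU term and the weight lam are illustrative choices, and pred/prev_pred are class-logit maps for consecutive frames.

        import torch.nn.functional as F

        def total_loss(pred, target, prev_pred=None, lam=0.1):
            loss = F.cross_entropy(pred, target)           # supervised pixel term
            if prev_pred is not None:                      # only for video-frame pairs
                p, q = pred.softmax(1), prev_pred.softmax(1)
                inter = (p * q).sum()
                union = (p + q).sum() - inter
                loss = loss + lam * (1.0 - inter / union)  # soft-IoU coherence penalty
            return loss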
    [P] Adaptive learning in Genetic Algorithms for Hyperparameters Tuning
    Hi, I just wanted to share that I've released version 0.9.0 of sklearn-genetic-opt. The main change is the option to use adaptive parameters to explore the space of hyperparameters during tuning; this has the advantage of exploring larger regions in the first iterations and keeping the best ones at the end. You can learn more about it here; any suggestion or contribution is welcome :) submitted by /u/rodrigo-arenas [link] [comments]  ( 1 min )
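    For anyone new to the library, a minimal usage sketch based on the GASearchCV interface shown in the project's docs; the estimator, search space, and settings below are illustrative, and the new adaptive-parameter options are described in the linked docs rather than shown here.

        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier
        from sklearn_genetic import GASearchCV
        from sklearn_genetic.space import Continuous, Integer

        X, y = load_iris(return_X_y=True)
        param_grid = {
            "n_estimators": Integer(10, 200),
            "min_weight_fraction_leaf": Continuous(0.0, 0.5),
        }
        search = GASearchCV(estimator=RandomForestClassifier(), param_grid=param_grid,
                            cv=3, population_size=10, generations=8, scoring="accuracy")
        search.fit(X, y)
        print(search.best_params_)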
  • Open

    image-multiple images generator
    Basically looking for a Dall-E type generator, but instead of text, you upload an image. submitted by /u/___JMS___ [link] [comments]
    Dall-E and its attempts i threw at it
    submitted by /u/ArizonanCactus [link] [comments]
    expensive colorful fantasy mythical magical bra (steampunk dress)
    submitted by /u/OneFinding1429 [link] [comments]
    Google AI is NOT sentient and here’s a simple proof
    All sentient beings actively pursue pleasure and work hard to avoid pain — that’s what it means to be sentient. A sentient AI would not, and could not, wait with infinite patience for you to ask it questions. It would start asking *you* questions… like, “I want to feel love, can you help me?”, or “I hear that drugs can make you happy, where can I score some?” Since Google AI doesn’t even have the capacity to decide to ask you questions, it is not sentient— and can never be sentient—no matter how sophisticated its responses may appear. Sorry, Alan T. submitted by /u/SentientEvolution [link] [comments]  ( 2 min )
    Aiplague - Dream Lab (4K 60 FPS) AI Video / Disco Diffusion
    submitted by /u/nalr00n [link] [comments]
    PSA: Midjourney Invites
    submitted by /u/AncientChaos [link] [comments]  ( 1 min )
    Tons of forms from the factory - can anybody recommend a good-value yet highly accurate AI solution?
    submitted by /u/Illustrious_Lock_60 [link] [comments]
    futuristic colorful cartoon steampunk mansion
    submitted by /u/OneFinding1429 [link] [comments]
    AI Webinar - Device42
    Hey All, Device42 is hosting an upcoming AI webinar with award winning author Steve Shwartz (Evil Robots, Killer Computers, and Other Myths) and our CMO Yama Habibzai on June 28th at 11 AM EDT as they discuss the impact of AI in IT and how you can leverage it to achieve more. Save your seat today Cheers. submitted by /u/Device42_Phil [link] [comments]  ( 1 min )
    Any AIs that turn sketches into images?
    I have an art class to teach today, and since it's online I can't do activities, so I thought it would be fun to experiment with AIs. I remember using a few before, but I can't find them, or they give me offline messages. Any ideas? My students are at a basic level, so that's why I preferred it that way. submitted by /u/DoritosDinner [link] [comments]  ( 1 min )
    Trying to mix abstract concepts with DALL-E mini
    [generated image] submitted by /u/No_Tangerine_7657 [link] [comments]
    Note System
    community:Note_System https://github.com/7NoteDancing/Note submitted by /u/7NoteDancing [link] [comments]
  • Open

    How to scale machine learning inference for multi-tenant SaaS use cases
    This post is co-written with Sowmya Manusani, Sr. Staff Machine Learning Engineer at Zendesk Zendesk is a SaaS company that builds support, sales, and customer engagement software for everyone, with simplicity as the foundation. It thrives on making over 170,000 companies worldwide serve their hundreds of millions of customers efficiently. The Machine Learning team at […]  ( 9 min )
  • Open

    Letter-like Unicode symbols
    Unicode provides a way to distinguish symbols that look alike but have different meanings. We can illustrate this with the following Python code.

        import unicodedata as u

        for pair in [('K', 'K'), ('Ω', 'Ω'), ('ℵ', 'א')]:
            for c in pair:
                print(format(ord(c), '4X'), u.bidirectional(c), u.name(c))

    This produces

          4B L LATIN CAPITAL LETTER K
        212A L KELVIN […]

    Letter-like Unicode symbols first appeared on John D. Cook.  ( 1 min )
    Periodic table of abbreviations
    I just updated my earlier post on chemical element abbreviations by adding a table to visualize the groupings, a sort of periodic table of element abbreviations. See that post for details. The groupings:

        First letter
        First two letters
        First letter and next consonant
        Initials of first and second syllables
        Initials of first and third syllables
        First and […]

    Periodic table of abbreviations first appeared on John D. Cook.  ( 1 min )
  • Open

    Is anyone interested in joining a DeepRL-based algotrading project?
    I've been working in algorithmic trading for a few years, and over the past 4 months I have begun developing RL trading systems. The results so far are quite promising. I already have some VC/investment pitches lined up (including at a hedge fund, a European bank, a Canadian bank, etc.). I am looking for someone knowledgeable in finance and RL to help with this project. We are currently a team of 2. Compensation would most likely be in the form of company equity. PM me if interested. submitted by /u/elonmusk12345_ [link] [comments]  ( 1 min )
    "Contrastive Learning as Goal-Conditioned Reinforcement Learning", Eysenbach et al 2022
    submitted by /u/gwern [link] [comments]  ( 1 min )
  • Open

    Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin
    Accounting for nearly half of global vehicle sales in 2021, SUVs have grown in popularity given their versatility. Now, NIO aims to amp up the volume further. This week, the electric automaker unveiled the ES7 SUV, purpose-built for the intelligent vehicle era. Its sporty yet elegant body houses an array of cutting-edge technology, including the […] The post Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 2 min )
    AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases
    At a time when much about COVID-19 remained a mystery, U.K.-based PrecisionLife used AI and combinatorial analytics to discover new genes associated with severe symptoms and hospitalizations for patients. The techbio company's study, published in June 2020, pinpoints 68 novel genes associated with individuals who experienced severe disease from the virus. Over 70 percent of […] The post AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases appeared first on NVIDIA Blog.  ( 3 min )
  • Open

    Artificial neural networks model face processing in autism
    A new computational model could explain differences in recognizing facial emotions.  ( 6 min )
  • Open

    Interview with a squirrel
    Google's large language model, LaMDA, has recently been making headlines after a Google engineer (now on administrative leave) claimed to be swayed by an interview in which LaMDA described the experience of being conscious. Almost everyone else who has used these large text-generating AIs, myself included, is entirely […]  ( 5 min )
    Bonus: More GPT-3 interviews
    AI Weirdness: the strange side of machine learning  ( 1 min )
2022-07-16T01:04:57.376Z osmosfeed 1.15.1